A lot gets written about data quality, and very little gets measured. The German-language practitioner’s standard reference alone lists sixty possible quality criteria — from timeliness to reliability —, and even the lean models still arrive at six to fifteen dimensions. Yet the core of the matter is surprisingly tangible: a data error caught during loading costs you a log entry. The same error running through to the report costs a wrong invoice, a bad decision — and the trust in every number that follows. This article makes the underlying framework tangible, without a theory slog: it sorts the common data errors into classes, maps them to the established dimensions of data quality — and uses an honest coverage map to show which of them three generic SQL routines actually cover. And it names the two they cannot reach.
The essentials up front:
- Data errors sort into a handful of classes — technical vs. business, and field-level vs. record-level vs. relationship-level. That is enough to place any concrete error.
- The dimensions of data quality are established canon, not taste — defined criteria in the German-language standard reference by Apel et al., internationally in Wang & Strong, DAMA, and ISO/IEC 25012.
- Generic SQL checks four criteria directly in the database: completeness, validity, uniqueness, and referential integrity/consistency. Two stay out of reach — content accuracy and timeliness; both need information from outside the database.
- Checking early is cheap, repairing late is expensive: prevention at the source cuts total costs by roughly two thirds (Redman, as cited by Apel et al.) — and 100 % data quality is not the economic target anyway.
Prerequisites: none. This article provides the framing before the SQL practice begins; the theory part needs no SQL knowledge. If you want the how right away, head for the framework article Data quality checks with SQL.
Contents
- What data errors look like: a classification
- The dimensions of data quality
- What can SQL actually check?
- The two blind spots
- What bad data costs
- From theory to practice
- FAQ
- Related Articles
What data errors look like: a classification
Before slicing data quality into dimensions, it helps to look at the errors themselves. They can be sorted along two axes, and together the two axes cover practically every case. One thing up front, for honesty’s sake: the distinctions themselves — syntactic versus semantic, field versus record versus relationship — are common in the literature; their arrangement into a two-axis matrix is this article’s working classification, not an established canon. For practice, it is all you need.
The first axis is technical versus business. A technical (syntactic) error violates the form: letters in a numeric column, a date that is no date, a foreign key pointing nowhere. You can spot these errors without any domain knowledge — they break the rules of the data type or the model. A business (semantic) error, by contrast, is formally flawless and still wrong: an age of 200, a ship date before the order date, a customer flagged as “premium” without a single order. Only domain knowledge makes the error visible.
The second axis is the scope of the error:
- field-level — the error sits in a single value (
age = 200,"abc"in a numeric column). - record-level — the error only emerges from the interplay of several fields in one row (
ship_date < order_date) or between rows of one table (two records sharing the same key). - relationship-level — the error lies in the relationship between tables (a reference whose target does not exist).
The two axes combine into a compact taxonomy in which every concrete error gets a place and a responsible check routine:
| Scope | technical / syntactic | business / semantic |
|---|---|---|
| field-level | "abc" in a numeric column, an invalid date → validity | age = 200, a negative price → validity (business rule) |
| record-level | two records with the same primary key → uniqueness | ship_date < order_date → consistency (cross-field) |
| relationship-level | foreign key pointing to a missing master → consistency (integrity) | “premium” customer without an order → business rule |
The cells name the dominant criterion — in practice, one error often violates several at once: a foreign key pointing nowhere is also a completeness problem, and a negative price is invalid or simply untrue, depending on how you read it.
One special case falls outside the grid: the value that is missing. An empty mandatory field is neither technically nor semantically malformed — it simply is not there. A criterion of its own is responsible for it, completeness — it cannot be pinned to any single cell but cuts across the whole matrix. Two more kinds of error are absent from the matrix entirely — the value that is well-formed and still factually wrong, and the value that is outdated. We will come back to both; they are the blind spots.
The dimensions of data quality
The classification says what an error looks like. The dimensions of data quality say which property of good data it violates. There is no shortage of theory here — rather a surplus, and that is why the term so often stays fuzzy.
The German-language practitioner’s standard reference on data quality in business intelligence projects is Datenqualität erfolgreich steuern (“Managing Data Quality Successfully”) by Apel, Behme, Eberlein, and Merighi (3rd edition, Edition TDWI). Following Würthele, it defines data quality as a “multidimensional measure of the suitability of data to fulfill the purpose tied to its capture/generation” — a suitability that can change over time as needs change (all book quotes translated from the German original). Two things are already baked into this definition: quality is multidimensional, and it is purpose-bound. There is no absolute “good”, only a “good enough for this purpose”.
How many dimensions there are is a matter of definition. The book starts with an alphabetical catalog of sixty possible quality criteria and narrows it down to the practically viable ones; for the business intelligence context it highlights six — accuracy, consistency, reliability, completeness, timeliness, and relevance. For this article, the criteria that matter are the ones data errors can technically be pinned to:
| Criterion | What it demands (per Apel et al.) |
|---|---|
| Completeness | The attributes are populated with values that “semantically differ from the value NULL (unknown)”; no data gets lost in transformations. |
| Validity (in the book: formal accuracy + uniformity) | The values arrive in the predefined format and are represented uniformly — in practice, the allowed value range belongs here too. |
| Uniqueness (in the book: freedom from redundancy + key uniqueness) | No record describes the same real-world entity twice; the business key occurs only as often as it may. |
| Referential integrity / consistency | Every foreign key uniquely references an existing primary key; values do not contradict each other — within a record, between records, across applications. |
| Accuracy (content) | The values match the entities of the real world — the data corresponds to reality. |
| Timeliness | The records reflect the current state of the modeled world and are not outdated. |
Two names in this table each bundle two of the book’s criteria — the book slices finer than international usage does. Validity does appear in the catalog, but the actual definition lives in the formal component of accuracy (delivery in the predefined data format) and in uniformity. And what DAMA and ISO call uniqueness, the book lists as freedom from redundancy and key uniqueness; its own criterion named “uniqueness” means something else there, namely the unambiguous interpretability of a record through its metadata. Key uniqueness the book phrases in terms of primary keys — yet the check is only meaningful on the business key, because a constraint-enforced primary key cannot occur twice in the first place. The check gets interesting exactly where the constraint is (still) missing: in staging tables and at interfaces. This article sticks to the common names and means the book criteria listed above.
That several such lists exist is not a contradiction but a method: group the criteria and you get a quality model. The book presents two side by side — a theoretical taxonomy after Hinrichs, and the categorization of the German Society for Information and Data Quality (DGIQ), derived from a user survey and in turn based on the much-cited study by Wang and Strong (1996) with its fifteen dimensions. Internationally, the six dimensions of DAMA UK (2013) and the ISO/IEC 25012 standard are widely used as well. They are different cuts through the same subject — which one fits depends on the purpose. For the rest of this article we stay with the book’s criteria, because they point most directly at an SQL check.
What can SQL actually check?
Now for the decisive question: how many of these criteria does a handful of generic SQL checks actually reach? The practical framework of the SQL check series builds on three routines — a value-based WHERE check, a duplicate check, and a foreign-key check. Plot them against the criteria and you get this coverage map:
| Criterion | verifiable with SQL? | Routine |
|---|---|---|
| Completeness | ✅ yes | value check (IS NULL on mandatory fields) |
| Validity | ✅ yes | value check (value range, format, safe type conversion) |
| Uniqueness | ✅ yes | duplicate check (GROUP BY … HAVING count(*) > 1) |
| Referential integrity / consistency | ✅ partly | foreign-key check; cross-field rules in the WHERE |
| Accuracy | ❌ no | — |
| Timeliness | ❌ no | — |
Four criteria, then, are within reach of generic, configurable SQL — and with surprisingly little code. “Verifiable” means: SQL executes the rule; the domain has to formulate it — a price > 0 is written into no engine by default. And “rule” means deterministic — true or false per record. Statistical methods such as distribution analysis, outlier and anomaly detection, or profiling are a toolbox of their own and deliberately stay outside this series. Completeness is an IS NULL check on the mandatory columns — in the book’s wording, “the individual attributes contain no NULL values”. Validity is a value-range or format check; its classic case is safe type conversion, which — for example with TRY_CONVERT in SQL Server — turns an invalid value into NULL instead of aborting the load (more in the type conversion basics). Uniqueness is a grouping over the business key with HAVING count(*) > 1. Referential integrity — in the book’s wording, “every foreign key must uniquely reference an existing primary key” — is a foreign-key check.
The “partly” for consistency deserves honesty — and a distinction: referential integrity is only the relational special case of consistency; the book explicitly calls its key criteria a “special alignment toward the relational database model”. Consistency itself reaches further: the book explicitly includes the reconciliation of data across different applications — that a value does not contradict itself across systems. A single SQL check in one database cannot deliver that cross-system reconciliation; the variants checkable inside the database (referential integrity, cross-field rules within a row) it covers cleanly. The boundary, however, is set by the process, not the technology: bring the data needed for the reconciliation — the master from the other system, the reference table of the other application — into the ETL process, and they sit side by side in one database; the cross-system reconciliation becomes an ordinary SQL check. The foreign-key check then runs against a master that originally came from an entirely different system.
The book draws this line itself, by the way — just elsewhere: in its example metrics per criterion (chapter 7, table 7–3), completeness is an automated “number of NULL values” query, while content accuracy and timeliness list “user feedback” as the measuring method. Where no query can reach, a human has to answer.
The two blind spots
That leaves the two criteria where SQL alone has to pass — not because of a weakness of the language, but because the required information simply is not in the database. Both sit in the book’s catalog right next to the checkable ones.
The first blind spot is timeliness. Whether an address is “current” cannot be read off the address itself — it looks identical yesterday and today. Timeliness demands that the data reflect “the current state of the modeled world”, as the book puts it — and that takes a temporal reference: a timestamp of when the value was last confirmed, and an expectation of how long it stays valid. If such a last_verified field exists, SQL can check the rule (last_verified < now() - interval '1 year') — but then it checks the completeness and validity of that timestamp, not the currency of the actual value. Without the timestamp, the criterion is invisible to a pure database check.
The second blind spot is accuracy — and it is the subtlest point of this whole article, because the book defines it as having “a content component and a formal component”. The formal side — the right data type, the predefined format — is exactly validity, and thus checkable. The content side is not: a birth date of 1990-05-14 is perfectly valid — right type, plausible range, clean format. Whether it is the person’s actual birth date, the database does not know and cannot know. That would take an external source of truth: the ID document, a population register, a second independent record. SQL compares data against rules, not against the world.
This makes the boundary more precise: accuracy is unverifiable as long as the source of truth lies outside the database. As soon as it becomes available as data — say, a leading system that is co-extracted in the ETL process, the same pattern as with the cross-system consistency reconciliation —, the check becomes an ordinary SQL comparison: source against leading system, deviation equals finding. The typical case is data migration, where the target system already knows many of the records; how to build such reconciliations systematically is shown in the article on verifying a migration. What is checked then, however, is agreement with the designated source of truth — whether that source itself agrees with the world remains a governance decision, not an SQL question.
This boundary is not an excuse but the honest core of the matter: an SQL check guarantees that data is well-formed and internally coherent — not that it is true. Whoever says “automatically checked” has to name the two criteria left out — otherwise they are selling a green checkmark as proof of truth.
What bad data costs
You will find plenty of charts online along the lines of “37 % of all data errors are completeness errors”. No such chart appears here, deliberately — there is no citable primary source for a reliable frequency distribution per dimension, and a made-up number would be exactly the kind of data error this article is about.
What is citable is the economics behind it — and it has two sides. The first concerns when you check. The 1-10-100 rule originally comes from quality management (Labovitz and Chang, Making Quality Work, 1992) and has since been applied to data quality: what costs 1 to prevent at the source costs 10 to correct afterwards and 100 if you let the error take effect — as a wrong invoice, a lost customer, a bad decision. The specific factors are a rule of thumb, not a law of nature; the direction, though, is undisputed and matches everyday experience: an error gets more expensive the later it surfaces. That is exactly why checking at the source (staging) pays off, before the data flows on — there an error is still reportable, at the target it is fatal.
The second side concerns not when you check but how far you should push data quality. In their book Datenqualität erfolgreich steuern (3rd edition, Edition TDWI, figure 3–2, p. 43), Apel, Behme, Eberlein, and Merighi contrast two opposing cost curves: the costs caused by poor data quality fall as quality rises — bad decisions, rework, and lost customers become rarer. The costs of producing and assuring good quality rise instead, and disproportionately so the closer you get to 100 %. The sum of both — the total quality cost — has its minimum not at 100 % but at an optimum in between. That is the core message: 100 % data quality is rarely the economic target; what you are looking for is the most cost-effective combination for the purpose at hand.
Own rendering after Apel et al., “Datenqualität erfolgreich steuern” (3rd edition, Edition TDWI, figure 3–2, p. 43) — the classic total-cost-of-quality model.
The chart is deliberately simplified — a principle, not a measurement; the optimum shifts with data and purpose. Two practical consequences follow:
- Achieve a lot with little effort. The production curve stays flat for a long stretch and only explodes at the top end. The cheap early stretch is the obvious
NULLvalues, broken formats, and duplicates that generic SQL extracts with little effort — exactly the four checkable criteria. The last few percent, and even more so the two uncheckable criteria (accuracy and timeliness), cost far more and are only worth it where the domain demands it. - Prevent early instead of cleaning up late. Prevention at the source is the strongest lever: avoiding errors there, instead of detecting and cleansing them later, cuts total costs by roughly two thirds on average, according to the research cited in the same chapter (Redman 2008).
From theory to practice
The framework is in place: errors sort into a handful of classes, good data describes itself in a handful of criteria, and four of them are checkable with generic SQL. The road from here into practice runs through four articles:
- The framework — Data quality checks with SQL — builds the shared error table, the severity gate, and the configurable runner that generates the three routines at runtime.
- Validity + completeness are deepened in Validate data with SQL — value ranges, mandatory fields, and the distinction between “no value” and “unknown value”.
- Uniqueness is deepened in Find duplicates with SQL — from
count(*) > 1to composite keys. - Consistency / integrity is deepened in Find orphaned records — checking referential integrity, even without a foreign-key constraint.
With the dimensions in mind, these four articles no longer read as loose tricks but as what they are: one answer each to one measurable property of good data.
FAQ
It depends on the model. The German-language practitioner’s standard reference (Apel et al., Datenqualität erfolgreich steuern) lists sixty possible quality criteria and highlights six for the business intelligence context: accuracy, consistency, reliability, completeness, timeliness, and relevance. Internationally, the six dimensions of DAMA UK (2013) and the fifteen characteristics of ISO/IEC 25012 are widespread; the academic origin is the fifteen dimensions of Wang & Strong (1996). They all describe the same subject at different resolutions — which list fits depends on the purpose.
Validity means: the value conforms to the rules — right type, valid format, allowed value range. Accuracy means: the value corresponds to reality. The standard reference bundles both under “accuracy” (a content and a formal component) — but for an SQL check, exactly this separation is decisive: a birth date can be perfectly valid (well-formed, plausible) and still factually wrong. SQL checks the formal side, because the rules live in the database; content agreement with reality needs an external source of truth and cannot be checked with SQL alone. There is one exception: if a designated source of truth is available as data — such as the leading system in a data migration —, the accuracy check becomes an SQL reconciliation against that source.
Four criteria can be checked with generic SQL directly in the database: completeness (IS NULL check on mandatory fields), validity (value range/format), uniqueness (GROUP BY … HAVING count(*) > 1), and referential integrity/consistency (foreign-key check). Not checkable are content accuracy and timeliness — both need information from outside the database (an external reference or a timestamp).
A technical (syntactic) error violates the form and is detectable without domain knowledge: letters in a numeric column, a foreign key pointing nowhere. A business (semantic) error is formally correct and still wrong: an age of 200 or a ship date before the order date. Generic SQL catches technical errors easily; business errors need an explicitly formulated business rule.
Not necessarily for the four checkable criteria — a handful of generic SQL checks with a central error table covers completeness, validity, uniqueness, and referential integrity. Specialized tools (dbt tests, Great Expectations, Soda) and data-observability platforms automate and orchestrate this more comfortably and add reporting, but at their core they check the same criteria. Accuracy and timeliness they do not solve either — not without an external reference or a timestamp.
Related Articles
- Data quality checks with SQL — the configurable SQL framework that casts the four checkable criteria of this article into three generic routines: shared error table, severity gate, dynamic runner.
- Validate data with SQL — the routine for validity and completeness, including the NULL trap.
- Find duplicates with SQL — the routine for uniqueness: maximum cardinality and composite keys.
- Find orphaned records — the routine for consistency / referential integrity.
- Data quality in an ETL process — the bigger picture: where in the ETL process the dimensions are checked and how bad data is isolated.