Data Quality: Dimensions and Error Classes — the Theory Behind the SQL Checks

A lot gets written about data quality, and very little gets measured. The German-language practitioner’s standard reference alone lists sixty possible quality criteria — from timeliness to reliability —, and even the lean models still arrive at six to fifteen dimensions. Yet the core of the matter is surprisingly tangible: a data error caught during loading costs you a log entry. The same error running through to the report costs a wrong invoice, a bad decision — and the trust in every number that follows. This article makes the underlying framework tangible, without a theory slog: it sorts the common data errors into classes, maps them to the established dimensions of data quality — and uses an honest coverage map to show which of them three generic SQL routines actually cover. And it names the two they cannot reach.

The essentials up front:

  • Data errors sort into a handful of classes — technical vs. business, and field-level vs. record-level vs. relationship-level. That is enough to place any concrete error.
  • The dimensions of data quality are established canon, not taste — defined criteria in the German-language standard reference by Apel et al., internationally in Wang & Strong, DAMA, and ISO/IEC 25012.
  • Generic SQL checks four criteria directly in the database: completeness, validity, uniqueness, and referential integrity/consistency. Two stay out of reach — content accuracy and timeliness; both need information from outside the database.
  • Checking early is cheap, repairing late is expensive: prevention at the source cuts total costs by roughly two thirds (Redman, as cited by Apel et al.) — and 100 % data quality is not the economic target anyway.

Prerequisites: none. This article provides the framing before the SQL practice begins; the theory part needs no SQL knowledge. If you want the how right away, head for the framework article Data quality checks with SQL.

Contents

What data errors look like: a classification

Before slicing data quality into dimensions, it helps to look at the errors themselves. They can be sorted along two axes, and together the two axes cover practically every case. One thing up front, for honesty’s sake: the distinctions themselves — syntactic versus semantic, field versus record versus relationship — are common in the literature; their arrangement into a two-axis matrix is this article’s working classification, not an established canon. For practice, it is all you need.

The first axis is technical versus business. A technical (syntactic) error violates the form: letters in a numeric column, a date that is no date, a foreign key pointing nowhere. You can spot these errors without any domain knowledge — they break the rules of the data type or the model. A business (semantic) error, by contrast, is formally flawless and still wrong: an age of 200, a ship date before the order date, a customer flagged as “premium” without a single order. Only domain knowledge makes the error visible.

The second axis is the scope of the error:

  • field-level — the error sits in a single value (age = 200"abc" in a numeric column).
  • record-level — the error only emerges from the interplay of several fields in one row (ship_date < order_date) or between rows of one table (two records sharing the same key).
  • relationship-level — the error lies in the relationship between tables (a reference whose target does not exist).

The two axes combine into a compact taxonomy in which every concrete error gets a place and a responsible check routine:

Error-class taxonomy: a matrix of technical/syntactic versus business/semantic on one axis and field-level, record-level, relationship-level on the other. Each cell names a concrete example and the violated criterion; a band below marks the missing value as the special case completeness, cutting across all cells.

Scopetechnical / syntacticbusiness / semantic
field-level"abc" in a numeric column, an invalid date → validityage = 200, a negative price → validity (business rule)
record-leveltwo records with the same primary key → uniquenessship_date < order_date → consistency (cross-field)
relationship-levelforeign key pointing to a missing master → consistency (integrity)“premium” customer without an order → business rule

The cells name the dominant criterion — in practice, one error often violates several at once: a foreign key pointing nowhere is also a completeness problem, and a negative price is invalid or simply untrue, depending on how you read it.

One special case falls outside the grid: the value that is missing. An empty mandatory field is neither technically nor semantically malformed — it simply is not there. A criterion of its own is responsible for it, completeness — it cannot be pinned to any single cell but cuts across the whole matrix. Two more kinds of error are absent from the matrix entirely — the value that is well-formed and still factually wrong, and the value that is outdated. We will come back to both; they are the blind spots.

The dimensions of data quality

The classification says what an error looks like. The dimensions of data quality say which property of good data it violates. There is no shortage of theory here — rather a surplus, and that is why the term so often stays fuzzy.

The German-language practitioner’s standard reference on data quality in business intelligence projects is Datenqualität erfolgreich steuern (“Managing Data Quality Successfully”) by Apel, Behme, Eberlein, and Merighi (3rd edition, Edition TDWI). Following Würthele, it defines data quality as a “multidimensional measure of the suitability of data to fulfill the purpose tied to its capture/generation” — a suitability that can change over time as needs change (all book quotes translated from the German original). Two things are already baked into this definition: quality is multidimensional, and it is purpose-bound. There is no absolute “good”, only a “good enough for this purpose”.

How many dimensions there are is a matter of definition. The book starts with an alphabetical catalog of sixty possible quality criteria and narrows it down to the practically viable ones; for the business intelligence context it highlights six — accuracy, consistency, reliability, completeness, timeliness, and relevance. For this article, the criteria that matter are the ones data errors can technically be pinned to:

CriterionWhat it demands (per Apel et al.)
CompletenessThe attributes are populated with values that “semantically differ from the value NULL (unknown)”; no data gets lost in transformations.
Validity (in the book: formal accuracy + uniformity)The values arrive in the predefined format and are represented uniformly — in practice, the allowed value range belongs here too.
Uniqueness (in the book: freedom from redundancy + key uniqueness)No record describes the same real-world entity twice; the business key occurs only as often as it may.
Referential integrity / consistencyEvery foreign key uniquely references an existing primary key; values do not contradict each other — within a record, between records, across applications.
Accuracy (content)The values match the entities of the real world — the data corresponds to reality.
TimelinessThe records reflect the current state of the modeled world and are not outdated.

Two names in this table each bundle two of the book’s criteria — the book slices finer than international usage does. Validity does appear in the catalog, but the actual definition lives in the formal component of accuracy (delivery in the predefined data format) and in uniformity. And what DAMA and ISO call uniqueness, the book lists as freedom from redundancy and key uniqueness; its own criterion named “uniqueness” means something else there, namely the unambiguous interpretability of a record through its metadata. Key uniqueness the book phrases in terms of primary keys — yet the check is only meaningful on the business key, because a constraint-enforced primary key cannot occur twice in the first place. The check gets interesting exactly where the constraint is (still) missing: in staging tables and at interfaces. This article sticks to the common names and means the book criteria listed above.

That several such lists exist is not a contradiction but a method: group the criteria and you get a quality model. The book presents two side by side — a theoretical taxonomy after Hinrichs, and the categorization of the German Society for Information and Data Quality (DGIQ), derived from a user survey and in turn based on the much-cited study by Wang and Strong (1996) with its fifteen dimensions. Internationally, the six dimensions of DAMA UK (2013) and the ISO/IEC 25012 standard are widely used as well. They are different cuts through the same subject — which one fits depends on the purpose. For the rest of this article we stay with the book’s criteria, because they point most directly at an SQL check.

What can SQL actually check?

Now for the decisive question: how many of these criteria does a handful of generic SQL checks actually reach? The practical framework of the SQL check series builds on three routines — a value-based WHERE check, a duplicate check, and a foreign-key check. Plot them against the criteria and you get this coverage map:

Coverage map: three SQL routines (value check, duplicate check, foreign-key check) against the data quality criteria per Apel et al. Four criteria — completeness, validity, uniqueness, and referential integrity/consistency — are marked as covered, two — accuracy and timeliness — as gaps.

Criterionverifiable with SQL?Routine
Completeness✅ yesvalue check (IS NULL on mandatory fields)
Validity✅ yesvalue check (value range, format, safe type conversion)
Uniqueness✅ yesduplicate check (GROUP BY … HAVING count(*) > 1)
Referential integrity / consistency✅ partlyforeign-key check; cross-field rules in the WHERE
Accuracy❌ no
Timeliness❌ no

Four criteria, then, are within reach of generic, configurable SQL — and with surprisingly little code. “Verifiable” means: SQL executes the rule; the domain has to formulate it — a price > 0 is written into no engine by default. And “rule” means deterministic — true or false per record. Statistical methods such as distribution analysis, outlier and anomaly detection, or profiling are a toolbox of their own and deliberately stay outside this series. Completeness is an IS NULL check on the mandatory columns — in the book’s wording, “the individual attributes contain no NULL values”. Validity is a value-range or format check; its classic case is safe type conversion, which — for example with TRY_CONVERT in SQL Server — turns an invalid value into NULL instead of aborting the load (more in the type conversion basics). Uniqueness is a grouping over the business key with HAVING count(*) > 1Referential integrity — in the book’s wording, “every foreign key must uniquely reference an existing primary key” — is a foreign-key check.

The “partly” for consistency deserves honesty — and a distinction: referential integrity is only the relational special case of consistency; the book explicitly calls its key criteria a “special alignment toward the relational database model”. Consistency itself reaches further: the book explicitly includes the reconciliation of data across different applications — that a value does not contradict itself across systems. A single SQL check in one database cannot deliver that cross-system reconciliation; the variants checkable inside the database (referential integrity, cross-field rules within a row) it covers cleanly. The boundary, however, is set by the process, not the technology: bring the data needed for the reconciliation — the master from the other system, the reference table of the other application — into the ETL process, and they sit side by side in one database; the cross-system reconciliation becomes an ordinary SQL check. The foreign-key check then runs against a master that originally came from an entirely different system.

The book draws this line itself, by the way — just elsewhere: in its example metrics per criterion (chapter 7, table 7–3), completeness is an automated “number of NULL values” query, while content accuracy and timeliness list “user feedback” as the measuring method. Where no query can reach, a human has to answer.

The two blind spots

That leaves the two criteria where SQL alone has to pass — not because of a weakness of the language, but because the required information simply is not in the database. Both sit in the book’s catalog right next to the checkable ones.

The first blind spot is timeliness. Whether an address is “current” cannot be read off the address itself — it looks identical yesterday and today. Timeliness demands that the data reflect “the current state of the modeled world”, as the book puts it — and that takes a temporal reference: a timestamp of when the value was last confirmed, and an expectation of how long it stays valid. If such a last_verified field exists, SQL can check the rule (last_verified < now() - interval '1 year') — but then it checks the completeness and validity of that timestamp, not the currency of the actual value. Without the timestamp, the criterion is invisible to a pure database check.

The second blind spot is accuracy — and it is the subtlest point of this whole article, because the book defines it as having “a content component and a formal component”. The formal side — the right data type, the predefined format — is exactly validity, and thus checkable. The content side is not: a birth date of 1990-05-14 is perfectly valid — right type, plausible range, clean format. Whether it is the person’s actual birth date, the database does not know and cannot know. That would take an external source of truth: the ID document, a population register, a second independent record. SQL compares data against rules, not against the world.

This makes the boundary more precise: accuracy is unverifiable as long as the source of truth lies outside the database. As soon as it becomes available as data — say, a leading system that is co-extracted in the ETL process, the same pattern as with the cross-system consistency reconciliation —, the check becomes an ordinary SQL comparison: source against leading system, deviation equals finding. The typical case is data migration, where the target system already knows many of the records; how to build such reconciliations systematically is shown in the article on verifying a migration. What is checked then, however, is agreement with the designated source of truth — whether that source itself agrees with the world remains a governance decision, not an SQL question.

This boundary is not an excuse but the honest core of the matter: an SQL check guarantees that data is well-formed and internally coherent — not that it is true. Whoever says “automatically checked” has to name the two criteria left out — otherwise they are selling a green checkmark as proof of truth.

What bad data costs

You will find plenty of charts online along the lines of “37 % of all data errors are completeness errors”. No such chart appears here, deliberately — there is no citable primary source for a reliable frequency distribution per dimension, and a made-up number would be exactly the kind of data error this article is about.

What is citable is the economics behind it — and it has two sides. The first concerns when you check. The 1-10-100 rule originally comes from quality management (Labovitz and Chang, Making Quality Work, 1992) and has since been applied to data quality: what costs 1 to prevent at the source costs 10 to correct afterwards and 100 if you let the error take effect — as a wrong invoice, a lost customer, a bad decision. The specific factors are a rule of thumb, not a law of nature; the direction, though, is undisputed and matches everyday experience: an error gets more expensive the later it surfaces. That is exactly why checking at the source (staging) pays off, before the data flows on — there an error is still reportable, at the target it is fatal.

The second side concerns not when you check but how far you should push data quality. In their book Datenqualität erfolgreich steuern (3rd edition, Edition TDWI, figure 3–2, p. 43), Apel, Behme, Eberlein, and Merighi contrast two opposing cost curves: the costs caused by poor data quality fall as quality rises — bad decisions, rework, and lost customers become rarer. The costs of producing and assuring good quality rise instead, and disproportionately so the closer you get to 100 %. The sum of both — the total quality cost — has its minimum not at 100 % but at an optimum in between. That is the core message: 100 % data quality is rarely the economic target; what you are looking for is the most cost-effective combination for the purpose at hand.

Two opposing cost curves over data quality (x-axis up to 100 percent, y-axis costs): the costs caused by poor data quality fall as quality rises, the costs of improving and assuring it rise and shoot up toward 100 percent. A dashed sum curve hints at the total costs; a gray band at the intersection marks their minimum — the economic optimum, well before 100 percent.

Own rendering after Apel et al., “Datenqualität erfolgreich steuern” (3rd edition, Edition TDWI, figure 3–2, p. 43) — the classic total-cost-of-quality model.

The chart is deliberately simplified — a principle, not a measurement; the optimum shifts with data and purpose. Two practical consequences follow:

  • Achieve a lot with little effort. The production curve stays flat for a long stretch and only explodes at the top end. The cheap early stretch is the obvious NULL values, broken formats, and duplicates that generic SQL extracts with little effort — exactly the four checkable criteria. The last few percent, and even more so the two uncheckable criteria (accuracy and timeliness), cost far more and are only worth it where the domain demands it.
  • Prevent early instead of cleaning up late. Prevention at the source is the strongest lever: avoiding errors there, instead of detecting and cleansing them later, cuts total costs by roughly two thirds on average, according to the research cited in the same chapter (Redman 2008).

From theory to practice

The framework is in place: errors sort into a handful of classes, good data describes itself in a handful of criteria, and four of them are checkable with generic SQL. The road from here into practice runs through four articles:

  • The framework — Data quality checks with SQL — builds the shared error table, the severity gate, and the configurable runner that generates the three routines at runtime.
  • Validity + completeness are deepened in Validate data with SQL — value ranges, mandatory fields, and the distinction between “no value” and “unknown value”.
  • Uniqueness is deepened in Find duplicates with SQL — from count(*) > 1 to composite keys.
  • Consistency / integrity is deepened in Find orphaned records — checking referential integrity, even without a foreign-key constraint.

With the dimensions in mind, these four articles no longer read as loose tricks but as what they are: one answer each to one measurable property of good data.

FAQ

What data quality dimensions are there?

It depends on the model. The German-language practitioner’s standard reference (Apel et al., Datenqualität erfolgreich steuern) lists sixty possible quality criteria and highlights six for the business intelligence context: accuracy, consistency, reliability, completeness, timeliness, and relevance. Internationally, the six dimensions of DAMA UK (2013) and the fifteen characteristics of ISO/IEC 25012 are widespread; the academic origin is the fifteen dimensions of Wang & Strong (1996). They all describe the same subject at different resolutions — which list fits depends on the purpose.

What is the difference between validity and accuracy?

Validity means: the value conforms to the rules — right type, valid format, allowed value range. Accuracy means: the value corresponds to reality. The standard reference bundles both under “accuracy” (a content and a formal component) — but for an SQL check, exactly this separation is decisive: a birth date can be perfectly valid (well-formed, plausible) and still factually wrong. SQL checks the formal side, because the rules live in the database; content agreement with reality needs an external source of truth and cannot be checked with SQL alone. There is one exception: if a designated source of truth is available as data — such as the leading system in a data migration —, the accuracy check becomes an SQL reconciliation against that source.

How many data quality dimensions can you check with SQL?

Four criteria can be checked with generic SQL directly in the database: completeness (IS NULL check on mandatory fields), validity (value range/format), uniqueness (GROUP BY … HAVING count(*) > 1), and referential integrity/consistency (foreign-key check). Not checkable are content accuracy and timeliness — both need information from outside the database (an external reference or a timestamp).

What is the difference between technical and business data errors?

A technical (syntactic) error violates the form and is detectable without domain knowledge: letters in a numeric column, a foreign key pointing nowhere. A business (semantic) error is formally correct and still wrong: an age of 200 or a ship date before the order date. Generic SQL catches technical errors easily; business errors need an explicitly formulated business rule.

Do I need a tool to measure data quality?

Not necessarily for the four checkable criteria — a handful of generic SQL checks with a central error table covers completeness, validity, uniqueness, and referential integrity. Specialized tools (dbt tests, Great Expectations, Soda) and data-observability platforms automate and orchestrate this more comfortably and add reporting, but at their core they check the same criteria. Accuracy and timeliness they do not solve either — not without an external reference or a timestamp.