Data Quality with SQL: a Generic Checking Framework

Bad data gives no warning. An age of 200 years, a duplicate customer number, a country code that doesn’t exist — in the source system nobody notices. Only when the ETL run tries to push the rows into the strictly modelled target layer does the load break: on a CHECK, on a UNIQUE index, on a foreign key. Checking data quality with SQL means finding exactly those rows beforehand, classifying them by severity and sorting them out deliberately — without a special tool, with a handful of generic SQL routines.

The essentials up front:

Three generic check routines — a WHERE clause, a uniqueness check and a foreign-key check — cover a large share of typical data errors.
All three write into one shared error table: one row per violation, with the business key, the offending value and a plain-text message.
A severity (error / warning / information) drives a quality gate: only error-free rows flow on.
All in plain PL/pgSQL — the same basic principle that specialised data-quality tools use too, here dependency-free to build yourself.

Prerequisite: Postgres as the example engine and a central staging layer that is loaded raw first. The checks run against it set-based. That is the counter-design to the tool-centric package that mixes extraction, transformation and loading per table and scatters the quality check across the whole process (more on that in the architecture article of this series).

Data quality with SQL: the dimensions behind it
The common denominator: one error table
The configuration: one rule per row
Routine 1: the WHERE clause
Routine 2: uniqueness and cardinality
Routine 3: referential integrity
The runner: dynamic SQL — done safely
Severity and the quality gate
Why not just constraints?
What this approach does not cover
Postgres-to-SQL-Server bridge
FAQ
Related Articles

Data quality with SQL: the dimensions behind it

There is plenty to read about data quality and little to grab hold of. The literature has agreed for decades on what makes data good: Wang & Strong described fifteen dimensions in their 1996 paper “Beyond Accuracy”, the DAMA UK working group singled out six of them in 2013 as core dimensions for practice (completeness, uniqueness, timeliness, validity, accuracy, consistency), and ISO/IEC 25012 standardises data-quality characteristics as a norm. What the literature rarely delivers is the how — and when it does, it is usually tied to a particular tool.

The concrete lever, though, is obvious: anyone who wants to check data quality with SQL formulates the check as a query and ends up with a table holding the bad records. Three routines cover a large share of practice, and each one cleanly maps to one of the established dimensions:

Check routine	What it finds	Dimension
WHERE clause on one table	values outside allowed ranges, missing mandatory values	Validity (+ Completeness)
Uniqueness / maximum occurrences	duplicates, over-frequent keys	Uniqueness
Foreign key against a reference	orphaned rows without a master	Consistency / Integrity

That covers four of the six dimensions; timeliness and accuracy — in the sense of matching the real world — lie outside their reach and need other means. The theory behind it — error classes, the criteria canon, and the full coverage map including the two blind spots — is deepened in the concept article Data Quality: Dimensions and Error Classes.

Honestly placed: this approach is not new. Tools like Soda Core, dbt tests or Great Expectations follow the same approach at the core — formulate check logic, collect the hits, attach a severity; whether that happens as generated SQL or against another engine (Pandas, Spark, a data warehouse) is an implementation detail. Around that, granted, they offer more, from monitoring through profiling to lineage. The value of the home-built version is not originality but transparency: every line is readable, nothing is bound to a product, and it runs everywhere you aren’t allowed to install an extra tool.

The common denominator: one error table

The backbone is not the check but its result. All three routines write into the same table — one row per violation found, stored so that the source record can later be identified unambiguously:

  1: CREATE TABLE dq.error
  2: (
  3:     id             bigint      NOT NULL GENERATED ALWAYS AS IDENTITY
  4:    ,schema_name    text        NOT NULL
  5:    ,table_name     text        NOT NULL
  6:    ,id1_column     text
  7:    ,id1_value      text
  8:    ,id2_column     text
  9:    ,id2_value      text
 10:    ,id3_column     text
 11:    ,id3_value      text
 12:    ,error_column   text
 13:    ,error_value    text
 14:    ,severity       char(1)     NOT NULL
 15:    ,message        text        NOT NULL
 16:    ,created_on     timestamptz NOT NULL DEFAULT now()
 17:    ,CONSTRAINT pk_error            PRIMARY KEY (id)
 18:    ,CONSTRAINT ck_error_severity   CHECK (severity IN ('E', 'W', 'I'))
 19: );

  1: CREATE TABLE dq.error
  2: (
  3:     id             bigint      NOT NULL GENERATED ALWAYS AS IDENTITY
  4:    ,schema_name    text        NOT NULL
  5:    ,table_name     text        NOT NULL
  6:    ,id1_column     text
  7:    ,id1_value      text
  8:    ,id2_column     text
  9:    ,id2_value      text
 10:    ,id3_column     text
 11:    ,id3_value      text
 12:    ,error_column   text
 13:    ,error_value    text
 14:    ,severity       char(1)     NOT NULL
 15:    ,message        text        NOT NULL
 16:    ,created_on     timestamptz NOT NULL DEFAULT now()
 17:    ,CONSTRAINT pk_error            PRIMARY KEY (id)
 18:    ,CONSTRAINT ck_error_severity   CHECK (severity IN ('E', 'W', 'I'))
 19: );

The column pairs id1_column/id1_value through id3_column/id3_value hold the business key of the affected record: id1_column holds the column name (say customer_id), id1_value the value (say 4711). That lets you reconstruct the bad row later — WHERE customer_id = 4711. Three pairs are enough for composite keys; in practice one almost always suffices. Alongside, each row records the checked column (error_column), the offending value (error_value), the message in plain text and the severity (severity: Error, Warning, Information).

The configuration: one rule per row

The check rules are not cast into code but written into a table. One row = one rule. New checks arrive without a deployment, and the business side can read along with what is being checked:

  1: CREATE TABLE dq.check_rule
  2: (
  3:     id              bigint  NOT NULL GENERATED ALWAYS AS IDENTITY
  4:    ,check_type      text    NOT NULL   -- 'constraint' | 'unique' | 'lookup'
  5:    ,schema_name     text    NOT NULL
  6:    ,table_name      text    NOT NULL
  7:    ,id1_column      text    NOT NULL   -- business key (up to three)
  8:    ,id2_column      text
  9:    ,id3_column      text
 10:    ,check_column    text    NOT NULL   -- checked column
 11:    ,where_clause    text               -- 'constraint': the "bad" predicate
 12:    ,max_occurrence  int     NOT NULL DEFAULT 1   -- 'unique': allowed occurrences
 13:    ,ref_schema      text               -- 'lookup': reference table
 14:    ,ref_table       text
 15:    ,ref_column      text
 16:    ,severity        char(1) NOT NULL DEFAULT 'E'
 17:    ,message         text    NOT NULL
 18:    ,active          boolean NOT NULL DEFAULT true
 19:    ,CONSTRAINT pk_check_rule  PRIMARY KEY (id)
 20: );

  1: CREATE TABLE dq.check_rule
  2: (
  3:     id              bigint  NOT NULL GENERATED ALWAYS AS IDENTITY
  4:    ,check_type      text    NOT NULL   -- 'constraint' | 'unique' | 'lookup'
  5:    ,schema_name     text    NOT NULL
  6:    ,table_name      text    NOT NULL
  7:    ,id1_column      text    NOT NULL   -- business key (up to three)
  8:    ,id2_column      text
  9:    ,id3_column      text
 10:    ,check_column    text    NOT NULL   -- checked column
 11:    ,where_clause    text               -- 'constraint': the "bad" predicate
 12:    ,max_occurrence  int     NOT NULL DEFAULT 1   -- 'unique': allowed occurrences
 13:    ,ref_schema      text               -- 'lookup': reference table
 14:    ,ref_table       text
 15:    ,ref_column      text
 16:    ,severity        char(1) NOT NULL DEFAULT 'E'
 17:    ,message         text    NOT NULL
 18:    ,active          boolean NOT NULL DEFAULT true
 19:    ,CONSTRAINT pk_check_rule  PRIMARY KEY (id)
 20: );

check_type decides which of the three routines is built for the row. Depending on the type, different columns matter: where_clause for the WHERE check, max_occurrence for uniqueness, ref_schema/ref_table/ref_column for the foreign key. severity and message hang on every rule — so each finding carries its severity and its plain text straight from the configuration.

Routine 1: the WHERE clause

The simplest and at the same time most powerful check: a condition that describes bad rows, attached to a table. Everything the condition matches is a finding. For the rule “age must be between 0 and 120” the routine produces this statement:

  1: INSERT INTO dq.error (schema_name, table_name, id1_column, id1_value,
  2:                       error_column, error_value, severity, message)
  3: SELECT
  4:     'staging'
  5:    ,'customer'
  6:    ,'customer_id'
  7:    ,T01.customer_id::text
  8:    ,'age'
  9:    ,T01.age::text
 10:    ,'E'
 11:    ,'Age out of range 0..120'
 12: FROM
 13:    staging.customer T01
 14: WHERE
 15:    age < 0 OR age > 120;

  1: INSERT INTO dq.error (schema_name, table_name, id1_column, id1_value,
  2:                       error_column, error_value, severity, message)
  3: SELECT
  4:     'staging'
  5:    ,'customer'
  6:    ,'customer_id'
  7:    ,T01.customer_id::text
  8:    ,'age'
  9:    ,T01.age::text
 10:    ,'E'
 11:    ,'Age out of range 0..120'
 12: FROM
 13:    staging.customer T01
 14: WHERE
 15:    age < 0 OR age > 120;

The predicate on line 15 comes unchanged from where_clause. With that, this one routine covers a whole family: ranges (age < 0 OR age > 120), mandatory fields (email IS NULL), formats (length(zip) <> 5), plausibility (order_date > current_date). Two dimensions at once — validity and completeness. How to push this routine to its limits — and which NULL trap of three-valued logic it has to avoid — is covered in depth by the spoke Validating Data with SQL.

Routine 2: uniqueness and cardinality

Duplicate keys are the classic case. The check is a GROUP BY with HAVING on the count:

  1: SELECT
  2:    customer_id
  3: FROM
  4:    staging.customer
  5: GROUP BY
  6:    customer_id
  7: HAVING
  8:    count(*) > 1;

  1: SELECT
  2:    customer_id
  3: FROM
  4:    staging.customer
  5: GROUP BY
  6:    customer_id
  7: HAVING
  8:    count(*) > 1;

The real trick sits in the > 1: it comes from max_occurrence. Instead of checking for uniqueness, you check for a maximum count — > 1 for true uniqueness, > 3 for example when a key may appear at most three times. The routine then joins the keys it found back onto the table and logs every occurrence (not just the first), so that every affected row appears in the error table.

What is checked here is exactly the business key that identifies the record — the same column that lands in the error table as id1_column. That is no coincidence: this check directly mirrors the UNIQUE constraint that the strict target layer carries on the business key. What must be unique there is caught up front in the source. How to push this routine — maximum cardinality, composite keys and the NULL trap in UNIQUE, where SQL Server and Postgres differ — is deepened by the spoke Finding Duplicates with SQL.

Routine 3: referential integrity

A foreign key into the void — a country_code for which there is no entry in the master-data table. Put generically: all rows of the child table that find no partner in the master table via LEFT JOIN:

  1: SELECT
  2:    T01.customer_id
  3: FROM
  4:    staging.customer T01
  5:    LEFT JOIN staging.country T02
  6:    ON
  7:      T01.country_code = T02.country_code
  8: WHERE
  9:        T01.country_code IS NOT NULL
 10:    AND T02.country_code IS NULL;

  1: SELECT
  2:    T01.customer_id
  3: FROM
  4:    staging.customer T01
  5:    LEFT JOIN staging.country T02
  6:    ON
  7:      T01.country_code = T02.country_code
  8: WHERE
  9:        T01.country_code IS NOT NULL
 10:    AND T02.country_code IS NULL;

Child table, child column (check_column) and master (ref_schema/ref_table/ref_column) come from the configuration — so the same routine works for any master-child relationship. The IS NOT NULL condition on line 9 deliberately separates “unknown value” (an error) from “no value given” (that is the WHERE routine’s job). How to push this routine — the three phrasings LEFT JOIN … IS NULL/NOT EXISTS/NOT IN, the notorious NOT IN-plus-NULL trap, and composite and self-referencing foreign keys — is deepened by the spoke Finding Orphaned Records with SQL.

The runner: dynamic SQL — done safely

The three statements above are hard-coded. They become generic when a function assembles them at runtime from the configuration. In PL/pgSQL, format() is the right tool — and the point where you have to be careful. Identifiers belong in with %I, literals with %L; both are quoted correctly by Postgres and rule out SQL injection through table and column names. Here is the branch for the WHERE check:

  1: l_sql := format($sql$INSERT INTO dq.error
  2:                      (
  3:                          schema_name
  4:                         ,table_name
  5:                         ,id1_column
  6:                         ,id1_value
  7:                         ,error_column
  8:                         ,error_value
  9:                         ,severity
 10:                         ,message
 11:                      )
 12:                      SELECT
 13:                          %1$L
 14:                         ,%2$L
 15:                         ,%3$L
 16:                         ,T01.%4$I::text
 17:                         ,%5$L
 18:                         ,T01.%6$I::text
 19:                         ,%7$L
 20:                         ,%8$L
 21:                      FROM
 22:                         %9$I.%10$I T01
 23:                      WHERE
 24:                         %11$s
 25:                 $sql$
 26:    ,l_rule.schema_name
 27:    ,l_rule.table_name
 28:    ,l_rule.id1_column
 29:    ,l_rule.id1_column
 30:    ,l_rule.check_column
 31:    ,l_rule.check_column
 32:    ,l_rule.severity
 33:    ,l_rule.message
 34:    ,l_rule.schema_name
 35:    ,l_rule.table_name
 36:    ,l_rule.where_clause
 37: );
 38: EXECUTE l_sql;

  1: l_sql := format($sql$INSERT INTO dq.error
  2:                      (
  3:                          schema_name
  4:                         ,table_name
  5:                         ,id1_column
  6:                         ,id1_value
  7:                         ,error_column
  8:                         ,error_value
  9:                         ,severity
 10:                         ,message
 11:                      )
 12:                      SELECT
 13:                          %1$L
 14:                         ,%2$L
 15:                         ,%3$L
 16:                         ,T01.%4$I::text
 17:                         ,%5$L
 18:                         ,T01.%6$I::text
 19:                         ,%7$L
 20:                         ,%8$L
 21:                      FROM
 22:                         %9$I.%10$I T01
 23:                      WHERE
 24:                         %11$s
 25:                 $sql$
 26:    ,l_rule.schema_name
 27:    ,l_rule.table_name
 28:    ,l_rule.id1_column
 29:    ,l_rule.id1_column
 30:    ,l_rule.check_column
 31:    ,l_rule.check_column
 32:    ,l_rule.severity
 33:    ,l_rule.message
 34:    ,l_rule.schema_name
 35:    ,l_rule.table_name
 36:    ,l_rule.where_clause
 37: );
 38: EXECUTE l_sql;

The schema and column names (lines 16, 18, 22) go through %I, the fixed values through %L. The uniqueness and foreign-key branches are built on the same pattern — only the inner statement differs.

One spot stays deliberately raw: the predicate on line 24 is inserted with %s, i.e. as unchanged SQL. It has to be — where_clause is a SQL expression, not a value. That makes the dq.check_rule table the trust boundary of the system: whoever may write there can have arbitrary SQL executed. In practice this is uncritical, because the configuration is maintained administratively and never filled from user input — but you have to know it and secure it. Identifiers, by contrast, are watertight through %I: a column name like age"; DROP TABLE staging.customer; -- from the configuration becomes a (non-existent) quoted identifier and raises a clean error instead of dropping the table.

Across all active rules of a table the runner loops, executes the built statement per rule and finally writes the severity counters back.

Take it with you: The complete framework — error table, rule table, runner, demo data and gate query — is available as a runnable SQL bundle for download (PostgreSQL 13+, verified against Postgres 16). An empty database is all you need; the script sets up everything itself.

A word on runtime, to be honest: most freely configured predicates run as a full-table scan over the source table, because no matching index exists for an arbitrary expression — across many rules on large tables that adds up. A simple email IS NULL or a key check may well use an index, but with freely configured rules you cannot rely on it. It is acceptable because the check runs in the staging window that is scheduled anyway, against the freshly loaded set, not against the production system. For very large tables it pays to limit the check to the partitions or batches loaded in the current run, instead of scanning everything every time.

Severity and the quality gate

Up to here, dq.error holds what is wrong. Steering the ETL process, however, needs a statement per record: may it proceed or not? For that the source table gets three counter columns — sys_error, sys_warning, sys_info — and the runner fills them after each run: per business key the number of findings by severity.

The write-back here runs over the single-column business key (id1) — by far the most common case. Gating composite keys across several columns would work the same way but is material for its own spoke; the error table already holds the key parts id1–id3 for it.

That turns the gate into a trivial WHERE condition:

  1: SELECT
  2:     customer_id
  3:    ,country_code
  4:    ,email
  5:    ,age
  6: FROM
  7:    staging.customer
  8: WHERE
  9:    sys_error = 0;

  1: SELECT
  2:     customer_id
  3:    ,country_code
  4:    ,email
  5:    ,age
  6: FROM
  7:    staging.customer
  8: WHERE
  9:    sys_error = 0;

Only rows without an error flow into the next layer. Warnings and information do not block — they are logged but no obstacle. That is the whole point of severity: it separates “must not proceed” from “worth a look”. In the demo set of seven rows exactly two pass the gate — the clean row and the row with the missing email (a warning only). Everything with age out of range, the unknown country and the duplicate customer number stays behind, neatly logged:

severity	id1_value	error_column	error_value	message
E	1	customer_id	1	customer_id not unique
E	1	customer_id	1	customer_id not unique
E	2	age	200	Age out of range 0..120
E	3	country_code	XX	Unknown country_code
E	5	age	-3	Age out of range 0..120
W	4	email		Email missing

Why not just constraints?

The obvious question: if the target layer has CHECK, UNIQUE and foreign-key constraints anyway — why the effort? The answer lies precisely there. The downstream tables are strictly modelled; that is intended. But a constraint knows only two outcomes: the row fits, or the whole load breaks. When loading thousands of rows, “breaks” is the worst of all options — a single bad row stops the entire process, and you don’t even know which one.

That is exactly why you check in the source up front: you identify all rows that would fail at the target constraints, classify them by severity and let only the clean ones pass. The framework does not rebuild the constraints — it is the transparent, auditable pre-filter before a deliberately strict target layer. Instead of an aborted load you get a table of findings and a process that carries on with the good data.

In fact the three routines are exactly the pre-filter for the three constraint types the target enforces:

Constraint at the target	Check routine in the source
`CHECK`	WHERE clause (Routine 1)
`UNIQUE`	Uniqueness (Routine 2)
`FOREIGN KEY`	Foreign key (Routine 3)

What is enforced as a constraint at the target is checked in the source up front — on the same business key that carries the UNIQUE constraint at the target. The difference is not what is checked but how the violation is handled: report and classify instead of aborting the load.

What this approach does not cover

Three routines are a lot, but not everything — and it pays to be honest about where the line lies:

Repair is deliberately not part of it. The framework finds and makes transparent; it corrects nothing. That is a decision, not an omission — transparency first.
Cross-field business rules (“discount only if status = active”) can sometimes be expressed as a WHERE clause, sometimes not.
Temporal consistency, accuracy against an external truth, complex patterns (beyond simple length and format checks) are topics of their own.

And if you’d rather use a finished tool: Soda Core, dbt tests and Great Expectations cover the same area, free and well maintained. The SQL home-build pays off when you want transparency, zero dependencies and full control over every line — or simply work in an environment where no extra tool may be installed.

Postgres-to-SQL-Server bridge

The pattern is not Postgres-specific. In SQL Server, sp_executesql takes the role of EXECUTE format(); identifiers are protected there with QUOTENAME() instead of %I. The structure stays identical: a configuration table, a cursor (or a loop) over the rules, a dynamically built INSERT … SELECT … WHERE per rule, a shared error table and severity columns in the source as a gate. Anyone coming from SQL Server transfers the approach one to one.

FAQ

Do I need a dedicated tool for data quality?

No. Checking data quality with SQL works with three generic routines — range, uniqueness and foreign-key checks — that cover a large share of typical errors. Tools like Soda or dbt take work off your hands but at the core do the same: build SQL, collect hits, attach a severity.

Isn’t dynamic SQL a security risk?

Only if you glue inputs together raw. With format() and %I (identifiers) or %L (literals), Postgres quotes correctly and injection through table/column names is ruled out. The freely configurable predicate (where_clause) is deliberately raw SQL — which is why the configuration table is the trust boundary and belongs under administrative protection.

Why not just use the target table’s CHECK and foreign-key constraints?

Because a constraint aborts the whole load instead of reporting bad rows one by one. The check in the source identifies all rows up front that would fail at the target, classifies them and lets only clean ones pass — auditable instead of aborted.

How do I find the faulty record again from the error table?

Via the id*_column/id*_value pairs: they hold the column name and value of the business key. WHERE <id1_column> = <id1_value> leads back to the source row. For composite keys, up to three pairs are available.

Does this work in SQL Server too?

Yes. sp_executesql replaces EXECUTE format(), QUOTENAME() replaces %I. Configuration table, error table, severity gate and the three routines stay structurally the same.

Data Quality in an ETL Process — the bigger picture: catching technical and business errors before they reach the target system.
Design Pattern // The Architecture of an ETL Process — how bad data is cleanly isolated across layered staging levels.
Data Quality // Fundamentals of Type Conversion with T-SQL — the field-wise validity check when converting, the “T” building block of this series.
Validating Data with SQL — the spoke for Routine 1: value ranges, required fields and the NULL trap of three-valued logic in detail.
Finding Duplicates with SQL — the spoke for Routine 2: maximum cardinality, composite keys and the NULL trap of UNIQUE semantics between SQL Server and Postgres.
Finding Orphaned Records with SQL — the spoke for Routine 3: LEFT JOIN … IS NULL vs. NOT EXISTS vs. NOT IN, the NOT-IN-plus-NULL trap and composite/self-referencing foreign keys.
Data Quality: Dimensions and Error Classes — the theory frame of the series: error classes, the criteria per Apel et al., and an honest coverage map of what SQL reaches — and the two criteria it cannot.

Checking Data Quality with SQL — a Configurable Framework for Spotting Bad Data Generically

Contents

Data quality with SQL: the dimensions behind it

The common denominator: one error table

The configuration: one rule per row

Routine 1: the WHERE clause

Routine 2: uniqueness and cardinality

Routine 3: referential integrity

The runner: dynamic SQL — done safely

Severity and the quality gate

Why not just constraints?

What this approach does not cover

Postgres-to-SQL-Server bridge

FAQ

Contents

Data quality with SQL: the dimensions behind it

The common denominator: one error table

The configuration: one rule per row

Routine 1: the WHERE clause

Routine 2: uniqueness and cardinality

Routine 3: referential integrity

The runner: dynamic SQL — done safely

Severity and the quality gate

Why not just constraints?

What this approach does not cover

Postgres-to-SQL-Server bridge

FAQ

Related Articles