Checking Data Quality with SQL — a Configurable Framework for Spotting Bad Data Generically

Bad data gives no warning. An age of 200 years, a duplicate customer number, a country code that doesn’t exist — in the source system nobody notices. Only when the ETL run tries to push the rows into the strictly modelled target layer does the load break: on a CHECK, on a UNIQUE index, on a foreign key. Checking data quality with SQL means finding exactly those rows beforehand, classifying them by severity and sorting them out deliberately — without a special tool, with a handful of generic SQL routines.

The essentials up front:

  • Three generic check routines — a WHERE clause, a uniqueness check and a foreign-key check — cover a large share of typical data errors.
  • All three write into one shared error table: one row per violation, with the business key, the offending value and a plain-text message.
  • severity (error / warning / information) drives a quality gate: only error-free rows flow on.
  • All in plain PL/pgSQL — the same basic principle that specialised data-quality tools use too, here dependency-free to build yourself.

Prerequisite: Postgres as the example engine and a central staging layer that is loaded raw first. The checks run against it set-based — the counter-design to the tool-centric package that mixes extraction, transformation and loading per table and scatters the quality check across the whole process (more on that in the architecture article of this series).

Data quality with SQL: the dimensions behind it

There is plenty to read about data quality and little to grab hold of. The literature has agreed for decades on what makes data good: Wang & Strong described fifteen dimensions in their 1996 paper “Beyond Accuracy”, the DAMA group condensed them in 2013 into six core dimensions (completeness, uniqueness, timeliness, validity, accuracy, consistency), and ISO/IEC 25012 turned the whole thing into a standard. What the literature rarely delivers is the how — and when it does, it is usually tied to a particular tool.

The concrete lever, though, is obvious: anyone who wants to check data quality with SQL formulates the check as a query and ends up with a table holding the bad records. Three routines cover a large share of practice, and each one cleanly maps to one of the established dimensions:

Check routineWhat it findsDimension
WHERE clause on one tablevalues outside allowed ranges, missing mandatory valuesValidity (+ Completeness)
Uniqueness / maximum occurrencesduplicates, over-frequent keysUniqueness
Foreign key against a referenceorphaned rows without a masterConsistency / Integrity

That covers four of the six dimensions; timeliness and accuracy — in the sense of matching the real world — lie outside their reach and need other means.

Honestly placed: this approach is not new. Tools like Soda Core, dbt tests or Great Expectations do nothing else under the hood — they generate SQL, collect the hits and attach a severity; around that, granted, they offer more, from monitoring through profiling to lineage. The value of the home-built version is not originality but transparency: every line is readable, nothing is bound to a product, and it runs everywhere you aren’t allowed to install an extra tool.

The common denominator: one error table

The backbone is not the check but its result. All three routines write into the same table — one row per violation found, stored so that the source record can later be identified unambiguously:

  1: CREATE TABLE dq.error
  2: (
  3:     id             bigint      NOT NULL GENERATED ALWAYS AS IDENTITY
  4:    ,schema_name    text        NOT NULL
  5:    ,table_name     text        NOT NULL
  6:    ,id1_column     text
  7:    ,id1_value      text
  8:    ,id2_column     text
  9:    ,id2_value      text
 10:    ,id3_column     text
 11:    ,id3_value      text
 12:    ,error_column   text
 13:    ,error_value    text
 14:    ,severity       char(1)     NOT NULL
 15:    ,message        text        NOT NULL
 16:    ,created_on     timestamptz NOT NULL DEFAULT now()
 17:    ,CONSTRAINT pk_error            PRIMARY KEY (id)
 18:    ,CONSTRAINT ck_error_severity   CHECK (severity IN ('E', 'W', 'I'))
 19: );

The column pairs id1_column/id1_value through id3_column/id3_value hold the business key of the affected record: id1_column holds the column name (say customer_id), id1_value the value (say 4711). That lets you reconstruct the bad row later — WHERE customer_id = 4711. Three pairs are enough for composite keys; in practice one almost always suffices. Alongside, each row records the checked column (error_column), the offending value (error_value), the message in plain text and the severity (severityError, Warning, Information).

The configuration: one rule per row

The check rules are not cast into code but written into a table. One row = one rule. New checks arrive without a deployment, and the business side can read along with what is being checked:

  1: CREATE TABLE dq.check_rule
  2: (
  3:     id              bigint  NOT NULL GENERATED ALWAYS AS IDENTITY
  4:    ,check_type      text    NOT NULL   -- 'constraint' | 'unique' | 'lookup'
  5:    ,schema_name     text    NOT NULL
  6:    ,table_name      text    NOT NULL
  7:    ,id1_column      text    NOT NULL   -- business key (up to three)
  8:    ,id2_column      text
  9:    ,id3_column      text
 10:    ,check_column    text    NOT NULL   -- checked column
 11:    ,where_clause    text               -- 'constraint': the "bad" predicate
 12:    ,max_occurrence  int     NOT NULL DEFAULT 1   -- 'unique': allowed occurrences
 13:    ,ref_schema      text               -- 'lookup': reference table
 14:    ,ref_table       text
 15:    ,ref_column      text
 16:    ,severity        char(1) NOT NULL DEFAULT 'E'
 17:    ,message         text    NOT NULL
 18:    ,active          boolean NOT NULL DEFAULT true
 19:    ,CONSTRAINT pk_check_rule  PRIMARY KEY (id)
 20: );

check_type decides which of the three routines is built for the row. Depending on the type, different columns matter: where_clause for the WHERE check, max_occurrence for uniqueness, ref_schema/ref_table/ref_column for the foreign key. severity and message hang on every rule — so each finding carries its severity and its plain text straight from the configuration.

Routine 1: the WHERE clause

The simplest and at the same time most powerful check: a condition that describes bad rows, attached to a table. Everything the condition matches is a finding. For the rule “age must be between 0 and 120” the routine produces this statement:

  1: INSERT INTO dq.error (schema_name, table_name, id1_column, id1_value,
  2:                       error_column, error_value, severity, message)
  3: SELECT
  4:     'staging'
  5:    ,'customer'
  6:    ,'customer_id'
  7:    ,T01.customer_id::text
  8:    ,'age'
  9:    ,T01.age::text
 10:    ,'E'
 11:    ,'Age out of range 0..120'
 12: FROM
 13:    staging.customer T01
 14: WHERE
 15:    age < 0 OR age > 120;

The predicate on line 15 comes unchanged from where_clause. With that, this one routine covers a whole family: ranges (age < 0 OR age > 120), mandatory fields (email IS NULL), formats (length(zip) <> 5), plausibility (order_date > current_date). Two dimensions at once — validity and completeness.

Routine 2: uniqueness and cardinality

Duplicate keys are the classic case. The check is a GROUP BY with HAVING on the count:

  1: SELECT
  2:    customer_id
  3: FROM
  4:    staging.customer
  5: GROUP BY
  6:    customer_id
  7: HAVING
  8:    count(*) > 1;

The real trick sits in the > 1: it comes from max_occurrence. Instead of checking for uniqueness, you check for a maximum count — > 1 for true uniqueness, > 3 for example when a key may appear at most three times. The routine then joins the keys it found back onto the table and logs every occurrence (not just the first), so that every affected row appears in the error table.

What is checked here is exactly the business key that identifies the record — the same column that lands in the error table as id1_column. That is no coincidence: this check directly mirrors the UNIQUE constraint that the strict target layer carries on the business key. What must be unique there is caught up front in the source.

Routine 3: referential integrity

A foreign key into the void — a country_code for which there is no entry in the master-data table. Put generically: all rows of the child table that find no partner in the master table via LEFT JOIN:

  1: SELECT
  2:    T01.customer_id
  3: FROM
  4:    staging.customer T01
  5:    LEFT JOIN staging.country T02
  6:    ON
  7:      T01.country_code = T02.country_code
  8: WHERE
  9:        T01.country_code IS NOT NULL
 10:    AND T02.country_code IS NULL;

Child table, child column (check_column) and master (ref_schema/ref_table/ref_column) come from the configuration — so the same routine works for any master-child relationship. The IS NOT NULL condition on line 9 deliberately separates “unknown value” (an error) from “no value given” (that is the WHERE routine’s job).

The runner: dynamic SQL — done safely

The three statements above are hard-coded. They become generic when a function assembles them at runtime from the configuration. In PL/pgSQL, format() is the right tool — and the point where you have to be careful. Identifiers belong in with %I, literals with %L; both are quoted correctly by Postgres and rule out SQL injection through table and column names. Here is the branch for the WHERE check:

  1: l_sql := format($sql$INSERT INTO dq.error
  2:                      (
  3:                          schema_name
  4:                         ,table_name
  5:                         ,id1_column
  6:                         ,id1_value
  7:                         ,error_column
  8:                         ,error_value
  9:                         ,severity
 10:                         ,message
 11:                      )
 12:                      SELECT
 13:                          %1$L
 14:                         ,%2$L
 15:                         ,%3$L
 16:                         ,T01.%4$I::text
 17:                         ,%5$L
 18:                         ,T01.%6$I::text
 19:                         ,%7$L
 20:                         ,%8$L
 21:                      FROM
 22:                         %9$I.%10$I T01
 23:                      WHERE
 24:                         %11$s
 25:                 $sql$
 26:    ,l_rule.schema_name
 27:    ,l_rule.table_name
 28:    ,l_rule.id1_column
 29:    ,l_rule.id1_column
 30:    ,l_rule.check_column
 31:    ,l_rule.check_column
 32:    ,l_rule.severity
 33:    ,l_rule.message
 34:    ,l_rule.schema_name
 35:    ,l_rule.table_name
 36:    ,l_rule.where_clause
 37: );
 38: EXECUTE l_sql;

The schema and column names (lines 16, 18, 22) go through %I, the fixed values through %L. The uniqueness and foreign-key branches are built on the same pattern — only the inner statement differs.

One spot stays deliberately raw: the predicate on line 24 is inserted with %s, i.e. as unchanged SQL. It has to be — where_clause is a SQL expression, not a value. That makes the dq.check_rule table the trust boundary of the system: whoever may write there can have arbitrary SQL executed. In practice this is uncritical, because the configuration is maintained administratively and never filled from user input — but you have to know it and secure it. Identifiers, by contrast, are watertight through %I: a column name like age"; DROP TABLE staging.customer; -- from the configuration becomes a (non-existent) quoted identifier and raises a clean error instead of dropping the table.

Across all active rules of a table the runner loops, executes the built statement per rule and finally writes the severity counters back.

A word on runtime, to be honest: each rule is at its core a full-table scan over the source table, and for a freely configured predicate there is no matching index — across many rules on large tables that adds up. It is acceptable because the check runs in the staging window that is scheduled anyway, against the freshly loaded set, not against the production system. For very large tables it pays to limit the check to the partitions or batches loaded in the current run, instead of scanning everything every time.

Severity and the quality gate

Up to here, dq.error holds what is wrong. Steering the ETL process, however, needs a statement per record: may it proceed or not? For that the source table gets three counter columns — sys_errorsys_warningsys_info — and the runner fills them after each run: per business key the number of findings by severity.

The write-back here runs over the single-column business key (id1) — by far the most common case. Gating composite keys across several columns would work the same way but is material for its own spoke; the error table already holds the key parts id1id3 for it.

That turns the gate into a trivial WHERE condition:

  1: SELECT
  2:     customer_id
  3:    ,country_code
  4:    ,email
  5:    ,age
  6: FROM
  7:    staging.customer
  8: WHERE
  9:    sys_error = 0;

Only rows without an error flow into the next layer. Warnings and information do not block — they are logged but no obstacle. That is the whole point of severity: it separates “must not proceed” from “worth a look”. In the demo set of seven rows exactly two pass the gate — the clean row and the row with the missing email (a warning only). Everything with age out of range, the unknown country and the duplicate customer number stays behind, neatly logged:

severityid1_valueerror_columnerror_valuemessage
E1customer_id1customer_id not unique
E1customer_id1customer_id not unique
E2age200Age out of range 0..120
E3country_codeXXUnknown country_code
E5age-3Age out of range 0..120
W4emailEmail missing

Why not just constraints?

The obvious question: if the target layer has CHECKUNIQUE and foreign-key constraints anyway — why the effort? The answer lies precisely there. The downstream tables are strictly modelled; that is intended. But a constraint knows only two outcomes: the row fits, or the whole load breaks. When loading thousands of rows, “breaks” is the worst of all options — a single bad row stops the entire process, and you don’t even know which one.

That is exactly why you check in the source up front: you identify all rows that would fail at the target constraints, classify them by severity and let only the clean ones pass. The framework does not rebuild the constraints — it is the transparent, auditable pre-filter before a deliberately strict target layer. Instead of an aborted load you get a table of findings and a process that carries on with the good data.

In fact the three routines are exactly the pre-filter for the three constraint types the target enforces:

Constraint at the targetCheck routine in the source
CHECKWHERE clause (Routine 1)
UNIQUEUniqueness (Routine 2)
FOREIGN KEYForeign key (Routine 3)

What is enforced as a constraint at the target is checked in the source up front — on the same business key that carries the UNIQUE constraint at the target. The difference is not what is checked but how the violation is handled: report and classify instead of aborting the load.

What this approach does not cover

Three routines are a lot, but not everything — and it pays to be honest about where the line lies:

  • Repair is deliberately not part of it. The framework finds and makes transparent; it corrects nothing. That is a decision, not an omission — transparency first.
  • Cross-field business rules (“discount only if status = active”) can sometimes be expressed as a WHERE clause, sometimes not.
  • Temporal consistency, accuracy against an external truth, complex patterns (beyond simple length and format checks) are topics of their own.

And if you’d rather use a finished tool: Soda Core, dbt tests and Great Expectations cover the same area, free and well maintained. The SQL home-build pays off when you want transparency, zero dependencies and full control over every line — or simply work in an environment where no extra tool may be installed.

Postgres-to-SQL-Server bridge

The pattern is not Postgres-specific. In SQL Server, sp_executesql takes the role of EXECUTE format(); identifiers are protected there with QUOTENAME() instead of %I. The structure stays identical: a configuration table, a cursor (or a loop) over the rules, a dynamically built INSERT … SELECT … WHERE per rule, a shared error table and severity columns in the source as a gate. Anyone coming from SQL Server transfers the approach one to one.

FAQ

Do I need a dedicated tool for data quality?

No. Checking data quality with SQL works with three generic routines — range, uniqueness and foreign-key checks — that cover a large share of typical errors. Tools like Soda or dbt take work off your hands but at the core do the same: build SQL, collect hits, attach a severity.

Isn’t dynamic SQL a security risk?

Only if you glue inputs together raw. With format() and %I (identifiers) or %L (literals), Postgres quotes correctly and injection through table/column names is ruled out. The freely configurable predicate (where_clause) is deliberately raw SQL — which is why the configuration table is the trust boundary and belongs under administrative protection.

Why not just use the target table’s CHECK and foreign-key constraints?

Because a constraint aborts the whole load instead of reporting bad rows one by one. The check in the source identifies all rows up front that would fail at the target, classifies them and lets only clean ones pass — auditable instead of aborted.

How do I find the faulty record again from the error table?

Via the id*_column/id*_value pairs: they hold the column name and value of the business key. WHERE <id1_column> = <id1_value> leads back to the source row. For composite keys, up to three pairs are available.

Does this work in SQL Server too?

Yes. sp_executesql replaces EXECUTE format()QUOTENAME() replaces %I. Configuration table, error table, severity gate and the three routines stay structurally the same.

Related Articles