Finding Orphaned Records — Checking Referential Integrity Without a Foreign Key

A foreign key pointing into the void is quick to describe — a country_code with no matching entry in the reference data — and surprisingly easy to check incorrectly. The most intuitive phrasing is, of all things, the most dangerous: NOT IN (SELECT …) reads like plain English but silently collapses the moment the reference column holds a single NULL. The check then reports zero orphaned records, runs green, and finds nothing — while the bad rows flow on unnoticed until they hit the foreign-key constraint at the target and bring the whole load down. Anyone who wants to find orphaned records has to know this trap.

The key points up front:

  • The basic form is a LEFT JOIN … IS NULL: every child row that finds no partner in the master table — generic across any master-child relationship.
  • Three phrasings do the same job, but not equally safely: LEFT JOIN … IS NULL and NOT EXISTS are NULL-robust, NOT IN is not — a single NULL in the reference paralyzes the entire check.
  • Composite foreign keys are checked with a multi-column join — every key part belongs in the ON clause.
  • Self-referencing relationships (a hierarchy such as manager_id → employee_id) are checked with a self-join; root nodes (IS NULL) are not orphans.

Prerequisite: Postgres as the example engine and the shared checking scaffold from the framework article — a central error table dq.error that records one row per violation, and a staging layer to check against. This article drills into the third of its three routines: referential integrity.

Contents

Finding Orphaned Records: the Basic Form

An orphaned record is a row whose foreign key points to a master that does not exist: an order to a deleted customer, a customer with a country_code missing from the country table. As long as no foreign-key constraint enforces the relationship — and in a raw-loaded staging layer it deliberately does not — such a row is technically allowed and functionally broken.

Put generically, you look for all rows of the child table that find no partner in the master table via a LEFT JOIN. Wherever the join matches nothing, the LEFT JOIN fills the master columns with NULL — and that NULL is exactly what you filter on:

  1: SELECT
  2:     T01.customer_id
  3:    ,T01.country_code
  4: FROM
  5:    staging.customer T01
  6:    LEFT JOIN staging.country T02
  7:    ON
  8:      T01.country_code = T02.country_code
  9: WHERE
 10:        T01.country_code IS NOT NULL
 11:    AND T02.country_code IS NULL;
 12: -- 3 | XX     orphaned record: country_code XX has no master

The result is customer 3 with the code XX, which does not exist in the country table. The child table, the child column, and the master are the only three knobs — which makes the same query usable for any master-child relationship. This is the dimension consistency / integrity: does every reference point to something that actually exists?

The condition on line 10 (T01.country_code IS NOT NULL) is not decoration — it separates “unknown value” from “no value.” More on that in a moment; first the actual trap.

LEFT JOIN vs. NOT EXISTS vs. NOT IN — and the NULL Trap

The same question — “which child rows have no master?” — can be phrased three ways. All three return the same result on clean data; under NULL they diverge.

The first two are robust. The LEFT JOIN … IS NULL from the basic form is one form; NOT EXISTS is the other and often reads most directly — “keep the row if there is no matching master”:

  1: SELECT
  2:     T01.customer_id
  3:    ,T01.country_code
  4: FROM
  5:    staging.customer T01
  6: WHERE
  7:    T01.country_code IS NOT NULL
  8:    AND NOT EXISTS (
  9:       SELECT 1
 10:       FROM   staging.country T02
 11:       WHERE  T02.country_code = T01.country_code
 12:    );
 13: -- 3 | XX     identical to the LEFT JOIN form

Now the trap. The most obvious phrasing is NOT IN — “keep the row if its code is not in the list of countries.” It works flawlessly as long as the reference column holds no NULL. But the moment a single master row has a NULL code — a placeholder, an import error, or simply an unconstrained staging column — NOT IN stops reporting anything for every row:

  1: SELECT
  2:     T01.customer_id
  3:    ,T01.country_code
  4: FROM
  5:    staging.customer T01
  6: WHERE
  7:    T01.country_code NOT IN (
  8:       SELECT country_code
  9:       FROM   staging.country
 10:    );
 11: -- (0 rows)     the check runs green and finds NOTHING

The reason lies in three-valued logic. The reference subquery returns four values — including the single NULL:

staging.country
DE
FR
US
NULL ← the bad row

For the real orphan XX, Postgres evaluates NOT IN as a chain of inequality comparisons:

'XX' NOT IN ('DE', 'FR', 'US', NULL)
  =  'XX' <> 'DE'  AND  'XX' <> 'FR'  AND  'XX' <> 'US'  AND  'XX' <> NULL
  =  TRUE          AND  TRUE          AND  TRUE          AND  UNKNOWN
  =  UNKNOWN

The last comparison 'XX' <> NULL yields not FALSE but UNKNOWN — the database cannot claim that XX is unequal to an unknown value (it might well be XX). And anything AND UNKNOWN never becomes TRUE. But the WHERE keeps only TRUE rows: for every row the expression falls to UNKNOWN, none passes. The check runs green and overlooks the real orphan XX — the worst case of a quality check: not a false hit, but silent pass-through.

If you want to stay with NOT IN, you have to exclude the NULL from the subquery:

  1: SELECT
  2:     T01.customer_id
  3:    ,T01.country_code
  4: FROM
  5:    staging.customer T01
  6: WHERE
  7:    T01.country_code IS NOT NULL
  8:    AND T01.country_code NOT IN (
  9:       SELECT country_code
 10:       FROM   staging.country
 11:       WHERE  country_code IS NOT NULL
 12:    );
 13: -- 3 | XX     correct again now

That works — but it is exactly the extra effort you never need with LEFT JOIN … IS NULL and NOT EXISTS. Hence the simple rule of thumb: for orphan checks use NOT EXISTS or LEFT JOIN … IS NULL, and avoid NOT IN for this job.

“Unknown” Is Not “Empty”

In the robust variants — the LEFT JOIN basic form and the NOT EXISTS version (and likewise in the repaired NOT IN variant) — the condition T01.country_code IS NOT NULL appears. It is the second quiet stumbling block. An orphaned record has a value that points nowhere (XX — does not exist); distinct from it is the row that has no value at all (country_code IS NULL — customer 4 in the demo data).

Both are data errors, but different ones: an unknown reference is an integrity problem (routine 3, this article), a missing required value a completeness problem (routine 1, the job of Validating Data with SQL). Lump them into one check and the same row ends up in two findings with two messages, blurring the error table. The IS NOT NULL condition draws the line cleanly: this check reports only “points to something nonexistent,” not “points to nothing.”

Composite Foreign Keys

The foreign key is rarely a single column. A sales record might reference a reference table jointly via region_code and product_code — a table recording which product is carried in which region. The record is orphaned when the complete pair is missing from the master, even if each part occurs on its own. Every key part belongs in the ON clause:

  1: SELECT
  2:     T01.sale_id
  3:    ,T01.region_code
  4:    ,T01.product_code
  5: FROM
  6:    staging.sales T01
  7:    LEFT JOIN staging.product_region T02
  8:    ON
  9:           T02.region_code  = T01.region_code
 10:       AND T02.product_code = T01.product_code
 11: WHERE
 12:        T01.region_code  IS NOT NULL
 13:    AND T01.product_code IS NOT NULL
 14:    AND T02.region_code  IS NULL;
 15: -- 2 | US | B     the pair (US, B) has no master

The hit is sale 2 with (US, B): the region US exists in the master, the product B too — but not the combination (US, B). Forget a key column in the ON clause and you check a coarser key and get wrong results: with only region_code in the join, (US, B) would count as covered because US does occur, and the orphan would go undetected. It is enough for the master side of one key part to be NULL (line 14 checks region_code) to detect the missing partner.

Self-Referencing Relationships

Some tables reference themselves. An employee hierarchy keeps in manager_id the employee_id of the supervisor — from the same table. A record is orphaned when its manager_id points to no existing employee_id. The check is a self-join: the same table twice, T02 as the “manager side”:

  1: SELECT
  2:     T01.employee_id
  3:    ,T01.manager_id
  4: FROM
  5:    staging.employee T01
  6:    LEFT JOIN staging.employee T02
  7:    ON
  8:      T02.employee_id = T01.manager_id
  9: WHERE
 10:        T01.manager_id  IS NOT NULL
 11:    AND T02.employee_id IS NULL;
 12: -- 4 | 99     manager_id 99 has no matching employee_id

Again the IS NOT NULL condition (line 10) does the decisive work, here with a different meaning: the root of the hierarchy — the CEO with no supervisor — legitimately has manager_id IS NULL. That is not an orphan but the expected end of the chain. Without the condition, every root would be falsely reported as an orphan. Correctly, only employee 4 remains, whose manager_id 99 points to no one.

Performance: the Index on the Child Column

At its core the orphan check is an anti-join, and Postgres implements that efficiently. With an index on the child column (the foreign key), the planner can choose a Hash Anti Join or an index-backed plan for NOT EXISTS or LEFT JOIN … IS NULL, rather than looking into the master separately for every child row. You cannot force this — the planner decides by cost and often still picks a sequential scan for small sets — but on large tables the index on the foreign-key column can tip the balance. That very index is often missing in raw-loaded staging tables, precisely because no constraints sit there.

At its core it remains a pass that matches each child row once against the master. So the same lever as in any staging check applies here too: limit the check to the currently loaded batch with a load marker (load_idbatch_id) instead of re-checking the entire stock each time. And if the target carries a foreign-key constraint on the relationship anyway, the upfront check is exactly the work the database would do at load time — just in the source, where a violation is reportable rather than fatal.

Generic via the Config Routine

Hard-coding this check is no more practical than the others. It becomes generic via a configuration row, from which the framework article’s runner builds the LEFT JOIN … IS NULL against the configured reference table at runtime:

  1: INSERT INTO dq.check_rule
  2:    (check_type, schema_name, table_name, id1_column, check_column, ref_schema, ref_table, ref_column, severity, message)
  3: VALUES
  4:    ('lookup', 'staging', 'customer', 'customer_id', 'country_code', 'staging', 'country', 'country_code', 'E', 'Unknown country_code');

check_type = 'lookup' selects the integrity branch in the runner; check_column is the child column, ref_schema/ref_table/ref_column name the master. The runner inserts all identifiers via format() with %I (injection-safe) — and because the branch builds the NULL-robust LEFT JOIN … IS NULL, not NOT IN, the trap from this article is avoided in the generic code from the outset. An integrity check also needs no freely configured raw SQL and so does without the WHERE routine’s trust boundary. The full runner mechanics are in the framework article.

Bridge: the Same in SQL Server

The pattern is not Postgres-specific. In SQL Server, sp_executesql takes the role of EXECUTE format(), and QUOTENAME() protects identifiers instead of %ILEFT JOIN … IS NULL and NOT EXISTS are word for word the same — and just as important: the NOT IN-plus-NULL trap applies identically in SQL Server. There too a single NULL in the subquery turns the whole NOT IN check to UNKNOWN; three-valued logic is SQL standard, not a dialect detail. Structure, error table, and severity gate stay the same; only the tools for building the dynamic SQL differ.

FAQ

Why does NOT IN find no orphaned records?

Because a single NULL in the reference subquery paralyzes the whole check. code NOT IN (…, NULL) contains the comparison code <> NULL, which yields neither TRUE nor FALSE but UNKNOWN — and so the entire AND expression falls to UNKNOWN for every row. No row passes the WHERE, and the check reports zero hits. The fix: use NOT EXISTS or LEFT JOIN … IS NULL, or exclude the NULL from the subquery with WHERE … IS NOT NULL.

LEFT JOIN … IS NULL or NOT EXISTS — which is better?

Both are NULL-robust and return the same result; the planner often even optimizes them into the same anti-join. NOT EXISTS usually reads more directly (“keep the row if no master exists”) and needs no IS NULL condition on the master side. LEFT JOIN … IS NULL is a matter of taste and equivalent. The main thing is: not NOT IN.

How do I check a composite foreign key?

With a multi-column join: all key parts in the ON clause, combined with AND. A record is orphaned only when the complete tuple is missing from the master — checking the region alone or the product alone overlooks missing combinations. In the WHERE it suffices to filter on the NULL-filled master side of one key part.

How do I check a self-referencing table?

With a self-join: the same table twice, the second alias as the “parent side.” The ON clause links the foreign-key column (manager_id) to the primary key (employee_id) of the same table alias. Important: root nodes with manager_id IS NULL are legitimate and not orphans — exclude them with WHERE manager_id IS NOT NULL.

Does this work in SQL Server too?

Yes. LEFT JOIN … IS NULL and NOT EXISTS are word for word the same, sp_executesql replaces EXECUTE format(), and QUOTENAME() replaces %I. And the NOT IN-plus-NULL trap applies there just the same — three-valued logic is SQL standard, not a Postgres specialty.