GROUP BY … HAVING count(*) > 1 may be the most-typed SQL snippet in the world — and yet it trips up in two places almost nobody has on their radar: composite keys and NULL. When it comes to missing values in a unique column, SQL Server and Postgres disagree — one allows exactly one NULL, the other any number. Anyone trying to find duplicate rows without knowing this suddenly reports “duplicates” after a migration that were perfectly legal before.
The essentials up front:
- The basic form
GROUP BY … HAVING count(*) > 1finds duplicate keys — but it’s really a cardinality check: a key may occur at most n times (> 1for true uniqueness,> nfor set rules). - Composite keys are checked with a multi-column
GROUP BY— the completeness of the column list decides, not its order. - The NULL trap:
GROUP BYcollapses allNULLinto one group and reports it as a duplicate — while Postgres by default allows multipleNULLin aUNIQUEcolumn and SQL Server exactly one. That difference is migration-relevant. - A list of keys isn’t enough: for the error table you log every affected occurrence, not just the first.
Prerequisite: Postgres as the example engine and the shared checking scaffold from the framework article — a central error table dq.error into which every violation writes a row, and a staging layer that the checks run against. This article deepens the second of the three routines from there: the uniqueness check.
Contents
- Finding Duplicates: the Basic Form
- From Uniqueness to Cardinality: At Most n Occurrences
- Composite Keys
- The NULL Trap in UNIQUE: SQL Server vs. Postgres
- Logging Every Occurrence, Not Just One
- Performance: Here the Index Helps
- Generic via the Config Routine
- Bridge: the Same in SQL Server
- FAQ
- Related Articles
Finding Duplicates: the Basic Form
Anyone who wants to find duplicate rows almost always starts here: you group the table by the business key and keep only the groups that contain more than one row. For the rule “every customer number may appear at most once” it looks like this:
1: SELECT
2: customer_id
3: FROM
4: staging.customer
5: GROUP BY
6: customer_id
7: HAVING
8: count(*) > 1; -- 1
The result is customer number 1 — it appears twice in the table. What’s important is what does not come back here: not the duplicate rows, but the duplicate keys. Which concrete records are affected is a second question — see the section Logging Every Occurrence.
The business key here is exactly the column that identifies the record — the same one that carries the UNIQUE constraint at the target and lands in the error table as id1_column. This is the uniqueness dimension: does every key occur only as often as it’s allowed to?
From Uniqueness to Cardinality: At Most n Occurrences
The > 1 is not a law of nature but a parameter. “Unique” means “at most once” — but some rules allow a key more than once, just not arbitrarily often. A promo code may be redeemable at most three times, say:
1: SELECT
2: promo_code
3: FROM
4: staging.promo_redemption
5: GROUP BY
6: promo_code
7: HAVING
8: count(*) > 3; -- SAVE10 (redeemed 4x, the limit is 3)
“Uniqueness” thus becomes a maximum cardinality: a key may occur at most n times. The 3 is just another value for n — > 1 is the special case of true uniqueness. In the generic scaffold this n lives in a configuration column (max_occurrence), not in code. So one routine covers both hard uniqueness and softer set rules.
Composite Keys
The business key is rarely a single column. An order line is unique only through order_id and product_id; the same order must not carry the same product as a line twice. The check then groups over all columns of the key:
1: SELECT
2: order_id
3: ,product_id
4: FROM
5: staging.order_item
6: GROUP BY
7: order_id
8: ,product_id
9: HAVING
10: count(*) > 1; -- 1000 | A-1
What’s decisive is the completeness of the column list, not its order: GROUP BY order_id, product_id and GROUP BY product_id, order_id produce the same groups. Forget a key column, and you check a different, coarser key — reporting either too many or too few duplicates. The most common mistake is to group only by order_id and then wonder why every multi-line order shows up as a “duplicate”.
The NULL Trap in UNIQUE: SQL Server vs. Postgres
So far count(*) has been harmless. The moment the checked column may contain NULL, though, things get tricky. Take an optional but unique tax id, tax_id:
1: SELECT
2: tax_id
3: ,count(*) AS n
4: FROM
5: staging.customer
6: GROUP BY
7: tax_id
8: HAVING
9: count(*) > 1;
10:-- DE0002 | 2 real duplicate (two customers, same tax id)
11:-- NULL | 2 two MISSING tax ids — in ONE group
GROUP BY throws all NULL into one group — not because two NULL were equal (NULL = NULL yields unknown), but because grouping treats missing values as belonging together. Two customers without a tax id therefore appear as a “duplicate”, even though both are simply missing the value. That’s not a uniqueness problem but a completeness one — and belongs in a separate check (see Validating Data with SQL).
Now comes the part that hurts during a migration. What a UNIQUE constraint does with NULL is not the same across engines:
1: -- Postgres default: multiple NULLs are allowed in a UNIQUE column
2: CREATE TABLE t_default (tax_id text UNIQUE);
3: INSERT INTO t_default VALUES (NULL), (NULL); -- OK: two NULLs side by side
4:
5: -- Postgres 15+: treat NULL like an ordinary value
6: CREATE TABLE t_strict (tax_id text UNIQUE NULLS NOT DISTINCT);
7: INSERT INTO t_strict VALUES (NULL), (NULL); -- ERROR: second NULL collides
Postgres by default regards two NULL as distinct (NULLS DISTINCT) — any number of missing values is allowed in a UNIQUE column. SQL Server does the opposite: there, a UNIQUE constraint admits exactly one NULL, the second is rejected as a duplicate. SQL Server’s UNIQUE constraint thus behaves like Postgres’ NULLS NOT DISTINCT (from version 15) — not like the Postgres default.
The migration consequence is concrete: a Postgres table with three legally coexisting NULL in a UNIQUE column cannot be lifted to SQL Server one to one — there the second and third NULL collide. To check for this in advance, you have to decide which semantics you’re reproducing. For a Postgres target with default UNIQUE, you exclude missing keys from the check:
1: SELECT
2: tax_id
3: FROM
4: staging.customer
5: WHERE
6: tax_id IS NOT NULL
7: GROUP BY
8: tax_id
9: HAVING
10: count(*) > 1; -- only DE0002
For a SQL Server target (or a Postgres target with NULLS NOT DISTINCT) you drop the WHERE — then the second NULL is rightly a finding. As an aside: count(*) counts all rows of the group, count(tax_id) would ignore the NULL and reach 0 for the NULL group. For the cardinality check count(*) is the right counter — you do want to know how many rows share the key.
Logging Every Occurrence, Not Just One
Now to the open question from the basic form: it delivers only the duplicate keys — for the error table you need every affected row, otherwise half the records go missing from dq.error. The direct path is a back-join of the duplicate keys onto the table. As a separate, named step it reads most clearly as a Common Table Expression (CTE) — the duplicate keys get a name that’s then joined back:
1: WITH
2: CTE_duplicate_keys AS
3: (
4: SELECT
5: customer_id
6: FROM
7: staging.customer
8: GROUP BY
9: customer_id
10: HAVING
11: count(*) > 1
12: )
13: SELECT
14: T01.*
15: FROM
16: staging.customer T01
17: INNER JOIN CTE_duplicate_keys T02
18: ON
19: T02.customer_id = T01.customer_id;
More elegant — and without the explicit back-join — is a window function. count(*) OVER (PARTITION BY …) attaches the group size to every row without aggregating:
1: SELECT
2: customer_id
3: ,email
4: ,count(*) OVER (PARTITION BY customer_id) AS n_total
5: FROM
6: staging.customer;
7: -- every row keeps its identity; n_total = size of its key group
Everything with n_total > 1 is a duplicate occurrence. If you additionally need a “first” row (to keep it and mark the rest, say), ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY …) gives the running number within the group. For pure logging, though, every affected record counts equally — n_total > 1 is enough, and each of those rows becomes an entry in dq.error.
Performance: Here the Index Helps
Unlike the free predicates of the WHERE routine, the uniqueness check is index-friendly. A suitable index on the key can let the optimizer choose a GroupAggregate over already-sorted data, fed by an index scan, instead of a HashAggregate (where the whole table goes into a hash table). You can’t force it — the planner decides by cost and, for smaller volumes, often reaches for the HashAggregate anyway. On large tables, though, the index can tip the balance.
Still, at its core it remains a pass over all rows — you can’t spot duplicates without having seen every value once. The same lever as in any staging check therefore applies here too: a load marker (load_id, batch_id) confines the check to the currently loaded batch. That way the entire dataset doesn’t have to be regrouped every time. And if the target carries a UNIQUE index on the key anyway, the upfront check is exactly the work the database would do during the load regardless — just in the source, where a violation is reportable instead of fatal.
Generic via the Config Routine
Hard-coded, this check isn’t practical either. It becomes generic via a configuration row from which the framework article’s runner builds the GROUP BY … HAVING count(*) > max_occurrence at runtime and joins the hits back onto the table:
1: INSERT INTO dq.check_rule
2: (check_type, schema_name, table_name, id1_column, check_column, max_occurrence, severity, message)
3: VALUES
4: ('unique', 'staging', 'customer', 'customer_id', 'customer_id', 1, 'E', 'customer_id not unique')
5: ,('unique', 'staging', 'promo_redemption', 'promo_code', 'promo_code', 3, 'W', 'Promo redeemed too often');
check_type = 'unique' selects the uniqueness branch in the runner; max_occurrence supplies the n. Schema, table and column names are inserted via format() with %I (injection-safe) — unlike the WHERE routine, a uniqueness check needs no freely configured raw SQL and so does without the trust boundary required there. Composite keys are covered by the runner via the further key columns (id2_column, id3_column). The full runner mechanics are in the framework article.
Bridge: the Same in SQL Server
The pattern isn’t Postgres-specific. In SQL Server sp_executesql takes the role of EXECUTE format(), and QUOTENAME() protects identifiers instead of %I. GROUP BY … HAVING count(*) > 1, the back-join and ROW_NUMBER() OVER (PARTITION BY …) are word for word the same — window functions are standard in both engines.
The one real difference is the NULL semantics from the section above: SQL Server’s UNIQUE constraint admits exactly one NULL. To allow multiple missing values there, you need a filtered unique index (CREATE UNIQUE INDEX … WHERE tax_id IS NOT NULL) — the counterpart to Postgres’ default behavior. Structure, error table and severity gate stay identical; only the dialect tools and this one semantic difference vary.
FAQ
With a multi-column GROUP BY: put all columns of the business key in the GROUP BY list, then HAVING count(*) > 1. The order of the columns doesn’t matter, their completeness does — leave a key column out and you check a coarser key and get false hits. For (order_id, product_id) you group over both columns, not just order_id.
Because Postgres by default regards two NULL as distinct (NULLS DISTINCT): a missing value is not equal to another missing value. From Postgres 15 on, UNIQUE NULLS NOT DISTINCT reverses that and allows only one NULL — exactly the behavior SQL Server shows out of the box.
The basic form delivers only the duplicate keys. To get every affected row, join the key list back onto the table or use count(*) OVER (PARTITION BY key) and filter on > 1. That way every occurrence lands in the error table — important, because otherwise half the duplicate rows go unlogged.
This article only finds: it reports duplicates transparently into the error table and leaves the data unchanged. Deleting (for example via DELETE with ROW_NUMBER()) is remediation — a deliberately separate follow-up layer that the cluster does not cover. First find and classify, then decide what happens to the hits.
At its core, yes. GROUP BY … HAVING count(*) > 1, the back-join and ROW_NUMBER() are identical, sp_executesql replaces EXECUTE format(), QUOTENAME() replaces %I. The one difference: SQL Server’s UNIQUE admits exactly one NULL; multiple missing values need a filtered index WHERE column IS NOT NULL.
Related Articles
- Checking Data Quality with SQL — the framework whose uniqueness routine this article deepens: shared error table, severity gate, the dynamic runner.
- Validating Data with SQL — the sibling routine for validity and completeness, including the NULL trap in the WHERE clause.
- Postgres Table Conventions — how
UNIQUEand key constraints are cleanly modeled at the target schema. - Referential integrity (finding orphaned records) is the third routine of the series, and the concept frame for the data-quality dimensions — the corresponding articles are coming.