Find Duplicate Rows in SQL: COUNT(*)

GROUP BY … HAVING count(*) > 1 may be the most-typed SQL snippet in the world — and yet it trips up in two places almost nobody has on their radar: composite keys and NULL. When it comes to missing values in a unique column, SQL Server and Postgres disagree — one allows exactly one NULL, the other any number. Anyone trying to find duplicate rows without knowing this suddenly reports “duplicates” after a migration that were perfectly legal before.

The essentials up front:

The basic form GROUP BY … HAVING count(*) > 1 finds duplicate keys — but it’s really a cardinality check: a key may occur at most n times (> 1 for true uniqueness, > n for set rules).
Composite keys are checked with a multi-column GROUP BY — the completeness of the column list decides, not its order.
The NULL trap: GROUP BY collapses all NULL into one group and reports it as a duplicate — while Postgres by default allows multiple NULL in a UNIQUE column and SQL Server exactly one. That difference is migration-relevant.
A list of keys isn’t enough: for the error table you log every affected occurrence, not just the first.

Prerequisite: Postgres as the example engine and the shared checking scaffold from the framework article — a central error table dq.error into which every violation writes a row, and a staging layer that the checks run against. This article deepens the second of the three routines from there: the uniqueness check.

Finding Duplicates: the Basic Form
From Uniqueness to Cardinality: At Most n Occurrences
Composite Keys
The NULL Trap in UNIQUE: SQL Server vs. Postgres
Logging Every Occurrence, Not Just One
Performance: Here the Index Helps
Generic via the Config Routine
Bridge: the Same in SQL Server
FAQ
Related Articles

Finding Duplicates: the Basic Form

Anyone who wants to find duplicate rows almost always starts here: you group the table by the business key and keep only the groups that contain more than one row. For the rule “every customer number may appear at most once” it looks like this:

  1: SELECT
  2:    customer_id
  3: FROM
  4:    staging.customer
  5: GROUP BY
  6:    customer_id
  7: HAVING
  8:    count(*) > 1;          -- 1

  1: SELECT
  2:    customer_id
  3: FROM
  4:    staging.customer
  5: GROUP BY
  6:    customer_id
  7: HAVING
  8:    count(*) > 1;          -- 1

The result is customer number 1 — it appears twice in the table. What’s important is what does not come back here: not the duplicate rows, but the duplicate keys. Which concrete records are affected is a second question — see the section Logging Every Occurrence.

The business key here is exactly the column that identifies the record — the same one that carries the UNIQUE constraint at the target and lands in the error table as id1_column. This is the uniqueness dimension: does every key occur only as often as it’s allowed to?

From Uniqueness to Cardinality: At Most n Occurrences

The > 1 is not a law of nature but a parameter. “Unique” means “at most once” — but some rules allow a key more than once, just not arbitrarily often. A promo code may be redeemable at most three times, say:

  1: SELECT
  2:    promo_code
  3: FROM
  4:    staging.promo_redemption
  5: GROUP BY
  6:    promo_code
  7: HAVING
  8:    count(*) > 3;          -- SAVE10  (redeemed 4x, the limit is 3)

  1: SELECT
  2:    promo_code
  3: FROM
  4:    staging.promo_redemption
  5: GROUP BY
  6:    promo_code
  7: HAVING
  8:    count(*) > 3;          -- SAVE10  (redeemed 4x, the limit is 3)

“Uniqueness” thus becomes a maximum cardinality: a key may occur at most n times. The 3 is just another value for n — > 1 is the special case of true uniqueness. In the generic scaffold this n lives in a configuration column (max_occurrence), not in code. So one routine covers both hard uniqueness and softer set rules.

Composite Keys

The business key is rarely a single column. An order line is unique only through order_id and product_id; the same order must not carry the same product as a line twice. The check then groups over all columns of the key:

  1: SELECT
  2:     order_id
  3:    ,product_id
  4: FROM
  5:    staging.order_item
  6: GROUP BY
  7:     order_id
  8:    ,product_id
  9: HAVING
 10:    count(*) > 1;          -- 1000 | A-1

  1: SELECT
  2:     order_id
  3:    ,product_id
  4: FROM
  5:    staging.order_item
  6: GROUP BY
  7:     order_id
  8:    ,product_id
  9: HAVING
 10:    count(*) > 1;          -- 1000 | A-1

What’s decisive is the completeness of the column list, not its order: GROUP BY order_id, product_id and GROUP BY product_id, order_id produce the same groups. Forget a key column, and you check a different, coarser key — reporting either too many or too few duplicates. The most common mistake is to group only by order_id and then wonder why every multi-line order shows up as a “duplicate”.

The NULL Trap in UNIQUE: SQL Server vs. Postgres

So far count(*) has been harmless. The moment the checked column may contain NULL, though, things get tricky. Take an optional but unique tax id, tax_id:

  1: SELECT
  2:     tax_id
  3:    ,count(*) AS n
  4: FROM
  5:    staging.customer
  6: GROUP BY
  7:    tax_id
  8: HAVING
  9:    count(*) > 1;
 10:-- DE0002 | 2     real duplicate (two customers, same tax id)
 11:-- NULL   | 2     two MISSING tax ids — in ONE group

  1: SELECT
  2:     tax_id
  3:    ,count(*) AS n
  4: FROM
  5:    staging.customer
  6: GROUP BY
  7:    tax_id
  8: HAVING
  9:    count(*) > 1;
 10:-- DE0002 | 2     real duplicate (two customers, same tax id)
 11:-- NULL   | 2     two MISSING tax ids — in ONE group

GROUP BY throws all NULL into one group — not because two NULL were equal (NULL = NULL yields unknown), but because grouping treats missing values as belonging together. Two customers without a tax id therefore appear as a “duplicate”, even though both are simply missing the value. That’s not a uniqueness problem but a completeness one — and belongs in a separate check (see Validating Data with SQL).

Now comes the part that hurts during a migration. What a UNIQUE constraint does with NULL is not the same across engines:

  1: -- Postgres default: multiple NULLs are allowed in a UNIQUE column
  2: CREATE TABLE t_default (tax_id text UNIQUE);
  3: INSERT INTO t_default VALUES (NULL), (NULL);   -- OK: two NULLs side by side
  4:
  5: -- Postgres 15+: treat NULL like an ordinary value
  6: CREATE TABLE t_strict (tax_id text UNIQUE NULLS NOT DISTINCT);
  7: INSERT INTO t_strict VALUES (NULL), (NULL);    -- ERROR: second NULL collides

  1: -- Postgres default: multiple NULLs are allowed in a UNIQUE column
  2: CREATE TABLE t_default (tax_id text UNIQUE);
  3: INSERT INTO t_default VALUES (NULL), (NULL);   -- OK: two NULLs side by side
  4:
  5: -- Postgres 15+: treat NULL like an ordinary value
  6: CREATE TABLE t_strict (tax_id text UNIQUE NULLS NOT DISTINCT);
  7: INSERT INTO t_strict VALUES (NULL), (NULL);    -- ERROR: second NULL collides

Postgres by default regards two NULL as distinct (NULLS DISTINCT) — any number of missing values is allowed in a UNIQUE column. SQL Server does the opposite: there, a UNIQUE constraint admits exactly one NULL, the second is rejected as a duplicate. SQL Server’s UNIQUE constraint thus behaves like Postgres’ NULLS NOT DISTINCT (from version 15) — not like the Postgres default.

The migration consequence is concrete: a Postgres table with three legally coexisting NULL in a UNIQUE column cannot be lifted to SQL Server one to one — there the second and third NULL collide. To check for this in advance, you have to decide which semantics you’re reproducing. For a Postgres target with default UNIQUE, you exclude missing keys from the check:

  1: SELECT
  2:    tax_id
  3: FROM
  4:    staging.customer
  5: WHERE
  6:    tax_id IS NOT NULL
  7: GROUP BY
  8:    tax_id
  9: HAVING
 10:    count(*) > 1;          -- only DE0002

  1: SELECT
  2:    tax_id
  3: FROM
  4:    staging.customer
  5: WHERE
  6:    tax_id IS NOT NULL
  7: GROUP BY
  8:    tax_id
  9: HAVING
 10:    count(*) > 1;          -- only DE0002

For a SQL Server target (or a Postgres target with NULLS NOT DISTINCT) you drop the WHERE — then the second NULL is rightly a finding. As an aside: count(*) counts all rows of the group, count(tax_id) would ignore the NULL and reach 0 for the NULL group. For the cardinality check count(*) is the right counter — you do want to know how many rows share the key.

Logging Every Occurrence, Not Just One

Now to the open question from the basic form: it delivers only the duplicate keys — for the error table you need every affected row, otherwise half the records go missing from dq.error. The direct path is a back-join of the duplicate keys onto the table. As a separate, named step it reads most clearly as a Common Table Expression (CTE) — the duplicate keys get a name that’s then joined back:

  1: WITH
  2: CTE_duplicate_keys AS
  3: (
  4:    SELECT
  5:       customer_id
  6:    FROM
  7:       staging.customer
  8:    GROUP BY
  9:       customer_id
 10:    HAVING
 11:       count(*) > 1
 12: )
 13: SELECT
 14:    T01.*
 15: FROM
 16:    staging.customer T01
 17:    INNER JOIN CTE_duplicate_keys T02
 18:    ON
 19:      T02.customer_id = T01.customer_id;

  1: WITH
  2: CTE_duplicate_keys AS
  3: (
  4:    SELECT
  5:       customer_id
  6:    FROM
  7:       staging.customer
  8:    GROUP BY
  9:       customer_id
 10:    HAVING
 11:       count(*) > 1
 12: )
 13: SELECT
 14:    T01.*
 15: FROM
 16:    staging.customer T01
 17:    INNER JOIN CTE_duplicate_keys T02
 18:    ON
 19:      T02.customer_id = T01.customer_id;

More elegant — and without the explicit back-join — is a window function. count(*) OVER (PARTITION BY …) attaches the group size to every row without aggregating:

  1: SELECT
  2:     customer_id
  3:    ,email
  4:    ,count(*) OVER (PARTITION BY customer_id) AS n_total
  5: FROM
  6:    staging.customer;
  7: -- every row keeps its identity; n_total = size of its key group

  1: SELECT
  2:     customer_id
  3:    ,email
  4:    ,count(*) OVER (PARTITION BY customer_id) AS n_total
  5: FROM
  6:    staging.customer;
  7: -- every row keeps its identity; n_total = size of its key group

Everything with n_total > 1 is a duplicate occurrence. If you additionally need a “first” row (to keep it and mark the rest, say), ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY …) gives the running number within the group. For pure logging, though, every affected record counts equally — n_total > 1 is enough, and each of those rows becomes an entry in dq.error.

Performance: Here the Index Helps

Unlike the free predicates of the WHERE routine, the uniqueness check is index-friendly. A suitable index on the key can let the optimizer choose a GroupAggregate over already-sorted data, fed by an index scan, instead of a HashAggregate (where the whole table goes into a hash table). You can’t force it — the planner decides by cost and, for smaller volumes, often reaches for the HashAggregate anyway. On large tables, though, the index can tip the balance.

Still, at its core it remains a pass over all rows — you can’t spot duplicates without having seen every value once. The same lever as in any staging check therefore applies here too: a load marker (load_id, batch_id) confines the check to the currently loaded batch. That way the entire dataset doesn’t have to be regrouped every time. And if the target carries a UNIQUE index on the key anyway, the upfront check is exactly the work the database would do during the load regardless — just in the source, where a violation is reportable instead of fatal.

Generic via the Config Routine

Hard-coded, this check isn’t practical either. It becomes generic via a configuration row from which the framework article’s runner builds the GROUP BY … HAVING count(*) > max_occurrence at runtime and joins the hits back onto the table:

  1: INSERT INTO dq.check_rule
  2:    (check_type, schema_name, table_name, id1_column, check_column, max_occurrence, severity, message)
  3: VALUES
  4:     ('unique', 'staging', 'customer',         'customer_id', 'customer_id', 1, 'E', 'customer_id not unique')
  5:    ,('unique', 'staging', 'promo_redemption', 'promo_code',  'promo_code',  3, 'W', 'Promo redeemed too often');

  1: INSERT INTO dq.check_rule
  2:    (check_type, schema_name, table_name, id1_column, check_column, max_occurrence, severity, message)
  3: VALUES
  4:     ('unique', 'staging', 'customer',         'customer_id', 'customer_id', 1, 'E', 'customer_id not unique')
  5:    ,('unique', 'staging', 'promo_redemption', 'promo_code',  'promo_code',  3, 'W', 'Promo redeemed too often');

check_type = 'unique' selects the uniqueness branch in the runner; max_occurrence supplies the n. Schema, table and column names are inserted via format() with %I (injection-safe) — unlike the WHERE routine, a uniqueness check needs no freely configured raw SQL and so does without the trust boundary required there. Composite keys are covered by the runner via the further key columns (id2_column, id3_column). The full runner mechanics are in the framework article.

Bridge: the Same in SQL Server

The pattern isn’t Postgres-specific. In SQL Server sp_executesql takes the role of EXECUTE format(), and QUOTENAME() protects identifiers instead of %I. GROUP BY … HAVING count(*) > 1, the back-join and ROW_NUMBER() OVER (PARTITION BY …) are word for word the same — window functions are standard in both engines.

The one real difference is the NULL semantics from the section above: SQL Server’s UNIQUE constraint admits exactly one NULL. To allow multiple missing values there, you need a filtered unique index (CREATE UNIQUE INDEX … WHERE tax_id IS NOT NULL) — the counterpart to Postgres’ default behavior. Structure, error table and severity gate stay identical; only the dialect tools and this one semantic difference vary.

FAQ

How do I find duplicates across multiple columns?

With a multi-column GROUP BY: put all columns of the business key in the GROUP BY list, then HAVING count(*) > 1. The order of the columns doesn’t matter, their completeness does — leave a key column out and you check a coarser key and get false hits. For (order_id, product_id) you group over both columns, not just order_id.

Why does Postgres allow multiple NULLs in a UNIQUE column?

Because Postgres by default regards two NULL as distinct (NULLS DISTINCT): a missing value is not equal to another missing value. From Postgres 15 on, UNIQUE NULLS NOT DISTINCT reverses that and allows only one NULL — exactly the behavior SQL Server shows out of the box.

How do I log all duplicates, not just the first?

The basic form delivers only the duplicate keys. To get every affected row, join the key list back onto the table or use count(*) OVER (PARTITION BY key) and filter on > 1. That way every occurrence lands in the error table — important, because otherwise half the duplicate rows go unlogged.

Finding duplicates vs. deleting them — what’s the difference?

This article only finds: it reports duplicates transparently into the error table and leaves the data unchanged. Deleting (for example via DELETE with ROW_NUMBER()) is remediation — a deliberately separate follow-up layer that the cluster does not cover. First find and classify, then decide what happens to the hits.

Does this work in SQL Server too?

At its core, yes. GROUP BY … HAVING count(*) > 1, the back-join and ROW_NUMBER() are identical, sp_executesql replaces EXECUTE format(), QUOTENAME() replaces %I. The one difference: SQL Server’s UNIQUE admits exactly one NULL; multiple missing values need a filtered index WHERE column IS NOT NULL.

Checking Data Quality with SQL — the framework whose uniqueness routine this article deepens: shared error table, severity gate, the dynamic runner.
Validating Data with SQL — the sibling routine for validity and completeness, including the NULL trap in the WHERE clause.
Postgres Table Conventions — how UNIQUE and key constraints are cleanly modeled at the target schema.
Finding Orphaned Records with SQL — the sibling routine for referential integrity: LEFT JOIN … IS NULL vs. NOT EXISTS vs. NOT IN and the NOT-IN-plus-NULL trap.
Data Quality: Dimensions and Error Classes — the concept frame for the data-quality dimensions: error classes, the criteria canon, and the coverage map where uniqueness has its place.

Finding Duplicates with SQL — From COUNT(*) > 1 to Composite Keys

Contents

Finding Duplicates: the Basic Form

From Uniqueness to Cardinality: At Most n Occurrences

Composite Keys

The NULL Trap in UNIQUE: SQL Server vs. Postgres

Logging Every Occurrence, Not Just One

Performance: Here the Index Helps

Generic via the Config Routine

Bridge: the Same in SQL Server

FAQ

Contents

Finding Duplicates: the Basic Form

From Uniqueness to Cardinality: At Most n Occurrences

Composite Keys

The NULL Trap in UNIQUE: SQL Server vs. Postgres

Logging Every Occurrence, Not Just One

Performance: Here the Index Helps

Generic via the Config Routine

Bridge: the Same in SQL Server

FAQ

Related Articles