Verifying the Migration — Data Quality and Row Reconciliation After the Move

The row counts match — table by table, source against target, all green. And yet the migration isn’t finished. In one column, NULL values have turned into empty strings; in another, the detour through a CSV file rounded the last decimal of an amount; and a handful of accented characters collapsed into question marks. Same count is not the same data — and that’s exactly the gap a plain count(*) comparison never closes.

This article is the verification phase: the “are you actually done?” step after moving from SQL Server to PostgreSQL. It builds the reconciliation up in stages — from cheap row counting through interpretable column aggregates to a row hash for the critical tables — and shows which silent differences only surface along the way.

The key points up front:

  • Row reconciliation is the necessary but not sufficient condition: count(*) per table, as a difference report across all tables at once.
  • Column aggregates (summin/max, count of non-NULL) are the robust, engine-neutral first content check — they show where something drifts, not just that (but prove no equality).
  • Row hashes (HASHBYTES('MD5', ...) ↔ md5(string_agg(...))) condense the entire row content — but only with deterministic ordering and byte-identical normalization.
  • The expensive tables you verify by sampling over key ranges — never by random samples that pull different rows on each side.

Prerequisite: SQL Server as source, PostgreSQL 14+ as target (the examples run against PostgreSQL 16). Aggregate functions (countsummin/max) are assumed; the Postgres specifics (md5string_aggquery_to_xml, catalog estimates) are shown. The actual data transfer (bcpCOPYpgloader) and the type mapping are separate topics — both linked below; here we deal with the target-vs-source comparison that comes afterwards.

Contents

Stage 1: Row Reconciliation

The first check is the cheapest and the most important: does every table have exactly as many rows after the transfer as before? If not, the migration is demonstrably incomplete — no further check needs to run until that difference is resolved.

For a single table it’s trivial and identical on both engines:

  1: -- SQL Server (source)
  2: SELECT count(*) FROM dbo.customer;
  3:
  4: -- PostgreSQL (target)
  5: SELECT count(*) FROM customer;

With three tables you type that by hand. With three hundred you don’t. On the Postgres side, all row counts can be pulled with a single statement — the trick is query_to_xml, which runs a dynamic SELECT count(*) per table and returns the result as XML, from which xpath extracts the number again:

  1: SELECT
  2:     table_name
  3:    ,(xpath(
  4:        '/row/c/text()'
  5:       ,query_to_xml(
  6:           format('SELECT count(*) AS c FROM %I.%I', table_schema, table_name)
  7:          ,false, true, ''
  8:        )
  9:     ))[1]::text::bigint AS row_count
 10: FROM
 11:    information_schema.tables
 12: WHERE
 13:        table_schema = 'public'
 14:    AND table_type   = 'BASE TABLE'
 15: ORDER BY
 16:    table_name;

The result is a list of table_name | row_count across the whole schema. You pull the exact same list on the SQL Server side (there via sys.tables and sys.partitions, for instance), put the two side by side, and look for the rows where the numbers diverge.

One shortcut is worth a glance — and a warning. PostgreSQL keeps an estimated row count per table in the system catalog:

  1: SELECT
  2:     relname            AS table_name
  3:    ,reltuples::bigint  AS estimated_rows
  4: FROM
  5:    pg_class
  6: WHERE
  7:        relkind      = 'r'
  8:    AND relnamespace = 'public'::regnamespace
  9: ORDER BY
 10:    relname;

That’s there in milliseconds, even with billions of rows — but reltuples is an estimate maintained by the planner, and it’s only roughly right after an ANALYZE, and even then not down to the row. For a first smoke test (“roughly the same order of magnitude?”) it’s exactly the right tool; as proof that no record is missing, it is not enough. Verification needs the exact count(*).

And even the exact count(*) is only the necessary condition. It says: the same number of rows. It says nothing about whether those rows hold the same content.

Stage 2: Checksums, Not Just Counts

To compare the content, there are two ways — and the obvious one isn’t the best. Let’s start with the robust one.

Column aggregates. Instead of hashing every row, you compute a handful of aggregates per table, each interpretable on its own:

  1: -- identical on both engines (SQL Server: FROM dbo.customer)
  2: SELECT
  3:     count(*)            AS row_count
  4:    ,count(email)        AS non_null_emails
  5:    ,sum(credit_limit)   AS sum_credit
  6:    ,min(created_at)     AS min_created
  7:    ,max(created_at)     AS max_created
  8: FROM
  9:    customer;

This check is so valuable because it’s engine-neutralsumminmax and count mean the same thing in SQL Server and Postgres, and the result is a small row of numbers you can put side by side directly. If sum_credit is off by a cent, you know immediately: it’s the amounts, not the names. If max_created is off, it’s a date problem. The check shows where, not just that — and that’s what saves the most time when hunting for the cause.

One caveat belongs with it, though: matching aggregates prove no equality. If one value shifts by +10 and another in the same column by −10, the sum stays identical; the same min/max bounds can be produced with entirely different values in between. The aggregates are a very robust first content comparison — the gap-free reconciliation is delivered only by the row hash for the critical tables.

Row hashes. The more thorough way condenses each row into a hash and aggregates all row hashes into one table checksum. (MD5 serves here as a fast checksum, not for cryptographic reasons — the theoretical collision risk is irrelevant for a data reconciliation; a matching hash therefore means “the same content with near-certainty”.) In Postgres:

  1: SELECT
  2:    md5(string_agg(row_hash, '' ORDER BY customer_id)) AS table_hash
  3: FROM (
  4:    SELECT
  5:        customer_id
  6:       ,md5(concat_ws('|'
  7:           ,customer_id::text
  8:          ,coalesce(email, '\N')
  9:          ,coalesce(credit_limit::text, '\N')
 10:        )) AS row_hash
 11:    FROM
 12:       customer
 13: ) T01;

On the SQL Server side you build the same normalized string and hash it with HASHBYTES('MD5', ...). Because both sides use MD5, the hashes are directly comparable — but only if the input is byte-for-byte equal. And there lie three pitfalls that can make any hash comparison a false negative:

  • Ordering. string_agg without ORDER BY concatenates the rows in arbitrary order — the hash is then different on every run. It needs a deterministic ordering over the primary key, the same on both sides.
  • NULL representation and separator. coalesce(..., '\N') replaces NULL with a placeholder that doesn’t occur in the real data. Without this step, ('a', NULL) and (NULL, 'a') are indistinguishable after concatenation. The separator (|) must likewise be a character that appears in no column.
  • Encoding and number/date format. The real cross-engine trap: if HASHBYTES is applied directly to nvarchar, SQL Server hashes the UTF-16 byte sequence, Postgres the UTF-8 one — for non-ASCII data you get different MD5 values even though the text is identical. You avoid it by explicitly serializing to UTF-8 on the source side before hashing. But encoding is only one source of error: credit_limit::text (decimal point!), date formats, trailing spaces from char(n), and implicit casts shift the byte value just as much — the same cosmetic representation differences as in the next section. Plain ASCII therefore does not protect you automatically; the direct MD5 comparison is reliable only with identically normalized serialization on both sides. Where that effort is too high, the value-true column aggregates are the more robust path.

The practical order: count(*) always, column aggregates for all tables, the row hash only for the critical ones — master data, financial transactions, anything where a single wrong row is expensive.

Stage 3: What the Hash Doesn’t Tell You

A hash is binary: equal or unequal. When it fires, you know that something somewhere differs — but not what, not where, and not whether it’s an error at all. Because some differences are expected when moving between two engines, and a bit-exact hash still reports them as a deviation.

The clearest example is a bit column that becomes boolean in Postgres. The value is semantically identical — but serialized, the source yields '1'/'0' and the target 'true'/'false'. A row hash over the text representation therefore differs in every row, even though not a single record was transferred wrongly. The same kind of purely cosmetic difference comes from trailing zeros in decimals (numeric(19, 2) serializes 1000.00numeric(19, 4) serializes 1000.0000), from the millisecond suffix on timestamps (10:00:00.000 vs. 10:00:00), or from padded spaces out of a char(n) column:

  1: -- Same value, different serialization -> different hash
  2: SELECT
  3:     md5('1')         AS bit_source      -- SQL Server: bit
  4:    ,md5('true')      AS boolean_target  -- PostgreSQL: boolean
  5:    ,md5('1000.00')   AS decimal_source  -- numeric(19, 2)
  6:    ,md5('1000.0000') AS decimal_target; -- numeric(19, 4)

All four hashes differ, even though each pair means the same value. Such representation differences have to be normalized before hashing (cast booleans to a single format, round decimals to the same scale, format timestamps identically) — otherwise the one real problem drowns in a hundred cosmetic false alarms. That’s exactly why the value-true column aggregates from Stage 2 and the targeted checks here are often more reliable than a bit-exact hash over the raw representation.

The following checks make the real migration-induced differences visible — the ones where a value actually got lost.

Empty string instead of NULL. The classic transfer trap: if the move runs through CSV, NULL values turn into empty strings depending on the export setting — the row count stays the same, the content doesn’t.

The obvious reflex — “count the empty strings in the target” — falls short here. The source may itself contain legitimate empty strings; a bare empty-string count on the target side has no reference value to prove anything against. Meaningful only is the comparison of both sides, and that separately by NULL and empty string:

  1: -- count NULL and empty string separately — same query on source and target
  2: SELECT
  3:     count(*) - count(email)                  AS null_count
  4:    ,count(CASE WHEN email = '' THEN 1 END)   AS empty_count
  5: FROM
  6:    customer;

count(*) - count(email) is the number of NULL values (count over a column ignores NULL), empty_count the number of empty strings. In a clean migration, both numbers match between source and target. But if the transfer flips NULL into empty string, the ratio shifts: the target has fewer NULL values and exactly that many more empty strings than the source. The sum of the two stays constant — which is why neither the row count nor a plain empty-string counter notices anything. Only the divergence of the two individual values in the side-by-side comparison makes the silent type change visible.

Truncated strings. If a column was dimensioned too tightly (or an nvarchar(max) source mapped onto a limit), the import truncates long values — silently. The clean proof here too is the side comparison: put the maximum string length in source and target next to each other.

  1: -- maximum length per column — compare source against target
  2: SELECT max(length(email)) AS max_email_len
  3: FROM   customer;

If the target maximum is shorter than the source’s, something was truncated. If the source is no longer available after the migration, only the weaker heuristic remains: if the maximum sits exactly on the column width (say, exactly 50), it may be truncation — but it’s no proof, since a genuine value of exactly 50 characters is perfectly legitimate. Then a targeted sample of the boundary values is worth it.

Rounded numbers, shifted dates. A decimal/money mapped to too few decimal places, or a float that took the detour through a text file, can flip on the last digit. Date values are hit in two ways: a high-precision datetime2(7) loses fractions of a second when exported via CSV, and a datetimeoffset moving to timestamptz shifts by hours if the import assumes the wrong source time zone. (The coarse ~3.3 ms granularity of the old datetime type, by contrast, is not a migration difference — that already-rounded value moves across unchanged.) You catch such deviations not via count, but via the sum (for amounts) and the extremes min/max (for date as well as numeric columns) from Stage 2 — and for pinpointing them via a direct difference join, when both databases are reachable at once (e.g. via a foreign data wrapper). Which types flip during the move and which pass through cleanly is covered by the data-type article below; here only this matters: the sum of a money column — or the maximum of a timestamp column — is a sharper sentinel than its row count.

These three checks are not a complete catalog, but the spots where, in experience, something goes wrong first. They share one trait: unlike the hash, they yield an interpretable number with which you can narrow down the cause.

Structural Checks: Foreign Keys and Sequences

Beyond the content, there are two structural checks that belong to every migration — because they find exactly the errors that would otherwise only show up in live operation.

Orphaned foreign keys. If the constraints were disabled during the transfer (a common step to ease the load order), there can be child rows without a matching parent row. A LEFT JOIN with IS NULL on the parent side counts them:

  1: SELECT count(*) AS orphaned_orders
  2: FROM
  3:    orders T01
  4:    LEFT JOIN customer T02
  5:    ON
  6:      T02.customer_id = T01.customer_id
  7: WHERE
  8:    T02.customer_id IS NULL;

The result must be 0. If it is, the constraints can then be cleanly re-VALIDATEd afterwards — the schema article below covers constraint revalidation.

Sequence high value. The most common post-load bug of all: the data is in, but the sequence behind the IDENTITY column wasn’t advanced to the highest value. The next INSERT then assigns a key ID that already exists — and collides.

  1: SELECT
  2:     max(customer_id)                              AS max_id
  3:    ,(SELECT last_value FROM customer_customer_id_seq) AS seq_value;

seq_value must be >= max_id. If the sequence is too low, you set it correctly with setval() — that belongs to the wrap-up of every migration with surrogate keys and is part of schema migration (link below).

Running It Generically: the Data-Quality Framework

The checks above are handwritten — one query per table, per column, per rule. For a manageable schema, that’s exactly right. For hundreds of tables, it turns into busywork that calls for automation.

That’s exactly what the configurable data-quality framework, covered in detail in its own article, is for: check rules sit as rows in a configuration table, a routine builds the concrete checks from them via dynamic SQL and writes the hits into a shared error table. The NULL-rate, length, and range checks from Stage 3 are nothing but generic check rules — instead of typing them out for each column, you store them once as configuration and let them run across all tables.

The verification of a migration is thus a special case of applied data quality: the target state is “identical to the source”, the rules are the same ones you run in live ETL operation. How this framework is built, and how the dynamic rule evaluation works safely (with %I/%L instead of string concatenation), is in the linked article — here it’s enough to note that the recurring checks don’t have to stay handwritten.

Sampling Very Large Tables

A full hash over a billion rows reads the whole table — on both sides. When that gets too expensive, sampling is the pragmatic alternative. But there’s a rule on which most first attempts fail: the sampling must pull exactly the same rows on both engines.

A random sample (TABLESAMPLEORDER BY random()) doesn’t do that — it pulls different rows in Postgres than in SQL Server, and comparing two different samples says nothing. What works is deterministic sampling over the primary key: a reproducible selection that hits the same subset on both sides.

  1: -- every 100th row, chosen deterministically by key
  2: SELECT
  3:    md5(string_agg(customer_id::text, '' ORDER BY customer_id)) AS sample_hash
  4: FROM
  5:    customer
  6: WHERE
  7:    customer_id % 100 = 0;

The same WHERE condition (customer_id % 100 = 0) hits the same keys on the SQL Server side — the two samples are congruent and therefore comparable. Alternatively, you split the key range into blocks (BETWEEN) and hash block by block; that lets you narrow down a damaged region without reading the whole table again.

The honest assessment: sampling reduces the risk, it doesn’t eliminate it. It finds systematic errors reliably — a wrongly mapped type or a pervasive encoding fault hits every sample — whereas a single wrong row between two sample points stays undetected. For the critical tables there’s no way around the full hash; sampling is the tool for the large, uncritical bulk where a full reconciliation is out of proportion to the benefit.

FAQ

Is it enough to count the rows after the migration?

No. The count(*) comparison is the necessary but not the sufficient condition: it proves no row is missing, but not that the rows hold the same content. Rounded amounts, NULL values flipped to empty strings, truncated text, and encoding damage leave the row count untouched. Only column aggregates (summin/max) and row hashes check the content.

How do I compare content between SQL Server and Postgres when the hash functions differ?

Via a normalized representation on both sides. The way is to serialize each row on both sides into an identical string (same separator, same NULL placeholder, same number/date format) and hash that with MD5 — HASHBYTES('MD5', ...) in SQL Server, md5(...) in Postgres. If HASHBYTES is applied directly to nvarchar, the encoding also has to be aligned (SQL Server then hashes UTF-16, Postgres UTF-8), otherwise the hashes differ despite identical text — on the source side, a UTF-8 serialization before hashing helps.

Why don’t my money amounts match after the migration?

Usually it’s the type mapping or the transfer path, not the data itself. A money or decimal mapped to too few decimal places rounds on load; a float that went through a CSV file can flip on the last digit. The sum of the column (Stage 2) shows the deviation immediately, the row count never. Which types convert cleanly and which produce silent differences is covered by the data-type article.

How do I compare floating-point columns (float/real) reliably?

Not on exact equality, but with tolerance. float and real are IEEE 754 on both engines — bit-identical on a binary transfer, but as soon as the path runs through a text representation (CSV), the last significant digit can flip, because a double needs 17 significant digits for a lossless round trip. You then compare the relative error abs(a - b) / abs(a) against a threshold (say 1e-12) or round both sides with round(value::numeric, 10) before hashing. How many digits must match is a business decision, not a technical detail. And beware: even sum() over many floating-point values is order-dependent and thus not bit-stable — that aggregate, too, you compare only “close enough”, never exactly.

How do I verify tables with billions of rows?

With deterministic sampling over the primary key (customer_id % 100 = 0 or key ranges via BETWEEN) — never with random samples that pull different rows on each engine. For the truly critical tables the full hash remains mandatory; sampling covers the large, uncritical bulk and is a risk trade-off, not a proof.

When does a migration count as verified?

When the row count matches exactly per table, the column aggregates agree, the critical tables yield an identical row hash, no orphaned foreign keys exist, and the sequences sit at their high value. Only then is “the data is across” more than an assumption.

This article is part of a series on migrating from SQL Server to PostgreSQL. The other parts:

Because post-load verification is applied data quality: