Data Quality in an ETL Process — Catching Technical and Business Errors Before They Reach the Target System

A single value that cannot be converted — a date in the wrong format, a number with the wrong decimal separator — and the entire ETL run aborts. Data quality in an ETL process means catching such errors proactively: identifying, logging and isolating them before they reach the target system. This article is the entry point to a series that implements exactly that as a design pattern.

TL;DR — what this article covers:

  • Technical vs. business data quality — the two aspects every ETL process must handle separately.
  • The five core tasks — check technically, check on business rules, log, flag, exclude.
  • The WHERE-clause principle — how to find data errors systematically with simple queries.
  • Where this fits in the overall pattern — how this overview interlocks with the architecture and the logging articles.

Prerequisite. A basic understanding of ETL processes. This is a conceptual article — not a step-by-step tutorial. The concrete code building blocks are provided by the linked sibling articles of the series.

Contents

Overview: What data quality means in an ETL process

There is a lot of good specialist literature on the subject of data quality. As a rule, the topic is described rather theoretically — with impressive examples. Concrete guidance, however, on exactly how to implement the checking and handling of bad data in an ETL process is rarely available. This series closes that gap.

There are essentially two aspects to consider:

  • Technically, you first check whether the extracted data can be converted into the data types of the target system and whether it complies with constraints such as value ranges and notation.
  • From a business point of view, it is about ensuring the completeness, consistency and timeliness of the data. These properties are often summarized under the term data integrity.

Following this distinction, I group the two aspects under the terms technical data quality and business data quality.

Checking both ensures that the delivered data can be processed safely and loaded into the target system. Technical and business errors can cause a failure during loading and, in the worst case, abort an ETL process. An ETL process should therefore be designed so that technical and business data errors are identified, logged and handled proactively. When in doubt, faulty data is not loaded into the target system. From this, five core tasks emerge:

  1. Checking technical data quality — can the value be converted into the target data type?
  2. Checking business data quality — is the value complete, consistent and plausible?
  3. Logging all errors found in an error table.
  4. Flagging records that were delivered with errors.
  5. Excluding the flagged records from further processing.

With this series of articles, I present a design pattern for an ETL process in which these tasks can be carried out with reasonable effort. The technical foundations are described in the article Design Pattern // The Architecture of an ETL Process.

Checking technical data quality

Checking technical data quality is about ensuring that the data to be processed can be converted into the respective target data types. This is especially necessary when the data — untyped — has to be read from a text file (CSV, XML, JSON) or from EXCEL. The prerequisite is a safe type conversion that does not abort the ETL process. At the same time, the result of the conversion should enable proactive identification of problems.

The basics of safe type conversion are described in the TRY_CONVERT articles. The actual check is then done with simple queries on the converted data: on every column that holds a converted value, a WHERE clause is executed. If the WHERE clause returns records, those are conversion errors. All errors found are logged in an error table.

Typical checks at the technical level:

  • Convertibility into the target data type (datedecimalint, …).
  • Value range — does the number fit the target type without overflowing?
  • Notation/format — phone numbers or postal codes following a defined pattern, for instance.
  • Mandatory field — was a value delivered at all where one is expected?

Checking business data quality

Checking business data quality is, in principle, a complex task. But as so often, quality can be improved considerably with simple measures. A first check can be carried out on the extracted data; further checks are performed after the data has been transformed.

Here, too, the approach described above applies — identification via WHERE clauses and logging of the data errors. Typical business checks:

  • Duplicates — does a key that the delivery contract defines as unique appear more than once?
  • Foreign-key plausibility — does the referenced value exist at all?
  • Business logic — a date of birth must not lie in the future, an order value must not be negative.
  • Cross-field consistency — do dependent values match each other?
  • Timeliness — does the delivery or posting timestamp fall within the expected time window, or is the record stale?

The boundary between technical and business checks is fluid and ultimately a matter of definition. What matters is that both classes of errors are detected, logged and handled separately.

Where this fits in the ETL design pattern

This article is the root of a series that deepens the topic step by step:

  • Data quality (this article) — the why and the what: the two aspects, the five core tasks, the WHERE-clause principle.
  • The Architecture of an ETL Process — the how: work packages, schema layering E0–L2, which check applies at which schema boundary.
  • Logging an ETL Process — the with what: three-tier logging tables and stored procedures that make run, component and action analyzable.
  • Safe type conversion — the concrete code building block for the technical check (see the TRY_CONVERT series below).

If you want to build a robust ETL process from the ground up, read the series in this order.

FAQ

What is the difference between technical and business data quality?

Technical data quality checks whether a delivered value can be converted into the target data type at all and whether it meets formal constraints such as value range or notation — a property of the individual value. Business data quality goes further and checks whether the value is plausible, complete and consistent in content (e.g. a date of birth that does not lie in the future). The two classes are detected and logged separately because they have different causes and different fixes.

Where do I start if I want to add data quality to my ETL process?

With the technical check — it has the best effort-to-benefit ratio. A safe type conversion plus one WHERE clause per converted column already catches the most common causes of aborts. The business checks come afterwards, starting with simple business logic such as mandatory-field and plausibility checks.

How does the WHERE-clause principle work in practice?

After conversion, both the raw value and the converted value are available per column. A WHERE clause finds the records where the conversion failed (the converted value is NULL while the raw value is not empty) or where a business condition is violated. Every hit is an error and is written to an error table. Details and examples are provided by the TRY_CONVERT series.

Does this approach also work with Postgres instead of SQL Server?

The principle is database-agnostic — safe conversion, WHERE-clause checks and error logging work in any relational database. The concrete code in this series is T-SQL (TRY_CONVERT, SSIS). In Postgres, a conversion with exception handling or a CASE expression takes on the role of TRY_CONVERT; the higher-level pattern stays identical.

How does this article relate to the ETL architecture?

This article describes the what and the why of data quality. Where in the process each check applies — at which schema boundary (E0–L2) — is described in The Architecture of an ETL Process. The logging article shows how the errors found are stored in an analyzable form.

Basics

Safe type conversion (TRY_CONVERT series)