Data quality in an ETL process

There is a lot of good literature on the subject of data quality. As a rule, the topic is treated rather theoretically: its importance is argued with impressive examples and backed by theory. However, I have not yet found any concrete instructions on how exactly to implement the checking and handling of bad data in an ETL process.

What appears to be an easy task turns out to be challenging on closer inspection.

There are essentially two aspects that need to be taken into account: From a technical point of view, it must first be checked whether the extracted data can be converted into the data types of the target system and whether it satisfies any restrictions such as value ranges and notation. From a business perspective, it is more about ensuring the completeness, consistency and currency of the data. These properties are often summarized under the term data integrity.

Based on this distinction, I summarize these two aspects under the terms technical data quality and data integrity.
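
To make the distinction concrete, here is a minimal sketch in Python. The article does not prescribe a technology, so the language, the field names (amount, booking_date, customer_id) and the helper names check_technical and check_integrity are purely illustrative assumptions:

```python
from datetime import date, datetime

# Technical data quality: can the raw value be converted to the target
# data type, and does it respect restrictions such as value range and notation?
def check_technical(raw_amount: str, raw_booking_date: str) -> list[str]:
    errors = []
    try:
        amount = float(raw_amount)           # convertible to the target type?
        if amount < 0:                        # within the allowed value range?
            errors.append("amount must not be negative")
    except ValueError:
        errors.append(f"amount '{raw_amount}' is not a valid number")
    try:
        datetime.strptime(raw_booking_date, "%Y-%m-%d")  # expected notation?
    except ValueError:
        errors.append(f"booking_date '{raw_booking_date}' is not in ISO format")
    return errors

# Data integrity: is the record complete, consistent with reference data,
# and up to date / plausible?
def check_integrity(record: dict, known_customer_ids: set[str]) -> list[str]:
    errors = []
    if not record.get("customer_id"):                       # completeness
        errors.append("customer_id is missing")
    elif record["customer_id"] not in known_customer_ids:   # consistency
        errors.append(f"unknown customer_id '{record['customer_id']}'")
    booking = record.get("booking_date")
    if booking and booking > date.today().isoformat():      # plausibility
        errors.append("booking_date lies in the future")
    return errors
```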

Checking data integrity and technical data quality ensures that the delivered data can be processed safely and loaded into the target system. Technical errors and data integrity errors can cause the load of the processed data into the target system to fail and, in the worst case, abort the entire ETL process. An ETL process should therefore be designed to proactively identify, log and handle technical data quality and data integrity errors. If in doubt, incorrect data should not be loaded into the target system. In order to develop such an ETL process, the following must be taken into account (see the sketch after the list):

  • Check technical data quality
  • Check data integrity
  • Log all errors found
  • Mark erroneous data
  • Exclude erroneous data from further processing

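How these steps can fit together at row level is sketched below, reusing the two check functions from the sketch above. Logging to a standard Python logger and marking rows with an is_erroneous flag are my own assumptions for illustration, not a fixed part of the design pattern presented in this series:

```python
import logging

logger = logging.getLogger("etl.quality")

def process_batch(rows: list[dict], known_customer_ids: set[str]) -> list[dict]:
    """Validate a batch of extracted rows; return only rows that are safe to load."""
    clean_rows = []
    for row in rows:
        errors = (check_technical(row.get("amount", ""), row.get("booking_date", ""))
                  + check_integrity(row, known_customer_ids))
        if errors:
            # Log every error that was found ...
            for message in errors:
                logger.warning("row %s rejected: %s", row.get("id"), message)
            # ... mark the row as erroneous (e.g. for an error report or a later reload) ...
            row["is_erroneous"] = True
            row["error_messages"] = "; ".join(errors)
            # ... and exclude it from further processing instead of loading it.
            continue
        clean_rows.append(row)
    return clean_rows
```
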
In this series of articles, I will present a design pattern for an ETL process in which the tasks mentioned above can be implemented with a reasonable amount of effort.
