Overview
SQL Server Integration Services (SSIS) is a very powerful graphical tool set belonging to the Microsoft BI Stack along with SQL Server, SQL Server Analysis Services (SSAS), SQL Server Reporting Services (SSRS), etc. It supports a broad variety of data migration, integration and transformation tasks. With that, you will find a bunch of good reasons why to utilize this tool set. However, there are a plenty of reasons why not to use SSIS. With Transact SQL (T-SQL) Microsoft provides with respect to transformation tasks an equivalent technique.
This article belongs to a series of articles that are dealing with some important criteria for choosing the right technology(ies) – SSIS and/or T-SQL.
—
Source code management enables the storage of source code in different versions. After a change, the changed file can be saved as a new version. With the changed version of the file, metadata such as the date and time of the change, user ID of the person who made the change, reference to a change request, etc. are also saved. Older versions of a document can be restored with it.
In addition, source code management systems offer additional features that are essential for working in a multi-developer team and also for release management. In short: Source code management is essential for professional software development. Well-known source code management systems in the Microsoft environment are the Team Foundation Server (TFS) or Azure DevOps Server or Git and can be integrated into the Visual Studio development environment.
An important feature of a source code management is the comparison of two versions of a file. A comparison of two versions of the same file determines the differences between the two versions. A use case for this is the 4-eyes principle. One developer works on a document and another carries out a review of the changed document. Only the changes in the new version are relevant for the review and these can be determined by comparing the versions.
However, the comparison of two versions of a file is only helpful, if the changes can both be determined clearly and the meaning interpreted. A comparison of binary files is usually not helpful since is not readable and the meaning behind the modified binary code remains unclear. When comparing two versions of a text file, there is a good chance that both are recognizable. However, comparing text files can be challenging if the data in the text file is stored hierarchically and in a structured manner. SSIS packages are stored in a hierarchical structure in XML files with the extension *.dtsx, while SQL statements are usually saved as (flat) script files with the extension *.sql.
This article describes the comparison of two version of SQL scripts and SSIS packages using three examples.
- Comparison of two versions of a SQL script
- Comparison of two versions of a simple SSIS package
- Comparison of complex development artifacts
Comparison of two versions of a SQL script
The following screenshot shows the result of a comparison of two versions of a SQL script in Visual Studio. Deviations are highlighted in red and green color.
The screenshot shows the SQL statement before the change on the left and the changed SQL statement on the right. In the changed SQL statement, lines that contain a change are highlighted in light green. Changes themselves are highlighted in a bolder green. In the previous version, the corresponding texts are highlighted in red.
To the right of the vertical scroll bar, the changed areas are indicated throughout the document. A good description of the use of the file comparison can be found in the online documentation from Microsoft.
Comparison of two versions of a simple SSIS package
The ETL processes developed with SSIS are saved in SSIS packages with the extension *.dtsx. SSIS packages are saved as XML documents. Microsoft has published the following article in the online documentation with reference to the documentation of SSIS packages: SSIS Package Format and explains among other things:
“In the current release of Integration Services, significant changes were made to the package format (.dtsx file) to make it easier to read the format and to compare packages. You can also more reliably merge packages that don’t contain conflicting changes or changes stored in binary format.”
This quote is taken from the very first (!) section of the article. Microsoft emphasizes that the format has been further developed in order to be able to compare packages more easily, among other things.
The following two screenshots show two versions of a control flow of an SSIS package, in which only the name of the second script task was changed from B SCT script task to D SCT script task. The screenshots were taken from a control flow of a simple SSIS package, that contains some random control flow tasks without a full blown configuration. The example is a simple and therefore only striking example.
Version 1
Version 2
Result of the comparison
The comparison of both versions reveals astonishing things. The change of the name of a script task results in an incredible 8 changed areas of the SSIS package / XML document, which are displayed to the right of the vertical scroll bar.
Lines 60 to 77 represent (among other things) the script task B SCT script task in the previous version. According to the comparison, this script task was renamed to C SCT script task and not to D SCT script task. Lines 78 to 95 represent (among other things) the script task C SCT script task in the previous version. According to the comparison, this script task was renamed D SCT script task.
Reminder: only the name of the script task B SCT script task was changed to D SCT script task.
The comparison simply produces a wrong result.
Comparison of complex development artifacts
This section focusses on a more complex example that could be found in practice in a similar form.
Requirement
In this example, the ranking of the employees in table [AdventureWorksDW2017].[DimEmployee] is to be determined for each hierarchy level along the vacation times and sick times of the employees. Four key figures must be determined for each employee. The methodology of the SQL Server aggregate functions (windowed function) specified for the key figures must be used to calculate the key figures:
- Ranking vacation hours: NTILE(3)
NTILE(3) distributes the number of employees by hierarchy level into three groups of equal size and partitions the employees by their vacation hours. - Ranking vacation hours: DENSE_RANK
Determine the rank of employees by hierarchy level ordered by vacation hours. Employees that have the same amount of vacation hours have the same rank. - Ranking sick leave hours: NTILE(3)
NTILE(3) distributes the number of employees by hierarchy level into three groups of equal size and partitions the employees by their sick leave hours. - Ranking sick leave hours: DENSE_RANK
Determine the rank of employees by hierarchy level ordered by sick leave hours. Employees that have the same amount of sick leave hours have the same rank.
This task will be solved using an SQL statement as well as an SSIS package. After the development of both artifacts will be changed in a way that the grouping criterion is no longer the periods of vacation and illness, but the date of employment of the employee and his date of birth.
There are two challenges to this task:
- Determining the hierarchy of employees
- Determination of the ranking
SQL Statement
Transact-SQL introduces easy-to-use methods for both challenges:
- Common Table Expression (CTE)
- Fensterfunktionen (Windowed Function)
If data is hierarchically organized in a father-child relationship – as it is in the table [DimEmployee] – you can easily query the structure recursively using a CTE:
This statement provides each employee in the [Level] column with the hierarchy level as well as the required key figures along the vacation and sick times. The 254 employees are organized in 5 hierarchical levels.
The comparison of the two versions of the SQL statement has already been shown in the first screenshot. Therefore, there is no further illustration of the amended statement here.
Both statements are available via the following two links:
- post0020_sample3_version_1 (Vacation times /Ill times)
- post0020_sample3_version_2 (Hire date/Birth date)
SSIS Package
The above solution was not too tricky! However, finding an SSIS approach that relies only on SSIS data flow tasks and that is as easy as the above solution was something of a challenge. It took me quite a few hours to develop the below solution. I chose the following:
- The solution requires two tables in the database.
- The data flow 1000 DFT Calculate Levels only determines the hierarchy.
- The data flow 3000 DFT Caculate Ranking only determines the ranking and stores the result in a table.
- The ranking for vacation hours and sick leave hours by hierarchy level will be calculated in two script tasks.
Possibly there is a much better and easier solution for the task.
Control Flow
The control flow of the SSIS package just contains 4 tasks and appears to be simple thing.
Dataflow 1000 DFT Calculate Levels
While T-SQL offers an easy to use approach, SSIS does not support a generic approach for a recursive query nor is there an easy to configure task in the SSIS data flow toolbox available. The determination of the 5 hirarchy levels must be developed level by level:
Dataflow 3000 DFT Caculate Ranking
The rankings will be calculated in two script tasks by comparing consecutive records.
Modifications to the package
After having completed the development of the package, I checked in the package in Team Foundation Server (TFS). The next step was to modify the order criteria for ranking the employees by hierarchy level from holiday / sick leave hours to hiring date / birth date and then check in the modified version of the package.
What were the modifications in detail?
- Modification of field names
- Modification of the data typs of two columns/variables
Despite the larger number of tasks used in the package, the scope of changes has been limited and the work was done within minutes.
Comparing the original and the modified version
A comparison of both versions produced the following result:
Have a look at the right hand part of the vertical scroll bar. The modified spots are highlighted with green/red color. Wow! Those few modifications resulted in quite a few modifications in the underlying XML document.
Source Code
The examples in this article were developed with Visual Studio 2017 and SQL Server 2017. The source code can be downloaded from here:
Summary
While the T-SQL solution was developed within a short period of time, the development of the SSIS package took me several hours.
As there is no Out-Of-The-Box SSIS solution that would meet the requirements I first had to figure out an approach that produces the expected result. In the beginning I was playing around with a single formular expression that calculates the hierarchy level. But after two or three hours I waived the white flag and chose to follow the approach shown in this article at the cost of readability and fexibility of the the SSIS package. I have also shown that minor modifications usually result in major modifications in the underlying XML document. In contrast to the complexity of developing SSIS packages the development and modification of the SQL statement appeared to be rather easy.
Most of that leaves me with the following conclusion:
- SSIS packages are hard to read
- SSIS packages are difficult to maintain
- Comparison of SSIS packages is not possible
SSIS is a very powerful toolset. However, whether or not to use SSIS for data transformation is questionable for most every day requirements. If you are familiar with T-SQL you would probably prefer SQL statements and Stored Procedures over complex SSIS Packages.