A Data Engineering Story



What do you do when your clients’ Data Warehouse turns out to be a data dumping ground?

This is based on a use case from a real client and the methods we employed to process their data. As with most ETL pipelines, we were tasked with extracting data from one database system, enriching it with business rules, filtering and processing that data, and finally writing it to a different database so that it could be displayed by a front-end GUI system in the manner most useful to the client.

The Starting Point

The client gave us at Synvert access to their data warehouse. As it was a data warehouse, we were expecting some sort of star or snowflake schema with various fact and dimension tables, or at least some other indication that a record was new or had been updated, which we could use to process only the new or updated data.

Except that is not what we got. Instead, we were given 11 different database table views (a number that would increase as the project progressed). No details or access were provided to show how the views were built, just read access to those views. And to make matters worse, there was no indication of which records were new or updated, so every time we read the data we had to read all of it.

Dealing with the Views

So the first thing to handle was the processing of the views. Each view is different: different names, different column names, different numbers of columns and different data types.

What is required is a simple process that can extract data from any database table and write it to any file. This simple process has to be generic, so that it can process any database view and write the contents to any file location. We can accomplish this by having the simple process run from a series of configuration files. Each configuration file provides the specific details required (for example, database name, table name, table data structure, output filename, etc.) to extract data from one database view and write those contents to one file.

In this way we have one program and many configuration files. This gives us the advantage that if the code were to change (say, by the addition of a second output file) then the code would only need to change once and all the configuration files would remain unchanged.

What is in a configuration file?

Each of the configuration files provides the name of the process that is to be run, along with the required details of the database view to be read, the data structure of that view, and details of the file the data is to be written to.
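For example, a single configuration file might look something like this. The exact format and key names are specific to our ETL tool, so treat this as purely illustrative; names such as generic_view_extract and dwh_prod are made up for this sketch:

[process]
name = generic_view_extract

[source]
database  = dwh_prod
view_name = sap_bom_path
structure = sap_bom_path.structure

[target]
output_file = sap_bom_path.dat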

In this way we can build 11 configuration files – and easily add more when the requirements change – and each one extracts data for a specific database view.

So now we have 11 configuration files that all run the same simple generic process, but with different input values.
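To make the idea concrete, here is a minimal Python sketch of what such a generic process could look like. It is not our actual implementation (that lives inside our ETL tool); it assumes a standard DB-API database connection and the illustrative configuration keys shown above:

import configparser
import csv

def extract_view(config_path, connection):
    # read the configuration that describes this particular extract
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    view = cfg["source"]["view_name"]
    structure_file = cfg["source"]["structure"]
    output_file = cfg["target"]["output_file"]

    # read the whole view every time; the source gives us no change indicator
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM " + view)

    # record the column names (the data structure) for later use
    with open(structure_file, "w") as struct:
        for column in cursor.description:
            struct.write(column[0] + "\n")

    # write every record to the output file, one line per record
    with open(output_file, "w", newline="") as out:
        writer = csv.writer(out, delimiter="|")
        for row in cursor.fetchall():
            writer.writerow(row)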

The next question, though, is how do you run those 11 configuration files? And what happens when 11 files become 15, or even 200?

Welcome to loops

In the ETL tool we use at Synvert, the configuration files and the looping system are implemented in a way specific to that tool, but the basic idea is as described here.

It is very easy in Unix to create a loop that will run for, say, a number of files matching a file pattern. Since all the configuration files described above have a simple and similar naming structure, we can easily use standard Unix commands to find and run each configuration file.

For example:

for f in *config_file_pattern*.config; do ./"$f"; done

Now it does not matter if there are 11, 15 or even 200 configuration files; the same simple Unix command will find and run all of them.

Now we have copied all the data from the views into local files.  What is next?

Dealing with the Files

OK, big deal: so far all we have done is copy all the data from the views in the data warehouse into local files. Why?

Well, having the data in local files now makes it much easier for us to detect updates and new records.

The reason for this is that as the code reads each record from each file as a single line of text, the whole line can be hashed with SHA-256 and the hash value checked against a local file of all previous hashes. If the hash is found in the lookup, then the record has not changed and can be discarded; if the hash is not found, then it is either a new record or an updated record; either way, it is a record we need to keep and process.

Finally, the new hashes are added to the lookup file, ready for the next run.

This is a simple but effective method to determine whether a record is new, updated or unchanged, and, as promised, that is why we drop the views to local files first.
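As a rough illustration of the idea (again, not our ETL-tool implementation; the function and parameter names here are made up), the change detection could be sketched in Python like this:

import hashlib

def detect_changes(extract_path, hash_lookup_path, output_path):
    # load all previously seen hashes (on the first run the lookup does not exist yet)
    try:
        with open(hash_lookup_path) as f:
            known_hashes = set(line.strip() for line in f)
    except FileNotFoundError:
        known_hashes = set()

    new_hashes = []
    with open(extract_path) as extract, open(output_path, "w") as out:
        for line in extract:
            record = line.rstrip("\n")
            digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
            if digest in known_hashes:
                continue                  # unchanged record, discard it
            out.write(record + "\n")      # new or updated record, keep it
            new_hashes.append(digest)

    # remember the new hashes for the next run
    with open(hash_lookup_path, "a") as lookup:
        lookup.writelines(h + "\n" for h in new_hashes)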

But won’t we get hash collisions?

Well, the likelihood of a SHA-256 collision is vanishingly small: with, say, a billion hashed records, the birthday-bound probability is roughly n²/2²⁵⁷ ≈ 4.3×10⁻⁶⁰, so the chance of a collision is far too small to be worried about here.

When each view names the same thing differently

“I named all the stars in the night sky, but I named them all the same. It’s caused awful confusion in Heaven” – Drusilla (Buffy the Vampire Slayer)

We have now extracted all the data from the views and determined which records are new or have been modified. The next problem is that the same data is named differently across the 11 views. Some examples:

Name in the View    Common Name
loc                 location
location            location
Location            location
werks               location
ttnr                ttnr
matnr               ttnr
plan_drp            ttnr

We would like to always process the data using the common name, but as you can see, sometimes 3 or 4 different names are used across the views.

What we do not want is a long set of if (..) else if (..) else statements dotted throughout the code; we want a data-driven method, because in the future there will be more than 11 views and many more similar names to match.

Reducing data to common fields

The solution is to create a data file, maintained by the business team, which describes the original name along with its mapping to the common name for each field in each view.

The development team then use this file along with a second data file that tells them where, in each file, the original name can be found.

A simple generic piece of code was then provided to link those two data files together, so that for each common field name the code would like to use, it could find the original name in the data file and read that value.
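As a sketch of that linking code, assuming the two data files are simple CSVs with hypothetical columns (view_name, original_field, common_field for the mapping and file_name, original_field, position for the lookup), it could look something like this in Python:

import csv

def load_common_name_map(mapping_path):
    # business-maintained mapping: {(view, original field): common field}
    with open(mapping_path, newline="") as f:
        return {(row["view_name"], row["original_field"]): row["common_field"]
                for row in csv.DictReader(f)}

def load_position_lookup(lookup_path):
    # generated lookup: {(file, original field): column position in that file}
    with open(lookup_path, newline="") as f:
        return {(row["file_name"], row["original_field"]): int(row["position"])
                for row in csv.DictReader(f)}

def common_record(fields, view_name, file_name, name_map, positions):
    # return one record keyed by common field names instead of view-specific names
    record = {}
    for (view, original), common in name_map.items():
        if view != view_name:
            continue
        pos = positions.get((file_name, original))
        if pos is not None:
            record[common] = fields[pos]
    return record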

In our ETL tool environment, we have a specific way to generate code automatically given only the output data structure. This automatic coding means that when the output data structure changes, the code is regenerated automatically to match, saving time and effort and reducing risk; all of this leads to better code, simpler testing and better results for our clients.

A short diversion into data structures

Each file we write as an extract from the database views has a specific data structure that describes the names and data types of each column of each database view. When we write these views to files, the same data structure also describes each column of data in the file.

If we store the data structure information for each view in a file, then we can process the information in those files to create a lookup file, as shown below, that gives the position of each column name within the data file.

For example, the field werks is in position 0, root_matnr is in position 1, and so on, for the file named sap_bom_path.dat.

We can use this information to regenerate the lookup file each time the process runs, so that if some development has taken place that changed the order of the fields, the lookup will still show all the field positions correctly.
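For illustration, assuming each view's data structure is stored as a simple text file listing its column names in order (a made-up format for this sketch), regenerating the position lookup on each run could look like this:

import csv
import glob
import os

def build_position_lookup(structure_dir, lookup_path):
    # each "<view>.structure" file lists one column name per line, in the order
    # the columns appear in the corresponding "<view>.dat" extract file
    with open(lookup_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file_name", "original_field", "position"])
        for structure_file in sorted(glob.glob(os.path.join(structure_dir, "*.structure"))):
            data_file = os.path.basename(structure_file).replace(".structure", ".dat")
            with open(structure_file) as f:
                for position, column_name in enumerate(f):
                    writer.writerow([data_file, column_name.strip(), position])

Run against the example above, this would produce rows such as sap_bom_path.dat, werks, 0 and sap_bom_path.dat, root_matnr, 1, which is the same lookup format consumed by the sketch in the previous section.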

OK, now we have a mapping that explains how each view field name maps to its common name, and we have a lookup that shows exactly where each field appears in each file.

So all we have to do now is map, for each incoming field, the different incoming names to the common name that will be used in the rest of the code.

Finally…

From this point in the graph, going towards the destination database, all the processing can be done with a fixed set of field names.

And that is it: we have taken a dumping ground of data, extracted just the new and updated records, and funneled all the differing field names into a consistent set of field names that we can process, knowing that, for example, location is always location and ttnr is always ttnr, regardless of what the source called it.

How we solve these issues in our specific ETL environment is something you will learn at Synvert as an ETL Consultant.