A Data Engineering Story



What do you do when your clients’ Data Warehouse turns out to be a data dumping ground?

This is based on a use case from a real client and the methods we employed to process their data. As with most ETL pipelines, we were tasked with extracting data from one database system, enriching it with business rules, filtering and processing that data, and finally writing it to a different database so that it could be displayed by a front-end GUI system in the manner most useful to the client.

The Starting Point

The client gave us at Synvert access to their data warehouse. As it was a data warehouse, we were expecting some sort of star or snowflake schema with various fact and dimension tables, or at least some other indication that a record was new or had been updated, which we could use to process only the new or updated data.

Except that is not what we got. Instead, we were given 11 different database table views (a number that would increase as the project progressed). No details or access were provided to show how the views were built, just read access to those views. And to make matters worse, there was no indication of which records were new or updated, so every time we read the data we had to read all of it.

Dealing with the Views

So the first thing to handle was the processing of the views. Each view is different: different names, different column names, different numbers of columns and different data types.

What is required is a simple process that can extract data from any database table and write it to any file. This simple process has to be generic, so that it can process any database view and write the contents to any file location. We can accomplish this by having the simple process run from a series of configuration files. Each configuration file provides the specific details required (for example, database name, table name, table data structure, output filename, etc.) to extract data from one database view and write those contents to one file.

In this way we have one program and many configuration files. This gives us the advantage that if the code were to change (say, by the addition of a second output file) then the code would only need to change once and all the configuration files would remain unchanged.

What is in a configuration file?

Each of the configuration files provides the name of the process that is to be run, along with the required details of the database view to be read, the data structure of that view, and details of the file the data is to be written to.
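For example, a single configuration file might look something like this. The exact format and key names are specific to our ETL tool, so treat this as purely illustrative; names such as generic_view_extract and dwh_prod are made up for this sketch:

[process]
name = generic_view_extract

[source]
database  = dwh_prod
view_name = sap_bom_path
structure = sap_bom_path.structure

[target]
output_file = sap_bom_path.dat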

In this way we can build 11 configuration files – and easily add more when the requirements change – and each one extracts data for a specific database view.

So now we have 11 configuration files that all run the same simple generic process, but with different input values.
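To make the idea concrete, here is a minimal Python sketch of what such a generic process could look like. It is not our actual implementation (that lives inside our ETL tool); it assumes a standard DB-API database connection and the illustrative configuration keys shown above:

import configparser
import csv

def extract_view(config_path, connection):
    # read the configuration that describes this particular extract
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    view = cfg["source"]["view_name"]
    structure_file = cfg["source"]["structure"]
    output_file = cfg["target"]["output_file"]

    # read the whole view every time; the source gives us no change indicator
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM " + view)

    # record the column names (the data structure) for later use
    with open(structure_file, "w") as struct:
        for column in cursor.description:
            struct.write(column[0] + "\n")

    # write every record to the output file, one line per record
    with open(output_file, "w", newline="") as out:
        writer = csv.writer(out, delimiter="|")
        for row in cursor.fetchall():
            writer.writerow(row)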

The next question, though, is how do you run those 11 configuration files? And what happens when 11 files become 15, or even 200?

Welcome to loops

In the ETL tool we use at Synvert, the configuration files and the looping system are implemented in a way specific to that tool, but the basic idea is as described here.

It is very easy in Unix to create a loop that will run for, say, a number of files matching a file pattern. Since all the configuration files described above have a simple and similar naming structure, we can easily use standard Unix commands to find and run each configuration file.

For example:

for f in *config_file_pattern*.config; do ./"$f"; done

Now it does not matter if there are 11, 15 or even 200 configuration files; the same simple Unix command will find and run all of them.

Now we have copied all the data from the views into local files.  What is next?

Dealing with the Files

OK, big deal: so far all we have done is copy all the data from the views in the data warehouse into local files. Why?

Well, having the data in local files now makes it much easier for us to detect updates and new records.

The reason for this is that as the code reads each record from each file as a single line of text, the whole line can be hashed with SHA-256 and the hash value checked against a local file of all previous hashes. If the hash is found in the lookup, then the record has not changed and can be discarded; if the hash is not found, then it is either a new record or an updated record; either way, it is a record we need to keep and process.

Finally, the new hashes are added to the lookup file, ready for the next run.

This is a simple but effective method to determine whether a record is new, updated or unchanged, and, as promised, that is why we drop the views to local files first.
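As a rough illustration of the idea (again, not our ETL-tool implementation; the function and parameter names here are made up), the change detection could be sketched in Python like this:

import hashlib

def detect_changes(extract_path, hash_lookup_path, output_path):
    # load all previously seen hashes (on the first run the lookup does not exist yet)
    try:
        with open(hash_lookup_path) as f:
            known_hashes = set(line.strip() for line in f)
    except FileNotFoundError:
        known_hashes = set()

    new_hashes = []
    with open(extract_path) as extract, open(output_path, "w") as out:
        for line in extract:
            record = line.rstrip("\n")
            digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
            if digest in known_hashes:
                continue                  # unchanged record, discard it
            out.write(record + "\n")      # new or updated record, keep it
            new_hashes.append(digest)

    # remember the new hashes for the next run
    with open(hash_lookup_path, "a") as lookup:
        lookup.writelines(h + "\n" for h in new_hashes)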

But won’t we get hash collisions?

Well, the likelihood of a SHA-256 collision is vanishingly small: with, say, a billion hashed records, the birthday-bound probability is roughly n²/2²⁵⁷ ≈ 4.3×10⁻⁶⁰, so the chance of a collision is far too small to be worried about here.

When each view names the same thing differently

“I named all the stars in the night sky, but I named them all the same. It’s caused awful confusion in Heaven” – Drusilla (Buffy the Vampire Slayer)

We have now extracted all the data from the views and determined which records are new or have been modified. The next problem is that the same data is named differently across the 11 views. Some examples:

Name in the View    Common Name
loc                 location
location            location
Location            location
werks               location
ttnr                ttnr
matnr               ttnr
plan_drp            ttnr

We would like to always process the data using the common name, but as you can see, sometimes 3 or 4 different names are used across the views.

What we do not want is a long set of if (..) else if (..) else statements dotted throughout the code; we want a data-driven method, because in the future there will be more than 11 views and many more similar names to match.

Reducing data to common fields

The solution is to create a data file, maintained by the business team, which describes the original name along with its mapping to the common name for each field in each view.

The development team then use this file along with a second data file that tells them where, in each file, the original name can be found.

A simple generic piece of code was then provided to link those two data files together, so that for each common field name the code would like to use, it could find the original name in the data file and read that value.
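As a sketch of that linking code, assuming the two data files are simple CSVs with hypothetical columns (view_name, original_field, common_field for the mapping and file_name, original_field, position for the lookup), it could look something like this in Python:

import csv

def load_common_name_map(mapping_path):
    # business-maintained mapping: {(view, original field): common field}
    with open(mapping_path, newline="") as f:
        return {(row["view_name"], row["original_field"]): row["common_field"]
                for row in csv.DictReader(f)}

def load_position_lookup(lookup_path):
    # generated lookup: {(file, original field): column position in that file}
    with open(lookup_path, newline="") as f:
        return {(row["file_name"], row["original_field"]): int(row["position"])
                for row in csv.DictReader(f)}

def common_record(fields, view_name, file_name, name_map, positions):
    # return one record keyed by common field names instead of view-specific names
    record = {}
    for (view, original), common in name_map.items():
        if view != view_name:
            continue
        pos = positions.get((file_name, original))
        if pos is not None:
            record[common] = fields[pos]
    return record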

In our ETL tool environment, we have a specific way to generate code automatically given only the output data structure. This automatic coding means that when the output data structure changes, the code is regenerated automatically to match, saving time and effort and reducing risk; all of this leads to better code, simpler testing and better results for our clients.

A short diversion into data structures

Each file we write as an extract from the database views has a specific data structure that describes the names and data types of each column of each database view. When we write these views to files, the same data structure also describes each column of data in the file.

If we store the data structure information for each view in a file, then we can process the information in those files to create a lookup file, as shown below, that gives the position of each column name within the data file.

For example, the field werks is in position 0, root_matnr is in position 1, and so on, for the file named sap_bom_path.dat.

We can use this information to regenerate the lookup file each time the process runs, so that if some development has taken place that changed the order of the fields, the lookup will still show all the field positions correctly.
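For illustration, assuming each view's data structure is stored as a simple text file listing its column names in order (a made-up format for this sketch), regenerating the position lookup on each run could look like this:

import csv
import glob
import os

def build_position_lookup(structure_dir, lookup_path):
    # each "<view>.structure" file lists one column name per line, in the order
    # the columns appear in the corresponding "<view>.dat" extract file
    with open(lookup_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file_name", "original_field", "position"])
        for structure_file in sorted(glob.glob(os.path.join(structure_dir, "*.structure"))):
            data_file = os.path.basename(structure_file).replace(".structure", ".dat")
            with open(structure_file) as f:
                for position, column_name in enumerate(f):
                    writer.writerow([data_file, column_name.strip(), position])

Run against the example above, this would produce rows such as sap_bom_path.dat, werks, 0 and sap_bom_path.dat, root_matnr, 1, which is the same lookup format consumed by the sketch in the previous section.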

OK, now we have a mapping that explains how each view field name maps to its common name, and we have a lookup that shows exactly where each field appears in each file.

So all we have to do now is map, for each incoming field, the different incoming names to the common name that will be used in the rest of the code.

Finally…

From this point in the graph, going towards the destination database, all the processing can be done with a fixed set of field names.

And that is it: we have taken a dumping ground of data, extracted just the new and updated records, and funneled all the differing field names into a consistent set of field names that we can process, knowing that, for example, location is always location and ttnr is always ttnr, regardless of what the source called it.

How we solve these issues in our specific ETL environment is something you will learn at Synvert as an ETL Consultant.