Data Quality Accelerator



In our previous article, we dove into the whats and whys of Data Quality (DQ), and saw that it isn’t just a buzzword; it’s the backbone of trustworthy data analytics, AI, and decision-making. Ignoring Data Quality isn’t just risky, it’s expensive: according to Gartner, poor DQ costs businesses $12.9 million per year on average, as inaccurate customer insights lead to bad decisions, compliance fines, and more.

Flow diagram illustrating the impact of bad data quality—resulting in misleading insights, poor business decisions, and financial losses.

In this blog post we’ll continue reviewing Data Quality, but we’ll focus on the hows. As we saw in our recent article How to Operate your Enterprise Data Platform – Meet Our EDPOps Accelerator, one of the first choices to be made when implementing a DQ solution in an organisation is whether to build it or to buy it. Both choices offer their pros and cons and, as with many aspects of any data analytics journey, which is best for a particular organisation depends on various factors.

Here at synvert we have worked on both approaches during our years of experience helping our customers. On the one hand, we have worked with a “buy” approach, i.e. implementing DQ solutions using Ataccama, Informatica or Collibra – all excellent choices that go beyond DQ, covering other data governance aspects such as data cataloguing. On the other hand, we have also helped customers with a “build” approach, crafting various DQ solutions in different tech stacks, some of which we have already presented in our blog: see our posts about DataWash, a DQ solution for Snowflake (first and second post), and our article about a solution for Cloudera using Python and Great Expectations.

A key consideration when building a DQ solution is that our high-level approach remains consistent, regardless of the technology stack used. By combining this approach with the tools we have implemented across different environments and our extensive expertise in Data Governance, we are pleased to introduce our Data Quality Accelerator, a cost-effective and adaptable solution for accelerating DQ!

Let’s take a closer look at our DQ Accelerator. We’ll begin by recommending our Data Governance and Strategy Assessment as the first step, and then we’ll outline the core components of the common approach, followed by a brief overview of the reusable, adaptable, and extendable tools we’ve developed. Finally, we’ll present a demo of our tool for Databricks, which integrates Python, Great Expectations, and Power BI. We’ll also see how our accelerator can significantly speed up your data quality journey: the Databricks version showcased in this post was adapted and enhanced from a solution originally built for Cloudera, enabling our customer to quickly implement a strong DQ framework with minimal development effort.

Our Data Quality Accelerator blends:

Diagram of Assessment, Reusable Tooling and Approach.

Data Quality Accelerator – The Assessment

Whilst it may be tempting to dive straight into reusing our existing DQ tools, we recommend beginning your DQ journey with an assessment to identify the most suitable plan for your organisation. In particular, we suggest our broader Data Governance and Strategy Assessment, as even if your immediate focus is on DQ, any plan must align with your organisation’s overall Data Governance and Data Strategy initiatives. Our experts can support you in establishing a comprehensive Data Governance programme that encompasses not only Data Quality, but also Data Cataloguing, Master and Reference Data Management, DevOps, Data Stewardship, and the overall strategy, oversight, and control framework.

Data Governance and Strategy Assessment

Data Quality Accelerator – The Approach

Once we have determined that the best option for your organisation is to reuse our available tooling to accelerate your DQ journey, it is important to explain the common approach we apply across all implementations, regardless of the technology stack:

  1. Rules are defined to specify which checks will be performed on the datasets.
  2. An engine periodically executes these checks.
  3. Results, following a common base data model, are generated by the engine and accumulated over successive executions. These accumulated results enable the monitoring of when issues are resolved.
  4. Front-end applications are built to consume the results and trigger alerts accordingly.
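As a minimal illustration of steps 1 to 3, the following Python sketch shows the general shape of a rule definition, an engine run, and the accumulated results. All names (rule fields, check types, the results list) are illustrative placeholders, not the actual data model used by our tools:

```python
import re
from datetime import datetime, timezone

# Step 1 – rules specify which check runs against which dataset and column
# (all field names here are illustrative, not the accelerator's real data model).
rules = [
    {"rule_id": 1, "dataset": "employee", "column": "nationality", "check": "not_null"},
    {"rule_id": 2, "dataset": "employee", "column": "email", "check": "matches_regex",
     "kwargs": {"regex": r"^[^@]+@[^@]+\.[^@]+$"}},
]

def run_check(rule, records):
    """Step 2 – evaluate one rule and return the failing records."""
    col, check = rule["column"], rule["check"]
    if check == "not_null":
        return [r for r in records if r.get(col) is None]
    if check == "matches_regex":
        pattern = re.compile(rule["kwargs"]["regex"])
        return [r for r in records if r.get(col) is not None and not pattern.match(r[col])]
    raise ValueError(f"Unsupported check type: {check}")

# Step 3 – results follow a common base data model and accumulate over
# successive runs, so a front end (step 4) can track when issues are resolved.
validation_results = []

def run_engine(rules, records):
    run_ts = datetime.now(timezone.utc)
    for rule in rules:
        exceptions = run_check(rule, records)
        validation_results.append({
            "rule_id": rule["rule_id"],
            "run_ts": run_ts,
            "records_checked": len(records),
            "records_failed": len(exceptions),
        })

# Example: one engine run over a tiny in-memory dataset.
run_engine(rules, [{"nationality": "ES", "email": "a@b.com"},
                   {"nationality": None, "email": "broken"}])
```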

This is how our DQ Approach works:

Flow Diagram of Data Quality Accelerator

Data Quality Accelerator – Reusable Tooling

We have built various tools to implement the above approach for different tech stacks. Whilst the tools share the same underlying methodology, they have been adapted to different customers and their demands.

  • DataWash for Snowflake: DataWash uses Snowpark as the engine and Streamlit as the front-end application for both rule management and the visualisation of results. It was the first data quality tool we developed, initially released over two years ago (see blog posts part 1 and part 2), and it has been evolving ever since. Check out our recent webinar on DataWash 2.0.
  • Spark & GX Data Quality Framework for Databricks and Cloudera: As mentioned above, this framework was first developed for Cloudera, and we have recently adapted and extended it to work with Databricks for one of our key customers. The engine uses Python with Great Expectations in Spark environments, and Power BI is used for result visualisation.
  • SQL Data Quality Framework for RDBMSs: For environments relying on traditional SQL databases, such as that of one of our largest customers, we have created a framework that implements the DQ approach described above, using SQL and stored procedures as the engine. For visualisation, we built a dashboard in the customer’s reporting tool, Oracle Business Intelligence Suite Enterprise Edition (OBIEE). The solution is currently running and has been tested in Oracle, but it can be used with any RDBMS that supports stored procedures. Moreover, since the data model used to store results has the same base as in the Databricks and Cloudera framework, we could easily adapt the Power BI dashboard developed there to work here too.

All the above tools are reusable and adaptable; in fact, we’ve already reused and adapted them several times. As mentioned, we extended the framework initially created for Cloudera to work with Databricks, and we also reused the Power BI dashboard developed for Databricks/Cloudera within the SQL framework, thanks to the shared base data model. This means we can readily adapt these tools to meet your specific needs.

All these tools are live and actively used in production, evolving across various customer environments. Current and planned enhancements include the use of AI to recommend new rules, simplified rule management through intuitive front-end applications, rule onboarding lifecycle management, and the introduction of alerting systems that go beyond dashboards to deliver results.

In the next section we’ll go through a demo use case for the Spark & GX Data Quality Framework for Databricks and Cloudera. As we already demoed the earlier version that only worked for Cloudera in a previous blog post, we’ll now look at the newer version that works with Databricks too. In a future article, we’ll demo the SQL Data Quality Framework for RDBMSs as well.

Demo – Spark & GX Data Quality Framework for Databricks & Cloudera

Let’s consider the following synthetic use case for the purposes of this demo. In every organisation, the HR department plays a pivotal role, not just in managing people, but also in shaping the workplace culture, driving strategic talent initiatives, and ensuring the best hiring practices. Nevertheless, for the HR department of the CartPartX organisation, the lack of trustworthy data was creating serious roadblocks: despite having access to datasets with diverse employee information, the department was struggling to make data-driven decisions due to various DQ issues, and began searching for a way to identify and solve them.

The data engineers working for the HR department have been using multiple internal systems to collect and store workforce data in Databricks Delta tables. In this demo we’ll be focusing on the Employee table, which contains fields such as First Name, Last Name, Full Name, Sex, Title, Date of Birth, Nationality, Generation, Marital Status, Family Book Number, Email Address, Mobile Number, Person ID, National Identifier, National ID Expiry Date, Passport Type, Passport Identifier, Passport Expiry Date, Original Hire Date, Entity Hire Date, etc.
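For illustration, inspecting a handful of these attributes from the Delta table in a Databricks notebook could look like the snippet below; the table name and column names are hypothetical placeholders rather than CartPartX’s actual naming:

```python
from pyspark.sql import functions as F

# Hypothetical Delta table location; the real catalog/schema names will differ.
employee_df = spark.read.table("hr.employee")

# A subset of the attributes that the DQ rules in this demo focus on.
employee_df.select(
    "first_name", "last_name", "email_address", "nationality",
    "marital_status", "national_identifier", "mobile_number",
    F.col("date_of_birth").cast("date"),
).show(5, truncate=False)
```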

The Employee table presented some DQ challenges such as missing values, incorrect date formats, incorrect emails, duplicate values, and so on.

Why a Data Quality Framework was Needed

Initially, the HR department relied on ad hoc manual work (SQL checks and Excel pivots) to detect and address these issues. But as the data grew and more reports were built on top of unvalidated data, the downstream impact became unmanageable. The department realised they needed a better, automated DQ approach to tackle the inconsistent KPIs and misleading dashboards, as well as to detect broken data pipelines.

Implementing a Data Quality Framework

CartPartX’s data platform is built on Databricks, so we used our Spark & GX Data Quality Framework for Databricks and Cloudera, which integrated smoothly into their environment. Our framework provided a structured approach to defining, executing, and monitoring DQ rules, making it an essential component for CartPartX. It can equally serve any organisation seeking to achieve high data reliability and trust on Databricks, Cloudera, Snowflake, or other relational databases through our complementary tools.

Our team supported the deployment of the framework on CartPartX’s Databricks platform, including the installation of the required Python dependencies (great_expectations, pyspark and pandas). The framework operates through four key tables:

  1. dq_rule: This table stores the user-defined DQ rules to be applied.
  2. dq_rule_type: This table contains details of the supported rule types and maps them to the relevant Great Expectations expectation library.
  3. dq_validations: This table stores the execution results of the applied rules against the related datasets; we call them “validations”.
  4. dq_exceptions: This table stores the identifiers of the dataset records that failed the applied rule conditions; we call them “exceptions”.

The framework engine reads and translates the configured rules from the dq_rule and dq_rule_type tables, then processes the source datasets using the corresponding Great Expectations libraries. The results are then stored in the dq_validations and dq_exceptions tables, which serve as the foundation for the Power BI DQ Dashboard reports.
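To give a feel for how this works, here is a simplified sketch of the rule-translation loop. It assumes the legacy SparkDFDataset interface of Great Expectations and hypothetical column names in the dq_rule and dq_rule_type tables; the real framework adds error handling, exception-record extraction into dq_exceptions, and richer metadata:

```python
from datetime import datetime, timezone
from great_expectations.dataset import SparkDFDataset  # legacy GX interface, assumed here

# Read the configured rules and their mapping to Great Expectations expectations.
rules = (
    spark.table("dq.dq_rule")
    .join(spark.table("dq.dq_rule_type"), "rule_type_id")
    .collect()
)

validation_rows = []
run_ts = datetime.now(timezone.utc)

for rule in rules:
    source_df = spark.table(rule["dataset_name"])
    gx_dataset = SparkDFDataset(source_df)

    # dq_rule_type maps each rule type to an expectation name,
    # e.g. "expect_column_values_to_not_be_null".
    expectation = getattr(gx_dataset, rule["expectation_name"])
    result = expectation(rule["column_name"], result_format="SUMMARY")

    validation_rows.append({
        "rule_id": rule["rule_id"],
        "run_ts": run_ts,
        "success": result.success,
        "unexpected_count": result.result.get("unexpected_count", 0),
    })

# Persist the run so the Power BI DQ Dashboard can pick it up.
spark.createDataFrame(validation_rows).write.mode("append").saveAsTable("dq.dq_validations")
```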

We assessed the HR team’s DQ requirements, then worked with the data team to define the rules to be applied to the relevant columns. For the Employee table alone, we created 134 rules covering a range of checks, including:

  • Verifying that key fields are not null (e.g. nationality) – see Figure 1.
  • Identifying format errors (e.g. email addresses in an incorrect format) – see Figure 2.
  • Ensuring field uniqueness, i.e. no duplicate records based on key fields (e.g. national identifier) – see Figure 3.
  • Validating custom logical conditions, for example, ensuring that a person identifier also exists in a lookup table – see Figure 4.
  • Confirming that field values fall within a defined list of reference values (e.g. marital status limited to Single, Married, Divorced, or Widowed) – see Figure 5.
  • And many more – as the framework supports custom rules, virtually any check can be implemented.

These DQ rules can be defined in an Excel file or inserted directly into the dq_rule table. In the future, we plan to introduce a web-based application to simplify rule management.
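For illustration, a couple of rules for the Employee table could be inserted along these lines; the dq_rule column names are hypothetical placeholders, and in practice the team maintained the rules in Excel and loaded them in bulk:

```python
# Hypothetical dq_rule columns; the real table carries additional metadata
# such as department, severity, and rule onboarding status.
new_rules = [
    # Null check on nationality, mapped to expect_column_values_to_not_be_null.
    (101, "hr.employee", "nationality", "NOT_NULL", None),
    # Format check on email, mapped to expect_column_values_to_match_regex.
    (102, "hr.employee", "email_address", "REGEX_MATCH",
     '{"regex": "^[^@]+@[^@]+[.][^@]+$"}'),
]

schema = "rule_id INT, dataset_name STRING, column_name STRING, rule_type STRING, rule_parameters STRING"
spark.createDataFrame(new_rules, schema).write.mode("append").saveAsTable("dq.dq_rule")
```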

Once the required set of rules was in place, we integrated the DQ Dashboard built in Power BI. To automate execution within the Databricks workflow, we created a scheduled job to run daily, enabling the detection of new errors and the monitoring of those that have been resolved.
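As an indication of what that scheduling can look like, the sketch below uses the Databricks Python SDK to create a daily job that runs a driver notebook for the framework. The notebook path, cluster ID, and cron expression are placeholders, and the same job can just as easily be configured through the Databricks Workflows UI:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # authenticates via the usual Databricks environment/config

w.jobs.create(
    name="dq-framework-daily-run",
    tasks=[
        jobs.Task(
            task_key="run_dq_checks",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/dq/run_dq_checks"),  # placeholder path
            existing_cluster_id="<cluster-id>",  # placeholder cluster
        )
    ],
    # Run every day at 02:00 UTC; adjust to the ingestion schedule.
    schedule=jobs.CronSchedule(quartz_cron_expression="0 0 2 * * ?", timezone_id="UTC"),
)
```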

Results Achieved

The Spark & GX DQ Framework greatly helped the HR department to uncover their DQ issues. Below you can see some screenshots of the results from our DQ Dashboard:

Spark & GX DQ Framework

Screen 1 – Results for all rules defined for the Employee table; a selection of a null check rule for the nationality field which displays the rows with null nationalities.

Screen 1 shows the main page of the Power BI DQ Dashboard. Here, the overall DQ score for the Employee table, filtered by table: Employee and department: Human Resources, is 88.3%. A total of 134 rules are defined, with 154 new exceptions detected in the latest run, and 32,000 existing exceptions carried over from previous runs. No exceptions were resolved in the most recent execution. In this example, a specific rule checking for null values in the Nationality field is selected, and the table on the right displays the records with these errors, i.e. rows where nationality is null. Whilst this demo focuses on a single table, the framework supports multiple tables across various domains or departments, depending on how each organisation structures its data platform.

Power BI DQ Dashboard

Screen 2 – Filtering to see results only for rules on the email attribute; a selection of a specific rule which displays emails with format errors

Screen 2 illustrates the use of the Attribute filter to display only the rules applied to the Email field. Three rules are set to check format, missing values and a custom logic condition. The rule validating the email format is selected, and the table on the right lists the non-compliant records.

Power BI DQ Dashboard

Screen 3 – Filtering to see only rules of type uniqueness; a selection of a specific rule which displays duplicated mobile numbers

Screen 3 shows the use of the DQ Dimension filter to display only rules of the Uniqueness type. A specific rule is selected, and the table on the right displays the duplicate records with matching mobile numbers.

Power BI DQ Dimension Filter Dashboard

Screen 4 – Filtering to see only rules of the logical type; a selection of a specific rule which shows the non-compliant rows

Screen 4 depicts rules of the Logical type, where custom logical conditions can be defined. In this example, a rule is selected that verifies each employee has at least one valid assignment by checking that person_id values exist in a secondary table containing assignment records. The table on the right lists employees without valid assignments.

Power BI DQ Dimension Filter Dashboard

Screen 5 – Filtering to see only rules of type validity; a selection of a specific rule which shows the non-compliant rows

Screen 5 shows the application of the DQ Dimension filter to display only Validity rules, which verify that the value of a given field falls within a predefined list of reference values. In this example, the rule for the Passport Type field is selected, and the table on the right displays the rows where the field contains unexpected values.

Once the relevant team members had reviewed this dashboard, the DQ issues were identified and the HR team coordinated with the necessary colleagues to fix them at source. As a result, subsequent data ingestion cycles and DQ runs reflected these fixes. After just a few iterations, most of the issues were resolved, leading to significantly more reliable analytics and improved decision-making in the HR department.

Conclusion

There is no one-size-fits-all approach to Data Quality. Achieving trustworthy, reliable data requires a solid strategy, adaptable tooling, and the right expertise. Our DQ Accelerator provides exactly that: a structured and reusable framework that can be adapted across platforms such as Snowflake, Cloudera, Databricks, and traditional SQL databases, helping organisations fast-track their DQ journey.

In this blog post we’ve showcased the Spark & GX Data Quality Framework for Databricks with a relatable HR data use case. The tool demonstrates how organisations can go beyond manual checks and ad hoc scripts by using an automated, rule-based framework that integrates perfectly into Databricks. With this framework, teams can:

  • Define and execute rules such as null checks, format checks, uniqueness checks, logical condition checks, validity checks (against reference values), and many more.
  • Capture and store results in standardised tables, ensuring consistency and traceability.
  • Monitor and visualise data quality through a Power BI dashboard that highlights overall DQ scores, exceptions, and trends across tables and domains.
  • Track issue resolution over time, differentiating between new and persistent exceptions, allowing validation when issues have been fixed at source.

In practice, this means data teams can systematically detect problems (like missing values, invalid emails, duplicate identifiers, etc.), prioritise fixes, and validate improvements, whilst maintaining transparency and scalability.

Finally, it’s worth noting that once implemented, our tooling is free of charge; the only associated costs come from platform usage (Databricks compute, Snowflake, Cloudera, etc.). However, publishing and sharing the Power BI DQ Dashboard requires the appropriate Power BI licences.

Ready to take the next step? Reach out to us to see how our DQ Accelerator can be tailored to your platform and business needs.