Data Quality Accelerator



In our previous article, we dove into the whats and whys of Data Quality (DQ), and saw that it isn’t just a buzzword; it’s the backbone of trustworthy data analytics, AI, and decision-making. Ignoring Data Quality isn’t just risky, it’s expensive: according to Gartner, poor DQ costs businesses $12.9 million per year on average, as inaccurate customer insights lead to bad decisions, compliance fines, and more.

Flow diagram illustrating the impact of bad data quality—resulting in misleading insights, poor business decisions, and financial losses.

In this blog post we’ll continue reviewing Data Quality, but we’ll focus on the hows. As we saw in our recent article How to Operate your Enterprise Data Platform – Meet Our EDPOps Accelerator, one of the first choices to be made when implementing a DQ solution in an organisation is whether to build it or to buy it. Both choices offer their pros and cons and, as with many aspects of any data analytics journey, which is best for a particular organisation depends on various factors.

Here at synvert we have worked on both approaches during our years of experience helping our customers. On the one hand, we have worked with a “buy” approach, i.e. implementing DQ solutions using Ataccama, Informatica or Collibra – all excellent choices that go beyond DQ, covering other data governance aspects such as data cataloguing. On the other hand, we have also helped customers with a “build” approach, crafting various DQ solutions in different tech stacks, some of which we have already presented in our blog: see our posts about DataWash, a DQ solution for Snowflake (first and second post), and our article about a solution for Cloudera using Python and Great Expectations.

A key consideration when building a DQ solution is that our high-level approach remains consistent, regardless of the technology stack used. By combining this approach with the tools we have implemented across different environments and our extensive expertise in Data Governance, we are pleased to introduce our Data Quality Accelerator, a cost-effective and adaptable solution for accelerating DQ!

Let’s take a closer look at our DQ Accelerator. We’ll begin by recommending our Data Governance and Strategy Assessment as the first step, and then we’ll outline the core components of the common approach, followed by a brief overview of the reusable, adaptable, and extendable tools we’ve developed. Finally, we’ll present a demo of our tool for Databricks, which integrates Python, Great Expectations, and Power BI. We’ll also see how our accelerator can significantly speed up your data quality journey: the Databricks version showcased in this post was adapted and enhanced from a solution originally built for Cloudera, enabling our customer to quickly implement a strong DQ framework with minimal development effort.

Our Data Quality Accelerator blends:

Diagram of Assessment, Reusable Tooling and Approach.

Data Quality Accelerator – The Assessment

Whilst it may be tempting to dive straight into reusing our existing DQ tools, we recommend beginning your DQ journey with an assessment to identify the most suitable plan for your organisation. In particular, we suggest our broader Data Governance and Strategy Assessment, as even if your immediate focus is on DQ, any plan must align with your organisation’s overall Data Governance and Data Strategy initiatives. Our experts can support you in establishing a comprehensive Data Governance programme that encompasses not only Data Quality, but also Data Cataloguing, Master and Reference Data Management, DevOps, Data Stewardship, and the overall strategy, oversight, and control framework.

Data Governance and Strategy Assessment

Data Quality Accelerator – The Approach

Once we have determined that the best option for your organisation is to reuse our available tooling to accelerate your DQ journey, it is important to explain the common approach we apply across all implementations, regardless of the technology stack:

  1. Rules are defined to specify which checks will be performed on the datasets.
  2. An engine periodically executes these checks.
  3. Results, following a common base data model, are generated by the engine and accumulated over successive executions. These accumulated results enable the monitoring of when issues are resolved.
  4. Front-end applications are built to consume the results and trigger alerts accordingly.
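As a minimal illustration of steps 1 to 3, the following Python sketch shows the general shape of a rule definition, an engine run, and the accumulated results. All names (rule fields, check types, the results list) are illustrative placeholders, not the actual data model used by our tools:

```python
import re
from datetime import datetime, timezone

# Step 1 – rules specify which check runs against which dataset and column
# (all field names here are illustrative, not the accelerator's real data model).
rules = [
    {"rule_id": 1, "dataset": "employee", "column": "nationality", "check": "not_null"},
    {"rule_id": 2, "dataset": "employee", "column": "email", "check": "matches_regex",
     "kwargs": {"regex": r"^[^@]+@[^@]+\.[^@]+$"}},
]

def run_check(rule, records):
    """Step 2 – evaluate one rule and return the failing records."""
    col, check = rule["column"], rule["check"]
    if check == "not_null":
        return [r for r in records if r.get(col) is None]
    if check == "matches_regex":
        pattern = re.compile(rule["kwargs"]["regex"])
        return [r for r in records if r.get(col) is not None and not pattern.match(r[col])]
    raise ValueError(f"Unsupported check type: {check}")

# Step 3 – results follow a common base data model and accumulate over
# successive runs, so a front end (step 4) can track when issues are resolved.
validation_results = []

def run_engine(rules, records):
    run_ts = datetime.now(timezone.utc)
    for rule in rules:
        exceptions = run_check(rule, records)
        validation_results.append({
            "rule_id": rule["rule_id"],
            "run_ts": run_ts,
            "records_checked": len(records),
            "records_failed": len(exceptions),
        })

# Example: one engine run over a tiny in-memory dataset.
run_engine(rules, [{"nationality": "ES", "email": "a@b.com"},
                   {"nationality": None, "email": "broken"}])
```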

This is how our DQ Approach works:

Flow Diagram of Data Quality Accelerator

Data Quality Accelerator – Reusable Tooling

We have built various tools to implement the above approach for different tech stacks. Whilst the tools share the same underlying methodology, they have been adapted to different customers and their demands.

  • DataWash for Snowflake: DataWash uses Snowpark as the engine and Streamlit as the front-end application for both rule management and the visualisation of results. It was the first data quality tool we developed, initially released over two years ago (see blog posts part 1 and part 2), and it has been evolving ever since. Check out our recent webinar on DataWash 2.0.
  • Spark & GX Data Quality Framework for Databricks and Cloudera: As mentioned above, this framework was first developed for Cloudera, and we have recently adapted and extended it to work with Databricks for one of our key customers. The engine uses Python with Great Expectations in Spark environments, and Power BI is used for result visualisation.
  • SQL Data Quality Framework for RDBMSs: For environments relying on traditional SQL databases, such as that of one of our largest customers, we have created a framework that implements the DQ approach described above, using SQL and stored procedures as the engine. For visualisation, we built a dashboard in the customer’s reporting tool, Oracle Business Intelligence Suite Enterprise Edition (OBIEE). The solution is currently running and has been tested in Oracle, but it can be used with any RDBMS that supports stored procedures. Moreover, since the data model used to store results has the same base as in the Databricks and Cloudera framework, we could easily adapt the Power BI dashboard developed there to work here too.

All the above tools are reusable and adaptable; in fact, we’ve already reused and adapted them several times. As mentioned, we extended the framework initially created for Cloudera to work with Databricks, and we also reused the Power BI dashboard developed for Databricks/Cloudera within the SQL framework, thanks to the shared base data model. This means we can readily adapt these tools to meet your specific needs.

All these tools are live and actively used in production, evolving across various customer environments. Current and planned enhancements include the use of AI to recommend new rules, simplified rule management through intuitive front-end applications, rule onboarding lifecycle management, and the introduction of alerting systems that go beyond dashboards to deliver results.

In the next section we’ll go through a demo use case for the Spark & GX Data Quality Framework for Databricks and Cloudera. As we already demoed the earlier version that only worked for Cloudera in a previous blog post, we’ll now look at the newer version that works with Databricks too. In a future article, we’ll demo the SQL Data Quality Framework for RDBMSs as well.

Demo – Spark & GX Data Quality Framework for Databricks & Cloudera

Let’s consider the following synthetic use case for the purposes of this demo. In every organisation, the HR department plays a pivotal role, not just in managing people, but also in shaping the workplace culture, driving strategic talent initiatives, and ensuring the best hiring practices. Nevertheless, for the HR department of the CartPartX organisation, the lack of trustworthy data was creating serious roadblocks: despite having access to datasets with diverse employee information, the department was struggling to make data-driven decisions due to various DQ issues, and began searching for a way to identify and solve them.

The data engineers working for the HR department have been using multiple internal systems to collect and store workforce data in Databricks Delta tables. In this demo we’ll be focusing on the Employee table, which contains fields such as First Name, Last Name, Full Name, Sex, Title, Date of Birth, Nationality, Generation, Marital Status, Family Book Number, Email Address, Mobile Number, Person ID, National Identifier, National ID Expiry Date, Passport Type, Passport Identifier, Passport Expiry Date, Original Hire Date, Entity Hire Date, etc.
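For illustration, inspecting a handful of these attributes from the Delta table in a Databricks notebook could look like the snippet below; the table name and column names are hypothetical placeholders rather than CartPartX’s actual naming:

```python
from pyspark.sql import functions as F

# Hypothetical Delta table location; the real catalog/schema names will differ.
employee_df = spark.read.table("hr.employee")

# A subset of the attributes that the DQ rules in this demo focus on.
employee_df.select(
    "first_name", "last_name", "email_address", "nationality",
    "marital_status", "national_identifier", "mobile_number",
    F.col("date_of_birth").cast("date"),
).show(5, truncate=False)
```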

The Employee table presented some DQ challenges such as missing values, incorrect date formats, incorrect emails, duplicate values, and so on.

Why a Data Quality Framework was Needed

Initially, the HR department relied on ad hoc manual work (SQL checks and Excel pivots) to detect and address these issues. But as the data grew and more reports were built on top of unvalidated data, the downstream impact became unmanageable. The department realised they needed a better, automated DQ approach to tackle the inconsistent KPIs and misleading dashboards, as well as to detect broken data pipelines.

Implementing a Data Quality Framework

CartPartX’s data platform is built on Databricks, so we used our Spark & GX Data Quality Framework for Databricks and Cloudera, which integrated smoothly into their environment. Our framework provided a structured approach to defining, executing, and monitoring DQ rules, making it an essential component for CartPartX. It can equally serve any organisation seeking to achieve high data reliability and trust on Databricks, Cloudera, Snowflake, or other relational databases through our complementary tools.

Our team supported the deployment of the framework on CartPartX’s Databricks platform, including the installation of the required Python dependencies (great_expectations, pyspark and pandas). The framework operates through four key tables:

  1. dq_rule: This table stores the user-defined DQ rules to be applied.
  2. dq_rule_type: This table contains details of the supported rule types and maps them to the relevant Great Expectations expectation library.
  3. dq_validations: This table stores the execution results of the applied rules against the related datasets; we call them “validations”.
  4. dq_exceptions: This table stores the identifiers of the dataset records that failed the applied rule conditions; we call them “exceptions”.

The framework engine reads and translates the configured rules from the dq_rule and dq_rule_type tables, then processes the source datasets using the corresponding Great Expectations libraries. The results are then stored in the dq_validations and dq_exceptions tables, which serve as the foundation for the Power BI DQ Dashboard reports.
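To give a feel for how this works, here is a simplified sketch of the rule-translation loop. It assumes the legacy SparkDFDataset interface of Great Expectations and hypothetical column names in the dq_rule and dq_rule_type tables; the real framework adds error handling, exception-record extraction into dq_exceptions, and richer metadata:

```python
from datetime import datetime, timezone
from great_expectations.dataset import SparkDFDataset  # legacy GX interface, assumed here

# Read the configured rules and their mapping to Great Expectations expectations.
rules = (
    spark.table("dq.dq_rule")
    .join(spark.table("dq.dq_rule_type"), "rule_type_id")
    .collect()
)

validation_rows = []
run_ts = datetime.now(timezone.utc)

for rule in rules:
    source_df = spark.table(rule["dataset_name"])
    gx_dataset = SparkDFDataset(source_df)

    # dq_rule_type maps each rule type to an expectation name,
    # e.g. "expect_column_values_to_not_be_null".
    expectation = getattr(gx_dataset, rule["expectation_name"])
    result = expectation(rule["column_name"], result_format="SUMMARY")

    validation_rows.append({
        "rule_id": rule["rule_id"],
        "run_ts": run_ts,
        "success": result.success,
        "unexpected_count": result.result.get("unexpected_count", 0),
    })

# Persist the run so the Power BI DQ Dashboard can pick it up.
spark.createDataFrame(validation_rows).write.mode("append").saveAsTable("dq.dq_validations")
```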

We assessed the HR team’s DQ requirements, then worked with the data team to define the rules to be applied to the relevant columns. For the Employee table alone, we created 134 rules covering a range of checks, including:

  • Verifying that key fields are not null (e.g. nationality) – see Figure 1.
  • Identifying format errors (e.g. email addresses in an incorrect format) – see Figure 2.
  • Ensuring field uniqueness, i.e. no duplicate records based on key fields (e.g. national identifier) – see Figure 3.
  • Validating custom logical conditions, for example, ensuring that a person identifier also exists in a lookup table – see Figure 4.
  • Confirming that field values fall within a defined list of reference values (e.g. marital status limited to Single, Married, Divorced, or Widowed) – see Figure 5.
  • And many more – as the framework supports custom rules, virtually any check can be implemented.

These DQ rules can be defined in an Excel file or inserted directly into the dq_rule table. In the future, we plan to introduce a web-based application to simplify rule management.
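For illustration, a couple of rules for the Employee table could be inserted along these lines; the dq_rule column names are hypothetical placeholders, and in practice the team maintained the rules in Excel and loaded them in bulk:

```python
# Hypothetical dq_rule columns; the real table carries additional metadata
# such as department, severity, and rule onboarding status.
new_rules = [
    # Null check on nationality, mapped to expect_column_values_to_not_be_null.
    (101, "hr.employee", "nationality", "NOT_NULL", None),
    # Format check on email, mapped to expect_column_values_to_match_regex.
    (102, "hr.employee", "email_address", "REGEX_MATCH",
     '{"regex": "^[^@]+@[^@]+[.][^@]+$"}'),
]

schema = "rule_id INT, dataset_name STRING, column_name STRING, rule_type STRING, rule_parameters STRING"
spark.createDataFrame(new_rules, schema).write.mode("append").saveAsTable("dq.dq_rule")
```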

Once the required set of rules was in place, we integrated the DQ Dashboard built in Power BI. To automate execution within the Databricks workflow, we created a scheduled job to run daily, enabling the detection of new errors and the monitoring of those that have been resolved.
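As an indication of what that scheduling can look like, the sketch below uses the Databricks Python SDK to create a daily job that runs a driver notebook for the framework. The notebook path, cluster ID, and cron expression are placeholders, and the same job can just as easily be configured through the Databricks Workflows UI:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # authenticates via the usual Databricks environment/config

w.jobs.create(
    name="dq-framework-daily-run",
    tasks=[
        jobs.Task(
            task_key="run_dq_checks",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/dq/run_dq_checks"),  # placeholder path
            existing_cluster_id="<cluster-id>",  # placeholder cluster
        )
    ],
    # Run every day at 02:00 UTC; adjust to the ingestion schedule.
    schedule=jobs.CronSchedule(quartz_cron_expression="0 0 2 * * ?", timezone_id="UTC"),
)
```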

Results Achieved

The Spark & GX DQ Framework greatly helped the HR department to uncover their DQ issues. Below you can see some screenshots of the results from our DQ Dashboard:

Spark & GX DQ Framework

Screen 1 – Results for all rules defined for the Employee table; a selection of a null check rule for the nationality field which displays the rows with null nationalities.

Screen 1 shows the main page of the Power BI DQ Dashboard. Here, the overall DQ score for the Employee table, filtered by table: Employee and department: Human Resources, is 88.3%. A total of 134 rules are defined, with 154 new exceptions detected in the latest run, and 32,000 existing exceptions carried over from previous runs. No exceptions were resolved in the most recent execution. In this example, a specific rule checking for null values in the Nationality field is selected, and the table on the right displays the records with these errors, i.e. rows where nationality is null. Whilst this demo focuses on a single table, the framework supports multiple tables across various domains or departments, depending on how each organisation structures its data platform.

Power BI DQ Dashboard

Screen 2 – Filtering to see results only for rules on the email attribute; a selection of a specific rule which displays emails with format errors

Screen 2 illustrates the use of the Attribute filter to display only the rules applied to the Email field. Three rules are set to check format, missing values and a custom logic condition. The rule validating the email format is selected, and the table on the right lists the non-compliant records.

Power BI DQ Dashboard

Screen 3 – Filtering to see only rules of type uniqueness; a selection of a specific rule which displays duplicated mobile numbers

Screen 3 shows the use of the DQ Dimension filter to display only rules of the Uniqueness type. A specific rule is selected, and the table on the right displays the duplicate records with matching mobile numbers.

Power BI DQ Dimension Filter Dashboard

Screen 4 – Filtering to see only rules of the logical type; a selection of a specific rule which shows the non-compliant rows

Screen 4 depicts rules of the Logical type, where custom logical conditions can be defined. In this example, a rule is selected that verifies each employee has at least one valid assignment by checking that person_id values exist in a secondary table containing assignment records. The table on the right lists employees without valid assignments.

Power BI DQ Dimension Filter Dashboard

Screen 5 – Filtering to see only rules of type validity; a selection of a specific rule which shows the non-compliant rows

Screen 5 shows the application of the DQ Dimension filter to display only Validity rules, which verify that the value of a given field falls within a predefined list of reference values. In this example, the rule for the Passport Type field is selected, and the table on the right displays the rows where the field contains unexpected values.

Once the relevant team members had reviewed this dashboard, the DQ issues were identified and the HR team coordinated with the necessary colleagues to fix them at source. As a result, subsequent data ingestion cycles and DQ runs reflected these fixes. After just a few iterations, most of the issues were resolved, leading to significantly more reliable analytics and improved decision-making in the HR department.

Conclusion

There is no one-size-fits-all approach to Data Quality. Achieving trustworthy, reliable data requires a solid strategy, adaptable tooling, and the right expertise. Our DQ Accelerator provides exactly that: a structured and reusable framework that can be adapted across platforms such as Snowflake, Cloudera, Databricks, and traditional SQL databases, helping organisations fast-track their DQ journey.

In this blog post we’ve showcased the Spark & GX Data Quality Framework for Databricks with a relatable HR data use case. The tool demonstrates how organisations can go beyond manual checks and ad hoc scripts by using an automated, rule-based framework that integrates perfectly into Databricks. With this framework, teams can:

  • Define and execute rules such as null checks, format checks, uniqueness checks, logical condition checks, validity checks (against reference values), and many more.
  • Capture and store results in standardised tables, ensuring consistency and traceability.
  • Monitor and visualise data quality through a Power BI dashboard that highlights overall DQ scores, exceptions, and trends across tables and domains.
  • Track issue resolution over time, differentiating between new and persistent exceptions, allowing validation when issues have been fixed at source.

In practice, this means data teams can systematically detect problems (like missing values, invalid emails, duplicate identifiers, etc.), prioritise fixes, and validate improvements, whilst maintaining transparency and scalability.

Finally, it’s worth noting that once implemented, our tooling is free of charge; the only associated costs come from platform usage (Databricks compute, Snowflake, Cloudera, etc.). However, publishing and sharing the Power BI DQ Dashboard requires the appropriate Power BI licences.

Ready to take the next step? Reach out to us to see how our DQ Accelerator can be tailored to your platform and business needs.