The ability to transform and analyse data efficiently is crucial for gaining insights that drive business decisions. For organisations dealing with large-scale data pipelines, ensuring a smooth, auditable, and manageable ETL process is of paramount importance. One of the best data platforms we’ve worked with is Cloudera Data Platform (CDP), which offers an extensive suite of open-source services for building reliable ETL pipelines.

In a recent project with Cloudera for a customer, we developed a framework called DLPROC to streamline Spark jobs in CDP. DLPROC is a lightweight but powerful solution designed to simplify Spark-based data transformations, offering strong auditing capabilities and seamless integration with cloud environments.

In this blog post we’ll walk you through deploying DLPROC to transform data across different layers of a data lake using Spark SQL, allowing for the creation of curated data whilst keeping the framework straightforward, scalable, and auditable. What’s more, DLPROC is versatile enough to be used in both private and public cloud environments.

Why Another ETL Framework?

First, it’s important to clarify that DLPROC is not intended to compete with heavyweight ETL frameworks like Cloudera’s own Cloudera Data Engineering (CDE) Data Service or other industry-grade ETL solutions. There are various approaches an organisation can take to implement Spark ETLs within CDP, but we’ve often seen customers begin with small, custom Spark ETL implementations that quickly become fragmented and inefficient as new use cases arise.

While an obvious solution is to use CDE, some organisations face constraints, such as hardware limitations, that prevent them from adopting CDE or other comprehensive ETL solutions. DLPROC was developed to fill this gap, offering an accessible framework for Spark ETLs, free from the complexity of full-scale ETL platforms yet capable of supporting efficient data transformations.

Let’s explore the key advantages of DLPROC when compared to scattered, in-house Spark framework implementations:

  1. Consolidated Implementations: DLPROC replaces fragmented solutions by offering a unified platform for executing Spark jobs, providing centralised auditing and monitoring for seamless management.
  2. Simplified Data Transformation: With DLPROC, data transformations can be carried out through simple, sequential Spark SQL queries, making the process more efficient and significantly easier to debug and maintain.
  3. Enhanced Operational Visibility: Control tables and audit logs are stored in a metadata database, giving operations teams easy access to monitor and manage ETL processes from a central location.
  4. Lightweight and Flexible Framework: DLPROC is a lightweight, Python-based framework that is quick to deploy and easy to manage in both private and public cloud environments, particularly valuable when data services aren’t available.

Understanding Where DLPROC Fits In

To fully grasp the functionality of DLPROC and its role within the data transformation landscape, let’s explore a practical use case in the telecom sector, the objective being to conduct a network connectivity analysis. This involves processing data from various sources and transforming it into a consolidated, aggregated table within the gold layer.

For context, here’s a visual representation of the data flow in our use case, highlighting where DLPROC fits in:

Figure 1: Architecture

Let’s run through the above scenario in detail. For our demonstration use case, and the data lake setup in general, we are implementing a fully private CDP Public Cloud environment with Cloudera Data Flow (CDF) Data Service enabled on Azure. However, as shown above, DLPROC is versatile and can also be utilised on-premises or in other cloud environments, such as AWS.

In the initial phase of this use case, data from various sources, including an Oracle database and SFTP servers, is ingested into the data lake’s silver layer using CDF.

Below you can see the CDF Deployments dashboard, showing the CDF jobs pulling data from these sources. For further insights into building data transfer jobs with CDF, you can explore our dedicated blog post here.

Figure 2: CDF Jobs

Deploying DLPROC

Deploying DLPROC is a one-time process that can be carried out either during the initial setup of the data lake or afterwards. Deployment is not complicated and requires no external packages or dependencies, as the framework is entirely Python-based.

Deployment involves three simple steps:

  1. Create the required directory structure in the cloud (ADLS/S3) or in local storage (HDFS), in line with the customer’s environment:
     - /dlproc/logs – stores logs of the Spark execution
     - /dlproc/property – stores the property file for the use case, containing the Spark SQLs
     - /dlproc/scripts – stores any scripts or shell files required for the use case
  2. Place the necessary framework-specific files within the DLPROC directory.
  3. Create a metadata database named dlproc_datalake_monitoring to house the following tables: dlproc_audit, dlproc_error_details and dlproc_Spark_execution_control (an illustrative sketch follows below).
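As a rough illustration of step 3, the metadata objects could be created with Spark SQL DDL along the following lines. This is only a sketch, not the exact DLPROC schema: the column lists are assumptions inferred from the audit and control-table fields described later in this post.

# Illustrative sketch only: column names are assumptions based on the fields
# described in this post, not the exact DLPROC schema.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dlproc_metadata_setup")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS dlproc_datalake_monitoring")

# Audit table: one row per major action in an ETL run.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dlproc_datalake_monitoring.dlproc_audit (
        flow_name     STRING,
        step          STRING,
        status        STRING,
        error_message STRING,
        spark_app_id  STRING,
        log_timestamp TIMESTAMP
    )
""")

# Execution control table: one row per scheduled flow.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dlproc_datalake_monitoring.dlproc_Spark_execution_control (
        flow_name    STRING,
        window_start TIMESTAMP,
        window_end   TIMESTAMP,
        last_run     TIMESTAMP,
        last_status  STRING
    )
""")
# dlproc_error_details would be created in the same way.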

Figure 3: Metadata Database

Implementing Spark ETL with DLPROC

Now let’s explore the four simple steps to integrate Spark ETL into any use case with DLPROC!

Step 1: Prepare the Property File

The first step is to create a property file containing all the necessary Spark SQL queries for the transformation, arranged sequentially. Each step must have a unique identifier, and the final target query step should be explicitly marked.

Here’s a redacted version of what the property file will look like:

STEP_1_SQLQUERY
SPARK-SQL QUERY 1
STEP_1_TEMPTABLE
Query1_Result_TempName
STEP_2_SQLQUERY
SPARK-SQL QUERY 2
STEP_2_TEMPTABLE
Query2_Result_TempName
STEP_3_SQLQUERY
SPARK-SQL QUERY 3
STEP_3_TEMPTABLE
Query3_Result_TempName
STEP_4_TGTQUERY
INSERT QUERY 4
STEP_4_NEXTSTEP
END

The property file is structured using specific keywords to define each transformation step. Here are the essential sections:

STEP_X_SQLQUERY: This keyword marks the start of an SQL query to be executed. The X represents the sequence number of the step, ensuring the proper order of execution.

STEP_X_TEMPTABLE: This defines the name of a temporary table where the results of the corresponding SQL query will be stored. This table can then be referenced in subsequent queries.

STEP_X_TGTQUERY: This identifies the final target query in the sequence, which typically inserts or updates data in a designated table in the gold layer.

END: This keyword indicates the end of the sequence of steps in the property file.

The property file structure facilitates the organisation and sequencing of SQL queries, making it simple to manage and update ETL processes without complex coding. By defining the ETL logic within a property file, you can easily make adjustments to queries, table names, and execution steps, offering greater flexibility and adaptability to changing data processing needs.
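To make the mechanics concrete, here is a minimal sketch of how a property file in the layout above could drive Spark SQL execution. It is not the DLPROC source code; it simply assumes the keyword-then-value layout shown earlier, registering each step’s result as a temporary view for later steps, and the property file name at the bottom is purely illustrative.

# Minimal sketch, not the actual DLPROC implementation. It assumes the layout
# shown above: a STEP_X_* keyword line followed by its value on the next line.
from pyspark.sql import SparkSession

def run_property_file(spark, path):
    # Read the property file from HDFS/ADLS/S3 and drop blank lines.
    lines = [l.strip() for l in spark.sparkContext.textFile(path).collect() if l.strip()]

    step_result = None
    for keyword, value in zip(lines[0::2], lines[1::2]):
        if keyword.endswith("_SQLQUERY"):
            step_result = spark.sql(value)              # run this step's Spark SQL
        elif keyword.endswith("_TEMPTABLE"):
            step_result.createOrReplaceTempView(value)  # expose the result to later steps
        elif keyword.endswith("_TGTQUERY"):
            spark.sql(value)                            # final insert into the gold table
        elif keyword.endswith("_NEXTSTEP") and value == "END":
            break                                       # end of the configured steps

spark = SparkSession.builder.appName("dlproc_sketch").enableHiveSupport().getOrCreate()
# The property file name below is purely illustrative.
run_property_file(spark, "/dlproc/property/gold_network_connectivity_analysis.properties")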

Step 2: Place the Property File

The property file should then be placed within the appropriate directory structure set up during the DLPROC deployment in the storage layer. This step ensures that the file is accessible to the DLPROC framework during execution.
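For an HDFS-backed environment this can be as simple as copying the file into the /dlproc/property directory; the snippet below is just an example (the file name is a placeholder), and on ADLS or S3 you would use the equivalent object-store tooling instead.

# Example only: copies the use-case property file into the DLPROC property
# directory created at deployment time (the file name is a placeholder).
import subprocess

subprocess.run(
    ["hdfs", "dfs", "-put", "-f",
     "gold_network_connectivity_analysis.properties",
     "/dlproc/property/"],
    check=True,
)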

Step 3: Update the Spark Jobs Execution Control Table

Next, an entry is made in the dlproc_Sparkjobs_execution_control table, which is part of the DLPROC control framework. This entry will manage the job execution based on the Oozie schedule. The table contains crucial information such as the flow name, start and end times, and the status of the last run.

Here’s a sample SQL statement:

INSERT INTO dlproc_monitoring.dlproc_Sparkjobs_execution_control VALUES ('gold_network_connectivity_analysis', '2024-03-04 00:00:00', '2024-03-11 00:00:00', NOW(), 'FAILED');

This entry will be updated after each run by the framework, ensuring that the job is ready for the next scheduled execution.
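Conceptually, each run can then pick up its processing window and last status from this control row before executing the flow. The query below is only a sketch of that idea, and the column names are assumptions consistent with the sample insert above rather than the framework’s actual schema.

# Sketch only: column names are assumptions consistent with the sample INSERT above.
control_row = spark.sql("""
    SELECT window_start, window_end, last_run, last_status
    FROM dlproc_monitoring.dlproc_Sparkjobs_execution_control
    WHERE flow_name = 'gold_network_connectivity_analysis'
""").collect()[0]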

Step 4: Create and Schedule the Oozie Workflow

In the Oozie workflow editor, start by creating a new workflow based on an existing template that includes an SH action. This action is designed to trigger the DLPROC job by executing the relevant shell script within the DLPROC framework. The only parameter to be modified is the ‘argument’, which specifies the name of the property file associated with the use case.
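As a rough sketch of the other side of that contract, a Python driver invoked by the shell script might consume the property file name roughly as follows. This is not DLPROC’s actual entry point, and run_property_file refers to the hypothetical helper from the earlier sketch.

# Hypothetical driver entry point, not DLPROC's actual script: it takes the
# property file name as its only argument, mirroring the Oozie shell-action
# 'argument' described above.
import argparse
from pyspark.sql import SparkSession

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("property_file", help="name of the use-case property file")
    args = parser.parse_args()

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    run_property_file(spark, f"/dlproc/property/{args.property_file}")  # from the earlier sketch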

Next, save the workflow with a descriptive name to facilitate identification. The workflow is then scheduled using a crontab entry that matches the desired execution time. Ensure that the time zone is adjusted to the local time, if necessary.

This seamlessly integrates the Spark job into the DLPROC framework, enabling scheduled job executions along with built-in auditing and error-handling mechanisms. The transformation logic is applied to the silver layer tables, with the results being written into the gold layer based on a predefined schedule.

Key Features of DLPROC

Plugging in Multiple Spark SQL ETLs

DLPROC simplifies integrating Spark-SQL-based ETL workflows by following a straightforward four-step process, enabling seamless scheduling, auditing, and monitoring of ETLs within the framework.

Error Logging

The framework includes a robust error-logging feature that captures and records issues encountered during the ETL process, ensuring a detailed audit trail. Spark logs are conveniently stored in a designated directory within the storage layer, enabling easy access for support teams.

Figure 4: Spark Logs

The dlproc_audit table logs vital information such as the flow name, step, error message, and Spark application ID, ensuring traceability and simplifying debugging. It also ensures that every major action in the ETL process, from executing Spark SQL queries to inserting data into target tables, is logged with a status indicator and timestamp, providing a comprehensive overview of the workflow.
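As an illustration of what such a log write might look like in Spark, here is a small helper; the column names and table layout are assumptions based on the fields listed above, not the framework’s actual code. A wrapper like this could be called from a try/except around each step to record either a success or a failure row.

# Illustrative helper, not the DLPROC implementation; the audit columns are
# assumptions based on the fields described above.
from datetime import datetime

def log_audit(spark, flow_name, step, status, error_message=""):
    app_id = spark.sparkContext.applicationId
    ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    spark.sql(f"""
        INSERT INTO dlproc_datalake_monitoring.dlproc_audit VALUES
        ('{flow_name}', '{step}', '{status}', '{error_message}',
         '{app_id}', TIMESTAMP '{ts}')
    """)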

Below is a sample audit log message clearly indicating an error:

Figure 5: Audit Log Entry

Having an SQL table that audits and logs all Spark errors makes it easy to create dashboards for support and monitoring teams. It also reduces the effort needed to sift through Spark logs manually to identify root causes.
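For example, a support dashboard could surface the most recent failures with a query along these lines (again, the column names are assumptions consistent with the fields described above):

# Example monitoring query; column names are assumptions, not the exact schema.
recent_failures = spark.sql("""
    SELECT flow_name, step, error_message, spark_app_id, log_timestamp
    FROM dlproc_datalake_monitoring.dlproc_audit
    WHERE status = 'FAILED'
    ORDER BY log_timestamp DESC
    LIMIT 20
""")
recent_failures.show(truncate=False)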

Support for Multiple Metadata Databases

DLPROC also supports hosting the metadata database in PostgreSQL, offering flexibility for organisations with specific preferences for metadata storage. For instance, customers uncomfortable hosting audit and configuration tables within the data lake (e.g., in Hive or Kudu) can now easily use PostgreSQL as an alternative.
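As a sketch of that option, Spark can read the audit table straight from PostgreSQL over JDBC; the connection details below are placeholders, and the PostgreSQL JDBC driver needs to be available on the Spark classpath.

# Illustrative only: the JDBC URL, credentials and table name are placeholders.
audit_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<pg-host>:5432/<pg-database>")
    .option("dbtable", "dlproc_audit")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "org.postgresql.Driver")
    .load()
)
audit_df.show(truncate=False)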

Conclusion

DLPROC is a robust and adaptable framework designed to streamline Spark ETL processes in modern data environments, not only simplifying the complexity of data operations but also providing the reliability and control needed to manage, monitor, and optimise ETL workflows efficiently.

With its comprehensive error logging and detailed step-by-step tracking, DLPROC ensures that every aspect of the data transformation journey is transparent and traceable, significantly reducing the time and effort required for troubleshooting. Its flexibility in supporting multiple metadata databases, such as PostgreSQL, Kudu, or Hive, further enhances its versatility.

These capabilities make DLPROC an ideal, free, and lightweight alternative for organisations navigating diverse technological ecosystems. It particularly benefits those relying on scattered in-house solutions or unwilling to invest in more complex ETL frameworks for Spark, such as Cloudera Data Engineering.

If you would like further information about this solution and whether it can meet your specific needs, don’t hesitate to contact us – we’ll be happy to help!