Introduction

Azure is a cloud computing platform with a wide range of services that provide solutions for various business applications, including database hosting, remote storage, ETL services, and more. In this blog we take a comprehensive look at Azure Data Factory in combination with Databricks.

Azure Data Factory is a useful tool for performing ETL processes; it moves data from various sources into a data lake or data warehouse. It helps us easily orchestrate data movement and work with data that is then consumed by end users and systems such as BI tools. It also allows us to create data-driven workflows called pipelines, and thanks to its visual tools, constructing ETL processes is very easy.

In the following article, we will go through all the steps to create an ETL pipeline in Data Factory using the Databricks Python Activity.

Prerequisites

1. An Azure Data Factory resource should be created in the Azure portal and configured with Git integration via GitHub or Azure DevOps.

2. A Databricks workspace should be configured in the Azure subscription, with a cluster and a file containing the transformation logic.

3. An Azure Data Lake Storage account should be configured in the Azure subscription.

Perform data ingestion using the Copy Data tool to copy raw data

1. Navigate to the Azure Data Factory home page in your browser and create a new pipeline.

2. Next, under Activities → Move and Transform, select the Copy Data activity as shown in the figure below and drag it onto the workspace.
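
For reference, behind the visual editor each activity is stored as part of the pipeline's JSON definition. A rough, hypothetical sketch of what the Copy activity might look like, written here as a Python dict purely for illustration (all names and source/sink types are placeholders), could be:

```python
# Illustrative sketch only: roughly how a Copy activity appears in the
# pipeline's underlying JSON definition. All names are hypothetical.
copy_activity = {
    "name": "CopyRawData",
    "type": "Copy",
    "inputs": [{"referenceName": "SourceDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "RawSinkDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},  # depends on the source dataset format
        "sink": {"type": "DelimitedTextSink"},      # depends on the sink dataset format
    },
}
```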

3. Create a linked service to connect to the source dataset. In this step, we need to fill in the details required for the connection.

Azure supports connections to many external data sources as well as native Azure data sources.

4. Similarly, we create a linked service to connect to the sink dataset. In our case, this should be Azure Data Lake Storage so that we can use it later for the transformation with Databricks. We should already have created a container in ADLS to store the sink dataset, which will be used in this step.
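
As an illustration, the ADLS Gen2 linked service created in this step also boils down to a small JSON definition behind the UI. A hedged sketch of its approximate shape, shown as a Python dict with placeholder values, might look like this:

```python
# Illustrative sketch: approximate shape of an ADLS Gen2 (sink) linked
# service definition. The account name and key are placeholders.
adls_linked_service = {
    "name": "AzureDataLakeStorageLS",
    "properties": {
        "type": "AzureBlobFS",  # Data Factory's type name for ADLS Gen2
        "typeProperties": {
            "url": "https://<storage-account>.dfs.core.windows.net",
            # In practice the key (or service principal secret) should be
            # referenced from Azure Key Vault rather than stored inline.
            "accountKey": {"type": "SecureString", "value": "<storage-account-key>"},
        },
    },
}
```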

5. We can test the connections to make sure they work by selecting the linked service in the left navigation bar and then using the Test Connection button, as shown in the figure.

Once this is complete, we can run this part of the pipeline in debug mode to make sure the data copy works. We should check the sink container in ADLS to verify that the Copy Data activity ran correctly and that new data is present after the run. In the next steps, we will look at how to transform this data using the Databricks Python Activity.
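
Besides browsing the container in the portal, this check can also be scripted. A minimal sketch using the azure-storage-file-datalake package follows; the storage account, container, and folder names are hypothetical:

```python
# Minimal sketch: list what the Copy activity wrote to the sink container.
# Account, container, and folder names are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
file_system = service.get_file_system_client("raw")  # the sink container

for path in file_system.get_paths(path="ingested"):  # folder targeted by the Copy activity
    print(path.name, path.last_modified)
```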

Perform a transformation job using Databricks Python Activity

1. Once the data copy activity is implemented, we can add a Databricks Python Activity job that will transform the data ingested by the Copy Data job. Databricks provides a unified, open platform for powerful ETL, analytics, and machine learning, so we take the opportunity here to incorporate Databricks activities into the Data Factory pipeline.

The blocks can be connected by dragging an arrow from the green square on the right of the Copy data block to the Python transform job; this arrow defines the execution path of the end-to-end pipeline.

2. Then we need to configure the Databricks Python Activity. First, we connect this activity to a Databricks linked service, either by creating a new one or using an existing one. The linked service connects to the Databricks workspace and runs the Python Activity by deploying a configured Databricks cluster, and it should be tested for a successful connection. Important parameters for connecting to the Databricks workspace are the workspace name, workspace URL, authentication type, and cluster configuration such as node type and autoscaling.
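
For reference, these parameters end up in the linked service's JSON definition. A hedged sketch of its approximate shape, with placeholder values for the workspace URL, token, and cluster settings, could look roughly like this:

```python
# Illustrative sketch: approximate shape of an Azure Databricks linked
# service that spins up a new job cluster. All values are placeholders.
databricks_linked_service = {
    "name": "AzureDatabricksLS",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://<workspace-instance>.azuredatabricks.net",
            # A personal access token or managed identity can be used;
            # ideally the token is referenced from Azure Key Vault.
            "accessToken": {"type": "SecureString", "value": "<databricks-token>"},
            "newClusterVersion": "13.3.x-scala2.12",
            "newClusterNodeType": "Standard_DS3_v2",
            "newClusterNumOfWorker": "2:8",  # a min:max range enables autoscaling
        },
    },
}
```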

3. The next step is to specify the DBFS path of the Python file to be executed. This is the main file that will be executed by this activity; it must contain the transformation logic to transform the ingested data and write the transformed data back to Azure Data Lake Storage.
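
What exactly goes into that file is workload-specific, but a minimal PySpark sketch of such a transformation script could look as follows. The storage account, container, folder, and column names are made up for illustration, and the cluster is assumed to be configured with access to the storage account:

```python
# transform_raw_data.py - minimal sketch of a Databricks transformation script.
# Storage account, container, folder, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net/ingested/"
curated_path = "abfss://curated@<storage-account>.dfs.core.windows.net/sales/"

# Read the data that the Copy activity landed in ADLS.
df = spark.read.option("header", True).csv(raw_path)

# Example transformation: drop incomplete rows, fix types, derive a date column.
transformed = (
    df.dropna(subset=["order_id"])
      .withColumn("amount", F.col("amount").cast("double"))
      .withColumn("order_date", F.to_date("order_date"))
)

# Write the transformed data back to ADLS for downstream consumers.
transformed.write.mode("overwrite").parquet(curated_path)
```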

4. The next step is to specify optional parameters and libraries if required. Different library types can be used, e.g. PyPI packages, Python wheels, etc. After specifying the library type, we also need to specify the DBFS URI of the Python library or the PyPI package name. These are dependent libraries required by the main Python file, as shown in the figure below.
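
For illustration, the library settings of the activity translate into entries like the following; the package version and the wheel path are placeholders:

```python
# Illustrative sketch: dependent libraries as they might appear in the
# Databricks Python activity definition. Version and wheel path are placeholders.
libraries = [
    {"pypi": {"package": "requests==2.31.0"}},
    {"whl": "dbfs:/libraries/my_transform_helpers-0.1.0-py3-none-any.whl"},
]
```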

5. Once these settings are in place, open the Databricks workspace and check the DBFS paths of the main Python file and the Python wheel file configured in Data Factory, as shown in the figure below.
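
One quick way to do this check is from a notebook in the Databricks workspace (dbutils is only available inside Databricks). The paths below reuse the hypothetical locations from the earlier sketches:

```python
# Run in a Databricks notebook: confirm that the files referenced by the
# Data Factory activity actually exist in DBFS. Paths are hypothetical.
for f in dbutils.fs.ls("dbfs:/scripts/"):    # should contain transform_raw_data.py
    print(f.path)
for f in dbutils.fs.ls("dbfs:/libraries/"):  # should contain the wheel file
    print(f.path)
```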

Our end-to-end pipeline is now complete. We need to validate, test, and publish it. The pipeline can be validated using the Validate button. Once validation passes, the pipeline can be run in debug mode to test the results, and once the transformed data is in ADLS, we can verify its correctness.
After testing, the pipeline can be published using the Publish command, which merges the current branch into the publish branch. All production-ready workloads should be merged into the publish branch for regular deployment.

The last step is to add a trigger that sets the execution frequency of the pipeline. Here we need to specify the start date, the time zone, and the recurrence of the job.
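
Under the hood, a schedule trigger is again a small JSON definition. A hedged sketch of a daily recurrence, with a placeholder pipeline name, start time, and time zone, might look roughly like this:

```python
# Illustrative sketch: approximate shape of a schedule trigger that runs
# the pipeline once a day. Names, start time, and time zone are placeholders.
schedule_trigger = {
    "name": "DailyIngestTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2023-01-01T06:00:00",
                "timeZone": "W. Europe Standard Time",
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "IngestAndTransformPipeline",
                                   "type": "PipelineReference"}}
        ],
    },
}
```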

Conclusion

We have covered all the steps to successfully create an ETL pipeline using Databricks, orchestrated by Azure Data Factory. As shown above, managing the pipeline becomes quite easy with Data Factory's GUI tools. Monitoring the pipeline is also quite easy by connecting these pipelines to Azure Monitor. With the help of Azure Monitor, alerts can easily be created to inform us immediately about any error in the pipeline execution.

The ability to include Databricks allows us to write complex transformation code in Spark, which is not possible with native Data Factory tools. Therefore, a combination of Data Factory and Databricks is perfect for creating complex ETL pipelines in Azure.