This blog post is part of a series that covers the development of an end-to-end Machine Learning Operations (MLOps) solution using various technologies and cloud platforms. In our previous blog post about end-to-end MLOps in Azure, we covered concepts such as using CI/CD for automating model testing and promotion, introduced MLflow, and presented a use case with the Marketing Campaign dataset from Kaggle.

In this article, we'll show you how to build an end-to-end MLOps pipeline with Databricks and GitHub Actions, using the same approach and data as in the previous blog post.

Why Databricks?

This is a recurring question every tech adopter faces when deciding which technology to use, and there are many reasons for choosing Databricks:

  • Unified platform: Databricks provides a unified analytics platform that integrates with major data sources, allowing data engineers, data scientists, and ML engineers to collaborate seamlessly in building, deploying, and managing ML models.
  • Scalability: Databricks offers scalable infrastructure with distributed computing capabilities, which is crucial for handling large datasets and training complex ML models efficiently.
  • Managed Spark environment: Databricks provides a managed Apache Spark environment, which simplifies the process of setting up, configuring, and maintaining Spark clusters for data processing and model training.
  • Integration with MLflow: Databricks integrates flawlessly with MLflow, an open-source platform created by Databricks for managing the end-to-end ML lifecycle. MLflow provides tracking, experimentation, and model management capabilities, essential for implementing MLOps best practices.
  • ML reusability: Databricks provides a Feature Store as a centralised repository, allowing data scientists to store and share features used for model training and inference, enabling discoverability and reusability.
  • Collaboration and version control: Databricks provides collaboration features such as notebook sharing and version control, allowing teams to work together effectively and to track changes to code and models over time.
  • Integration with other tools: Databricks also integrates with a wide range of tools and frameworks commonly used in the ML ecosystem, such as TensorFlow, PyTorch, scikit-learn, and more, making it both flexible and adaptable to different workflows and technologies.
  • Automated ML (AutoML) capabilities: Databricks offers AutoML capabilities through tools like MLflow and automated feature engineering libraries, enabling data scientists to automate repetitive tasks and accelerate the model development process.
  • Security and compliance: Databricks boasts robust security features and compliance certifications, ensuring that sensitive data and models are protected and meet regulatory requirements.

Overall, Databricks offers a comprehensive platform that effectively addresses the key challenges in deploying and managing ML models at scale, making it ideal for MLOps workflows.

Solution Overview

Now we're going to present our solution to create a Databricks workflow that builds and develops an ML model using MLflow, leveraging the model training code discussed in our previous blog post. This MLOps CI/CD workflow involves the following steps:

1) Developing a model when a change is triggered in the GitHub repository that stores the model notebook, and storing the new model version in the Databricks Model Registry.

2) Performing data validations and evaluating performance metrics with the new model.

3) Comparing the performance metrics of the new model with those of the production model, and promoting the new model if it performs better.

In conjunction with this MLOps CI/CD pipeline, we will also monitor data drift in our model using the Databricks Lakehouse Monitoring feature, which is in Preview at the time of writing.

The Databricks MLOps Pipeline

There are multiple ways to implement an MLOps pipeline, and although several approaches might suit your use case, we have designed and implemented one possible approach, for a single-workspace environment, that encapsulates the three basic steps any MLOps pipeline needs: building, testing, and promoting a model.

Figure 1: MLOps Pipeline Steps

These three steps are contained in different notebooks, as explained below, and correspond to the steps outlined in the solution overview:

  • Build (create_model_version notebook): Reading the raw data, applying preprocessing techniques for featurisation, storing the processed features to be used in model training in the Feature Store, training the model, generating predictions with the trained model, and using MLflow to create a new model version in the Model Registry.
  • Test (test_model notebook): Retrieving the new model with MLflow, comparing its input and output schemas with the expected schemas, and using the features from the Feature Store to generate predictions.
  • Promote (promote_model notebook): Comparing the metrics (accuracy and F1 score) of the new model and the model in production, and promoting the new model if it performs better; a minimal sketch of this promotion logic is shown after this list.
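For illustration, here is a minimal sketch of what that promotion logic could look like, assuming a hypothetical registered model name (marketing_model), metric keys (accuracy, f1_score) logged during training, and the stage-based Model Registry workflow:

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "marketing_model"  # hypothetical registered model name

# Latest unassigned version (the candidate) and the current Production version, if any
candidate = client.get_latest_versions(model_name, stages=["None"])[0]
production = client.get_latest_versions(model_name, stages=["Production"])

def get_metrics(version):
    # Metrics (assumed keys: "accuracy", "f1_score") logged in the training run of this version
    return client.get_run(version.run_id).data.metrics

if not production:
    promote = True  # no Production model yet, so promote the candidate
else:
    cand_metrics, prod_metrics = get_metrics(candidate), get_metrics(production[0])
    promote = (cand_metrics["accuracy"] >= prod_metrics["accuracy"]
               and cand_metrics["f1_score"] >= prod_metrics["f1_score"])

if promote:
    # Move the candidate to Production and archive the previous Production version
    client.transition_model_version_stage(
        name=model_name,
        version=candidate.version,
        stage="Production",
        archive_existing_versions=True,
    )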

We used Databricks Workflows and GitHub Actions to execute these three steps. Every time a push is made to the GitHub repository's main branch or a PR is opened, a job run is created to execute these notebooks consecutively.

For the sake of simplicity, we use Databricks Repos to manually clone our GitHub repository into a folder in our workspace. This folder is referenced in the workflow to point each task to the corresponding notebook, so it is necessary to pull the corresponding branch (PR or main) in the Databricks Repo before running the workflow with the three tasks. To achieve this, our GitHub Action must install the Databricks CLI, update the Databricks Repo to the corresponding branch, and then create and run the job:

name: MLOps workflow

on:
  push:
    branches:
      - main
  pull_request:
    types: [opened, synchronize, reopened]
    branches:
      - main

jobs:
  build_and_deploy_job:
    if: github.ref == 'refs/heads/main' || github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    name: Build and Deploy Job
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true
          lfs: false

      - name: Set up Databricks CLI
        uses: databricks/setup-cli@main

      - name: Checkout to repo in Databricks Workspace
        env:
          DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          databricks repos update {your_repo_id} --branch ${{ github.head_ref || github.ref_name }}

      - name: Create Databricks MLOps job
        env:
          DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          databricks jobs submit --json @job.json

Note that it is necessary to add the DATABRICKS_HOST variable and the DATABRICKS_TOKEN secret in the GitHub repository settings. The {your_repo_id} placeholder should also be replaced with the corresponding repository ID, which can be obtained with the Databricks CLI repos list command.

The workflow job run is configured in a JSON file. The job contains three notebook tasks with the corresponding dependencies and ACLs. In this example we are using an existing cluster, but we strongly recommend using job clusters as they are more economical:

{
  "tasks": [
    {
      "task_key": "create_model_version",
      "description": "Create a new model version with the corresponding changes",
      "timeout_seconds": 600,
      "notebook_task": {
        "notebook_path": "/Repos/Production/marketing/create_model_version ",
        "source":"WORKSPACE"
      },
      "depends_on": [],
      "existing_cluster_id": "{your_cluster_id}"
    },
    {
      "task_key": "test_model",
      "description": "Test new model version",
      "timeout_seconds": 120,
      "notebook_task": {
        "notebook_path": "/Repos/Production/marketing/test_model",
        "source":"WORKSPACE"
      },
      "depends_on": [
        {
          "task_key": "create_model_version"
        }
      ],
      "existing_cluster_id": "{your_cluster_id}",
      "run_if": "ALL_SUCCESS"
    },
    {
      "task_key": "promote_model",
      "description": "Promote model into Production stage if presents better results",
      "timeout_seconds": 120,
      "notebook_task": {
        "notebook_path": "/Repos/Production/marketing/promote_model",
        "source":"WORKSPACE"
      },
      "depends_on": [
        {
          "task_key": "test_model"
        }
      ],
      "existing_cluster_id": "{your_cluster_id}",
      "run_if": "ALL_SUCCESS"
    }
  ],
  "run_name": "Marketing Model MLOps",
  "access_control_list": [
    {
      "user_name": "info@clearpeaks.com",
      "permission_level": "IS_OWNER"
    }
  ]
}

The created workflow looks like this in the Databricks Workflows UI:

Figure 2: MLOps workflow for build, test & promote

With this pipeline in place, we can be sure that the best version of our model is going to be in Production, available for inference.

Feature Store

One Databricks capability that we have used in this demonstration is the Feature Store. Serving as a centralised repository, the Feature Store facilitates the storage and sharing of features for model training and inference.

Figure 3: Databricks Feature Store

Before training our model, after all the featurisation has been carried out, we convert our pandas code to PySpark and use the FeatureStoreClient library to create the Delta table:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# df is the Spark DataFrame holding the processed features
fs.create_table(
    name='feature_store.marketing_available_features',
    primary_keys='ID',
    df=df,
    description='Marketing available features'
)

Once our features have been stored, we can use the Feature Store in the training notebook for model development, and in the testing notebook to generate predictions with the new model version:

from databricks.feature_store import FeatureStoreClient

# Read data from Feature Store
fs = FeatureStoreClient()
df = fs.read_table(
  name='feature_store.marketing_available_features',
).toPandas()

# pyfunc_model is the new model version, loaded earlier with mlflow.pyfunc.load_model()
predictions = pyfunc_model.predict(df)
predictions

With this approach, we are reusing the stored features in the testing phase without needing to pre-process any data again to perform our tests. Moreover, these features are available in the Machine Learning section of the Databricks UI, under the Features tab. This enables users to check all the stored features, their producers, metadata such as tags and created/modified dates, and the features' schema. These features can also be published online to share and to monetise your data!

Figure 4: Feature Store UI

Drift Monitoring in Databricks

A crucial part of MLOps, besides automating model deployment, is monitoring data drift and model performance drift over time to maintain prediction quality:

Figure 5: Databricks Drift Detection

Databricks provides a monitoring feature called Databricks Lakehouse Monitoring that allows you to check the statistics and quality of your data. It can be used both for data quality monitoring and for inference monitoring. This feature is in Preview at the time of writing, but we still wanted to share its potential with you!

In our case, we used another subset of the original dataset to perform some inferences with different versions of the model across various time windows. These inferences have been stored in a Delta table with some additional columns (a minimal sketch of how such a table could be written is shown after the list):

– Prediction column, to store the inference result.

– Model ID column, to specify the model version used.

– Timestamp column, to record when the inference was made.

– Label column, to store the true value of the prediction, which is available in our case. This column is optional if the true value isn't available (e.g., acquiring the true value can be challenging in models served online).
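As an illustration, this is roughly how such an inference table could be populated; the model name and version and the subset_pdf pandas DataFrame are hypothetical, while the column and table names match those used in the monitor configuration below:

import mlflow
from pyspark.sql import functions as F

# Hypothetical: a pandas subset of the original dataset and a specific model version
model = mlflow.pyfunc.load_model("models:/marketing_model/3")

scored_pdf = subset_pdf.copy()
scored_pdf["predicted"] = model.predict(subset_pdf)   # prediction column
scored_pdf["model_version"] = "3"                     # model ID column
# The true label ("Response") is already part of the subset in our case

(spark.createDataFrame(scored_pdf)
    .withColumn("ts", F.current_timestamp())          # timestamp column
    .write.format("delta")
    .mode("append")
    .saveAsTable("ml.inference.marketing_campaign_simulation"))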

This is the code to create the monitor:

from databricks import lakehouse_monitoring as lm

monitor = lm.create_monitor(
    table_name=f"ml.inference.marketing_campaign_simulation",
    profile_type=lm.InferenceLog(
        problem_type="classification",  # We use a RandomForest
        prediction_col="predicted",
        model_id_col="model_version",
        label_col="Response",
        timestamp_col="ts",
        granularities=["1 day"],  # daily inferences
    ),
    output_schema_name=f"ml.inference"
)

Once executed, a background job is created to generate two tables that store the profile and drift metrics from the inference table. A dashboard is also generated automatically to visualise the information stored in these tables. Here we can find information about model performance and statistics, data integrity, data drift (categorical and numerical), data profiles, and fairness and bias. What's more, you can filter the visualisations by time range, slice value, inspection window, comparison models, granularity, slice key, and model ID. The visualisations in the Lakehouse Monitoring Dashboard can be grouped into two types: model performance drift and data drift.
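The generated metric tables are regular Delta tables, so they can also be queried directly from a notebook. The sketch below assumes the default naming convention in which Lakehouse Monitoring appends _profile_metrics and _drift_metrics suffixes to the monitored table name in the output schema:

# Inspect the generated metric tables directly; table names assume the default
# "_profile_metrics" / "_drift_metrics" suffixes used by Lakehouse Monitoring
drift_df = spark.sql(
    "SELECT * FROM ml.inference.marketing_campaign_simulation_drift_metrics"
)
display(drift_df)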

Figure 6: [Model Performance Drift] Model statistics in the latest window

Figure 7: [Model Performance Drift] Inferences by date (inspection window)

Figure 8: [Model Performance Drift] Confusion Matrix for the inspection window

Figure 9: [Data Drift] Data Quality Checks with respect to nulls and zeroes in each column in the inspection window

Figure 10: [Data Drift] Data Distribution (Line Chart, Box Plot) for Numeric Columns

Figure 11: [Data Drift] Data profiling with summary statistics by inspection window

Bear in mind that we transformed a subset of the original data to perform these inferences, so the visualisations here might look a bit odd. Nevertheless, it's easy to see the potential.

The Lakehouse Monitoring Dashboard enables the quick detection of shifts in our model metrics and in the inferred data. We can detect variations in data distribution with the data integrity and data profile visualisations, and we can use the model overview statistics to monitor our model performance over time. This drift monitoring process allows us to identify problems that might require retraining the model.

Conclusions

Databricks is an exceptional product for developing your company's ML use cases and applying MLOps best practices to productionise your model rollout strategy, thanks to the seamless integration of MLflow with Databricks. Moreover, features such as the Feature Store and Lakehouse Monitoring take ML engineering to the next level.

If you are considering adopting Databricks, or if you are thinking about how to leverage ML in your company, don't hesitate to contact us. Our team of experts will be thrilled to assist you in unlocking the full potential of your data to achieve your business goals!