Optimizing an Airflow DAG development lifecycle



Synvert’s mission is to transform teams into high-performing organizations. In one of our projects accelerating an organization’s performance, we took on the responsibility of managing the data infrastructure for a leading automaker to improve its development speed and quality. This article shows you how we solved one of the pain points we identified, leveraging the elasticity of the cloud, a DevOps mindset, and cloud-native capabilities.

The scope and primary objective were to fetch and process the data used to populate vehicle information consumed by different APIs. This involved a data pipeline process, where the data had to be extracted, transformed, validated, and enriched based on a set of predefined rules.

Airflow was the workflow management tool and served as the central component of the data pipeline. In Airflow, each workflow is represented as a Directed Acyclic Graph (DAG) written in Python. Although the technical details of Airflow are beyond the scope of this article, it suffices to say that it plays a crucial role in ensuring data quality and the smooth operation of the system: it fetches data, processes it, and ensures that other components receive the data in the required format, making it a key component of the data infrastructure.
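To make this concrete, here’s a minimal, hypothetical DAG sketch (the IDs, schedule, and task logic are illustrative, not the customer’s actual pipeline):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Fetch raw vehicle data from the source system (illustrative stub)
    pass


def transform():
    # Validate and enrich records against the predefined rules (illustrative stub)
    pass


with DAG(
    dag_id="vehicle_information_pipeline",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The ">>" operator defines the directed, acyclic dependency between tasks
    extract_task >> transform_task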

We realized there was a challenge in the DAGs’ development lifecycle, and we took it as an opportunity to dramatically increase its speed, quality, and developer experience.

If you’re interested in knowing how we improved Lead Time for Change by a factor of 3 while at the same time improving Change Failure Rate, keep reading!

Diving deep into the challenge

Looking at the software development lifecycle of the DAGs, we identified a big bottleneck.

The team was following a GitFlow approach with feature branching. Each developer created their own feature branch, and only after merging and deploying to the end-to-end (E2E) test environment could QA engineers start the feedback loop with the engineers.

Here’s a very simple overview:

Figure 1: Bottleneck on the E2E and integration environment of the SDLC
  1. Ana works on their local development environment, gets direct feedback from unit tests, and runs first validations of the DAGs against mocked data
  2. A pull request is opened by Ana for John to review and, if everything is OK, it’s deployed to the E2E test environment
  3. QA engineers can now perform quality checks against more complete data on the E2E test environment
  4. John has completed his work, but it can’t be deployed, or rather it’s risky to deploy to the E2E test environment, since the QA engineers only have one single shared environment

The existing setup had two major issues.

1) It was underoptimized for the available resources

QA engineers could have been doing more quality checks but were blocked by the worries (described below) that a single shared E2E test instance creates. Needless to say, this was a huge hassle, and it was dragging down the entire release process.

2) It was affecting the validation quality

When performing any kind of quality check on the DAGs, test data management and ensuring a valid initial state of the environment are critical: they prevent false errors and wrong takeaways caused by an incorrect starting state. A lot of coordination and communication was needed between quality engineers and even developers for this to work, making it borderline impossible to manage.

In order to enhance our quality check process, we recognized the need for isolated environments where QA engineers can run E2E tests on specific DAGs without affecting the shared testing environment. For example, when testing a DAG that cleans up database data, it is not ideal to run these tests in a shared environment where others might be relying on the database state for their own tests. Although restoring from snapshots is possible, it is far more advantageous to have dedicated, reproducible, and ephemeral environments with proper test data management for QA testing purposes.

Shifting quality left with APEs

To address this, we decided to enable QA engineers to start having an impact at the pull request stage, or even before, allowing them to perform E2E tests as early as possible in the development process. We did this by launching isolated environments for each feature/DAG, so the QA team can run targeted tests without impacting other ongoing tests. This approach seamlessly integrates E2E tests into our existing GitFlow without any significant changes to the existing processes.

We called these environments Airflow Preview Environments (APEs).

Figure 2: QA Engineers contribute to earlier steps of the development lifecycle
Airflow Preview Environments implementation

In a nutshell, an Airflow Preview Environment needed to be:

  • Isolated
  • Ephemeral
  • Reproducible
  • Deterministic
  • Fully automated
  • Fully packed with everything we need for E2E/integration testing

Our goal was to spin up a minimal operational Airflow environment, isolated from everything else, to be used on demand and then deleted when the DAG has been through the QA engineers’ feedback loop.

How did we integrate them into the development lifecycle?

Figure 3: Git Flow enhanced by APEs
  1. A new branch (feature1) is opened from our main branch (default)
  2. A team member works on their feature1 branch
  3. They then open a pull request (PR) to our main branch
  4. The pull request automatically triggers the spin-up of a dedicated APE
  5. A new environment is automatically created, and the developers can easily share the spun-up APE with the QA Engineers
  6. Once the PR is closed, the APE is automatically destroyed
Tools and technologies used

Here’s an overview of the infrastructure stack and technologies, and how we used them to achieve our goal:

  • Kubernetes — as our orchestration tool. It was already the orchestration tool used for all other environments (dev, staging, production), so it made sense to also use it to deploy our APEs.
  • Helm — for managing Kubernetes-based apps
  • GitHub Actions — all our CI/CD pipelines ran on top of GitHub Actions
  • AWS Secrets Manager — to securely manage the secrets needed
Figure 4: GitHub Actions trigger APE launch

Let’s now take a look at the GitHub Actions workflows responsible for the lifecycle of an Airflow Preview Environment.

1. Defining the triggers to launch the APE

First, we defined the triggers. We want the workflow to run for every PR that is opened or reopened against our default branch, as we see in the image above.

on:
  pull_request:
    types: [opened, reopened]
    branches:
      - default

To launch isolated environments, we need proper namespace naming, and we follow a specific process for it: we start with the predefined string “airflow-preview” and append the name of the origin branch of the pull request (PR). However, Kubernetes imposes a limit of 63 characters on namespace names, and in some cases the concatenated name exceeds this limit.

To address this, we implemented a job that handles the namespace naming. It truncates the name (keeping only the first few hyphen-separated tokens) to a length that still uniquely identifies it within the entire cluster. The output of this job provides the curated namespace name, which is then used by the deploy job.

This way, we guarantee that our namespace names adhere to the character limit while maintaining uniqueness within the cluster.

jobs:
  curate-name:
    runs-on: 'self-hosted'
    steps:
    - name: Curate environment name
      id: curate_name
      run: |
        # Keep only the first four hyphen-separated tokens so the namespace name stays short
        final_env_name=$(echo airflow-preview-${{ github.event.pull_request.head.ref }} | cut -d "-" -f1-4)
        # The "::set-output" command is deprecated; write to GITHUB_OUTPUT instead
        echo "env_final_name=$final_env_name" >> "$GITHUB_OUTPUT"

    outputs:
      env_curated_name: ${{ steps.curate_name.outputs.env_final_name }}
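As a quick illustration of the truncation, using the hypothetical branch name shown later in Figure 5:

$ echo airflow-preview-feature-1-add-functionality | cut -d "-" -f1-4
airflow-preview-feature-1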
2. Deploying the components and making the environment accessible

The next step is to deploy the components. Since these specific charts are hosted in the same repository, we first check out the code and then deploy the secrets chart followed by the Airflow chart. We do this using the bitovi/github-actions-deploy-eks-helm action. You can check the reference for this action here.

  deployment:
    runs-on: 'self-hosted'
    needs: [curate-name]
    environment:
      name: ${{ github.event.pull_request.head.ref }}
      url: https://${{ needs.curate-name.outputs.env_curated_name }}.your-domain.com
    steps:
    - uses: actions/checkout@v3

    - name: Deploy Airflow Secrets Helm Chart
      uses: bitovi/github-actions-deploy-eks-helm@v1.1.0
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ secrets.AWS_REGION }}
        cluster-name: ${{ secrets.CLUSTER_NAME }}
        config-files: helm/airflow-secrets/values.yaml
        chart-path: ./helm/airflow-secrets
        namespace: ${{ needs.curate-name.outputs.env_curated_name }}
        name: airflow-secrets
        version: 1.0.0
        values: aws.access_key_id=${{ secrets.AWS_ACCESS_KEY_ID }},aws.secret_access_key=${{ secrets.AWS_SECRET_ACCESS_KEY }},git.ssh_key=${{ secrets.AIRFLOW_GIT_SSH_KEY }}
        timeout: 60s
        atomic: true

    - name: Deploy Airflow Helm Chart
      uses: bitovi/github-actions-deploy-eks-helm@v1.1.0
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ secrets.AWS_REGION }}
        cluster-name: ${{ secrets.CLUSTER_NAME }}
        config-files: helm/airflow_values_preview.yaml
        chart-path: apache-airflow/airflow
        namespace: ${{ needs.curate-name.outputs.env_curated_name }}
        name: airflow2
        chart-repository: https://airflow.apache.org/
        version: 1.6.0
        values: ingress.web.hosts={${{ needs.curate-name.outputs.env_curated_name }}.your-domain.com},dags.gitSync.branch=${{ github.event.pull_request.head.ref }}
        timeout: 360s
        atomic: true
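For intuition, actions like this one essentially wrap helm upgrade --install. Assuming NAMESPACE and BRANCH hold the curated namespace and the PR branch, the second step is roughly equivalent to the following sketch (not the action’s exact internals):

# Add the official chart repository, then install or upgrade the release
helm repo add apache-airflow https://airflow.apache.org/
helm upgrade --install airflow2 apache-airflow/airflow \
  --version 1.6.0 \
  --namespace "$NAMESPACE" --create-namespace \
  -f helm/airflow_values_preview.yaml \
  --set "ingress.web.hosts={$NAMESPACE.your-domain.com}" \
  --set "dags.gitSync.branch=$BRANCH" \
  --timeout 360s \
  --atomic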

We used GitHub Secrets to safely pass some values to our Helm charts. For example, our APE synchronizes the DAGs directly from a Git repository and, in order to do that, it requires an SSH key that is stored in GitHub Secrets. You can see the field git.ssh_key in the values entered in the “Deploy Airflow Secrets Helm Chart” step of the snippet above.
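The chart itself is out of scope here, but as a rough sketch, the airflow-secrets chart essentially renders those values into Kubernetes Secret objects. A hypothetical template for the gitSync key (the Airflow chart’s dags.gitSync.sshKeySecret expects a secret containing the key gitSshKey) could look like:

# helm/airflow-secrets/templates/git-ssh-secret.yaml (hypothetical template)
apiVersion: v1
kind: Secret
metadata:
  name: airflow-ssh-secret
type: Opaque
data:
  # b64enc because values under "data" must be base64-encoded
  gitSshKey: {{ .Values.git.ssh_key | b64enc }}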

In this job, we also create a GitHub environment so that the developers have a direct link to the APE in the GitHub UI.

Figure 5: Environment feature-1-add-functionality set to active
3. Ensuring security with proper secrets management

Airflow allows you to store connections (to an AWS S3 bucket, for example) and variables. You can create them manually or load them via a JSON file, for example. By using the Airflow Secrets Manager backend, Airflow consumes these values directly from AWS without leaving any “trails” of them in the cluster. This is where AWS Secrets Manager comes in handy.
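For illustration, seeding a variable and a connection under the prefixes configured below could look like this (names and values are hypothetical):

# Hypothetical secrets; Airflow resolves "<prefix>/<variable key or conn_id>"
aws secretsmanager create-secret \
  --name airflow/variables/environment_name \
  --secret-string "airflow-preview"

aws secretsmanager create-secret \
  --name airflow/connections/vehicle_data_s3 \
  --secret-string "aws://@/?region_name=eu-west-1"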

If you use the AWS Secrets Manager backend, don’t forget to pass the AWS access secrets as we do in the “Deploy Airflow Secrets Helm Chart” job.

backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables", "profile_name": "default"}
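For reference, backend_kwargs belongs in the [secrets] section of airflow.cfg (or the equivalent environment variables / Helm values), alongside the backend class from the Amazon provider package:

[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables", "profile_name": "default"}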

With this new environment deployed, QA engineers can safely run their quality tests for as long as they want without being concerned about impacting the work of a peer.

4. Cleaning up

Once the quality check is approved by the QA Engineer and the code is merged, the pull request is closed, automatically triggering the clean-up flow of the Airflow Preview Environment.

Now we need to clean up the Kubernetes namespace we created and the GitHub environment that we no longer need.
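The trigger for this cleanup workflow mirrors the launch trigger, firing when the PR is closed (whether merged or not); a minimal sketch, with the deletion steps below running inside that workflow:

on:
  pull_request:
    types: [closed]
    branches:
      - default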

    - name: Deleting namespace
      uses: ianbelcher/eks-kubectl-action@master
      with:
        aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws_region: ${{ secrets.AWS_REGION }}
        cluster_name: ${{ secrets.CLUSTER_NAME }}
        args: delete ns ${{ needs.curate-name.outputs.env_curated_name }}

    - name: Deleting environment
      run: |
        curl -X DELETE -H "Accept: application/vnd.github+json" -H "Authorization: Bearer ${{ secrets.AUTH_TOKEN }}" [your-repo-url]/api/v3/repos/your-org/your-repo/environments/${{ github.event.pull_request.head.ref }}

To delete the namespace, we use the ianbelcher/eks-kubectl-action action; for the environment deletion, we perform a simple curl DELETE request to the GitHub API, providing the name of the environment to erase.

Final thoughts

At synvert, our mission is to empower teams and transform them into high-performing organizations. Through our collaborations with prominent companies and teams, such as the one described in this article, we have reaffirmed our belief that success lies in paying attention to the finer details. By implementing a seemingly simple adjustment, leveraging the elasticity of the cloud, and adopting cloud-native practices, we enabled our customers to achieve substantial improvements in various crucial areas.

Notably, this modification had a significant impact on the speed, quality, and developer experience of the team involved in the project. By reducing the feedback loop and minimizing the lead time for changes, our client experienced enhanced agility, which in turn led to improved business outcomes.

We understand that even the smallest tweaks can bring about substantial benefits, and we strive to help organizations unlock their full potential by focusing on these critical elements. Through our efforts, we aim to support teams in achieving remarkable efficiency, productivity, and ultimately, remarkable success.