What is a CDC Service?

In modern data management, Change Data Capture (CDC) is essential to ensure that data across systems is always up to date. CDC services detect changes made to source systems, such as inserts, updates, and deletes, and capture them in real time, enabling a continuous flow of fresh data to target systems without having to move the entire dataset.

A best practice in BI data platforms is to mirror source data to replicas in a cloud data warehouse (CDW) or alternative storage solution. This not only allows for real-time data availability, but also prevents business users from directly accessing raw data in transactional systems, thus avoiding potential bottlenecks or system slowdowns. By creating a layer of replicas, businesses can keep their transactional systems free from the load of reporting queries, whilst maintaining up-to-the-minute data synchronisation for analytics. This combination of data mirroring and real-time synchronisation is crucial to enable the fast, data-driven decisions you need in today’s competitive environment.

In previous blog posts, we’ve shown you CDC with Snowflake and NiFi. In this blog post, we’ll explore the CDC Service from Matillion.

What is Matillion DPC (Data Productivity Cloud)?

Matillion DPC, or Data Productivity Cloud, is a cloud-native platform designed to streamline the complete data lifecycle, from ingestion to transformation and analytics. Matillion integrates smoothly with leading cloud data warehouses like Snowflake, Amazon Redshift, Databricks, and Google BigQuery.

Matillion DPC’s key offerings include:

  • Data Ingestion: Seamlessly load data from diverse sources such as databases, SaaS applications, and APIs.
  • CDC: Real-time data replication, ensuring that data across systems remains current without needing to reload entire datasets.
  • Data Transformation: Leverage Matillion’s intuitive interface to transform, enrich, and prepare data for advanced analytics.
  • Orchestration: Automate and orchestrate workflows to guarantee smooth and efficient pipeline executions.

At the time of writing, the list of supported source databases is Oracle, IBM Db2, Microsoft SQL Server, PostgreSQL, and MySQL.

Now let’s dive deeper into this service!

Mastering Matillion CDC Service: What You Need to Know

Matillion’s CDC service is designed to capture changes in data by tracking the source system change logs and replicating them in real time to the target database. For instance, imagine you’re tracking customer orders in an e-commerce database. Whenever an order is placed, updated, or cancelled, CDC captures these changes and instantly mirrors them in your cloud data warehouse. Instead of reloading the entire table, CDC tracks only the modifications, optimising data flow and performance.

Source: Matillion.com

CDC architecture is relatively straightforward: Matillion DPC SaaS is where the CDC agent is created and managed, serving as the control plane as shown above. Here, agents are configured to handle data loading, and pipelines manage agent operations, scheduling, and processing, all as part of the DPC package.

The CDC agent, a key component, is a containerised image housing the actual Matillion CDC agent software. This agent captures data changes, pushes them to the target database, and updates DPC on its status. Each agent operates as a single entity, managing one dataflow at a time, which means that a separate agent container is required for each CDC pipeline. Once active, the agent appears as “connected” on Matillion DPC, allowing it to be assigned to a specific pipeline and begin streaming data.

The source database can be in a Virtual Private Cloud (VPC) or publicly accessible from any location.

Cloud storage and secret management are required to run the agent and are included among the resources generated for it. For the use case presented in this blog post, an Azure Storage Account will be used for storage, and Azure Key Vault for secret management.

In real-world applications, mirrored data typically undergoes an ETL process to tailor it to specific needs, which is why this is represented as part of the Data Plane (Customer VPC) in the diagram.

The last piece in the architecture, located outside your VPC, is the cloud data warehouse; you can choose one of the CDW providers mentioned at the beginning of this article.

In the following sections, we’ll explore Matillion CDC in detail, with a focus on generating a real-time data flow between an on-premises PostgreSQL database and a Snowflake cloud data warehouse. This practical example will demonstrate how businesses can maintain continuously updated data across environments, significantly enhancing their data analytics capabilities.

Setting up Matillion CDC with Azure and Terraform

Configuring Matillion CDC involves several steps, especially when setting it up in a cloud environment like Microsoft Azure.

In this article, we will set up a CDC pipeline that connects a PostgreSQL database to a targeted Snowflake instance on Azure using Terraform, an Infrastructure as Code (IaC) tool that simplifies the agent creation process. A comprehensive, step-by-step technical guide is available in the Matillion documentation.

Step 1: Choose Your Cloud Environment

Matillion CDC supports various cloud providers, including AWS, Azure, and GCP, as well as deployment technologies such as ARM, Terraform, and Kubernetes.

Step 2: Create a Matillion CDC Agent on Matillion Data Loader

On the Matillion DPC main page, go to the Data Loader section, then to the Agents page to create an agent container using the wizard. First, type an intuitive name and an optional description so that relevant information is displayed when selecting the agent.

As mentioned before, selecting a cloud provider is the first step in this process.

You’ll also have to choose the deployment technology for agent creation.

At the end of the wizard setup, three important keys will appear:

  • Agent ID: This unique identifier for the agent is one of the environment variables you will define in your Terraform variables.
  • Organization ID: This identifier for the organisation to which the agent belongs is also an environment variable to be set in your Terraform configuration.
  • WebSocket Endpoint: This endpoint enables communication between the agent and the Matillion DPC instance, and it too must be defined as an environment variable in Terraform.

If you haven’t already generated one, you’ll also need a public-private key pair to authenticate the agent to the DPC instance. Note that the key is displayed only once, so to avoid authentication issues, especially if managing multiple agents, store this unique private key in a secure location.

Save all agent configuration details in a file on your PC, as they will be required for the Terraform template. Remember that if you create multiple agents for the same organisation within the same DPC instance, only the agent IDs will differ; all other configuration details will remain the same.
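For reference, these details can be kept together in a Terraform variables file. A minimal sketch (the variable names and values are illustrative placeholders, not an official Matillion template):

```hcl
# terraform.tfvars: illustrative variable names and placeholder values
agent_id           = "00000000-0000-0000-0000-000000000000" # Agent ID from the wizard
organization_id    = "00000000-0000-0000-0000-000000000000" # Organization ID from the wizard
websocket_endpoint = "wss://your-region.example.matillion.com" # WebSocket endpoint from the wizard
```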

Step 3: Set Up & Configure the Matillion CDC Agent

To begin utilising the Matillion CDC agent in your environment, you must set up the CDC agent container. As it is designed to operate in a separate cloud environment, some configuration is required to ensure a smooth and secure integration with Azure. You’ll need the following resources:

  1. Resource Group: Create an Azure resource group to wrap all the resources related to the CDC agent.
  2. Service Principal: Register an app with access to any tenant in the organisation (AzureADMultipleOrgs) and enable ID token requests using the OAuth 2.0 implicit flow (id_token_issuance_enabled). The CDC agent will use this service principal to interact with Azure resources.
  3. Key Vault: Set up an Azure Key Vault to securely store the private key from the CDC agent container setup (generated in Step 2). Make sure you include the -----BEGIN PRIVATE KEY----- and -----END PRIVATE KEY----- tags.
  4. Networking: Configure a virtual network (VNet) and a subnet with a service delegation specifically for the container group, enabling it to manage the subnet, plus a NAT gateway to facilitate secure communication with external services outside the VNet.
  5. Storage Account and Container: Create an Azure Storage account and container to stage data during the capture process, and grant the “Storage Blob Data Contributor” role to the service principal.
  6. Container Group: Deploy a container group with a minimum configuration of 4 vCPUs and 8 GB RAM to host the public CDC agent image. Set the following environment variables to ensure proper connectivity and authentication with Azure and the CDC platform:
    • AZURE_CLIENT_ID: The service principal client ID from point #2.
    • AZURE_CLIENT_SECRET: The service principal client secret from point #2.
    • AZURE_SECRET_KEY_VAULT_URL: The Key Vault URI from point #3.
    • AZURE_TENANT_ID: The Azure account tenant ID.
    • ID_ORGANIZATION: The unique UUID that identifies your organisation to the CDC platform (created in Step 2).
    • ID_AGENT: The unique UUID that identifies your agent to the CDC platform, also created in Step 2.
    • PLATFORM_KEY_NAME: The name of the Key Vault secret that stores the private key (Step 2).
    • PLATFORM_KEY_PROVIDER: The “azure-key-vault” value.
    • SECRET_PROVIDERS: The “azure-key-vault:1” value.
    • PLATFORM_WEBSOCKET_ENDPOINT: The WebSocket endpoint for the CDC platform, provided in Step 2.
  7. Log Analytics Workspace: Create a workspace to monitor and debug the agent running in the container group, streamlining troubleshooting and providing insights into the agent’s performance.
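As a rough sketch of resource #6, a Terraform `azurerm_container_group` carrying the environment variables above might look like this. The resource names, variables, and image reference are illustrative placeholders; consult the Matillion documentation for the exact template:

```hcl
# Illustrative sketch only: resource names, variables, and the image reference are placeholders.
resource "azurerm_container_group" "cdc_agent" {
  name                = "cdc-agent"
  location            = azurerm_resource_group.cdc.location
  resource_group_name = azurerm_resource_group.cdc.name
  os_type             = "Linux"
  subnet_ids          = [azurerm_subnet.cdc.id] # delegated subnet from resource #4

  container {
    name   = "cdc-agent"
    image  = "<matillion-cdc-agent-image>" # public Matillion CDC agent image
    cpu    = 4 # minimum recommended vCPUs
    memory = 8 # minimum recommended RAM, in GB

    environment_variables = {
      AZURE_CLIENT_ID             = var.client_id
      AZURE_SECRET_KEY_VAULT_URL  = azurerm_key_vault.cdc.vault_uri
      AZURE_TENANT_ID             = var.tenant_id
      ID_ORGANIZATION             = var.organization_id
      ID_AGENT                    = var.agent_id
      PLATFORM_KEY_NAME           = var.platform_key_name
      PLATFORM_KEY_PROVIDER       = "azure-key-vault"
      SECRET_PROVIDERS            = "azure-key-vault:1"
      PLATFORM_WEBSOCKET_ENDPOINT = var.websocket_endpoint
    }

    secure_environment_variables = {
      AZURE_CLIENT_SECRET = var.client_secret # keep secrets out of plain env vars
    }
  }
}
```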

Step 4: Deploying the Matillion CDC Agent via Terraform

Now that the agent has been created in Matillion DPC and the required infrastructure is ready, it’s time to deploy the agent. While you can deploy the Terraform code manually via the Terraform CLI, doing so introduces the risk of human error and repetition. Instead, we recommend a CI/CD pipeline as a best practice, providing a secure and efficient method to automate the deployment process. By using a CI/CD pipeline along with a service principal, you avoid using elevated user permissions, enhancing security by eliminating the need for admin-level access to deploy resources. Furthermore, automating through CI/CD significantly reduces the chance of human error and accelerates deployment.

Start by creating an app registration with the Application.ReadWrite.OwnedBy API permission, allowing you to establish the necessary service principal. Additionally, assign the Contributor and Role Based Access Control Administrator roles at the subscription level to permit resource creation and access management; these permissions will be referenced in the Terraform azurerm provider configuration.

Please note that for this specific use case, two separate app registrations and their respective service principals are required: one to deploy Terraform, and the other to enable the CDC agent to connect with Azure. The latter service principal is created with Terraform itself, as mentioned in resource #2 of Step 3. However, the app registration that actually runs Terraform must be created manually, with the necessary permissions granted beforehand.

The Application.ReadWrite.OwnedBy permission is essential because it enables Terraform to create the second app registration (required by the CDC agent). After the manual setup of the primary service principal and the assignment of necessary roles, you can proceed with the automated deployment process.

The CI/CD steps are simply: 

  1. Install Terraform.
  2. Run “terraform init” to initialise the providers and the backend.
  3. Run “terraform plan” to plan the creation/modification/deletion of resources.
  4. Run “terraform apply” to deploy resources, only when merging to the main branch.

We used GitHub Actions to automatically deploy the Terraform code from our GitHub repository:
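A minimal workflow along those lines might look as follows. The triggers, action versions, and secret names are illustrative and should be adapted to your repository; the ARM_* variables are the standard authentication variables read by the Terraform azurerm provider:

```yaml
# Illustrative sketch: adapt triggers, versions, and secret names to your repository.
name: deploy-cdc-agent
on:
  push:
    branches: [main]
  pull_request:

jobs:
  terraform:
    runs-on: ubuntu-latest
    env:
      ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
      ARM_CLIENT_SECRET: ${{ secrets.ARM_CLIENT_SECRET }}
      ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}
      ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan
      # Apply only when merging to the main branch, as described above
      - run: terraform apply -auto-approve
        if: github.ref == 'refs/heads/main'
```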

Once Terraform has created the resources and they are up and running, you will see the new agent listed on the Agent Containers screen in Matillion DPC with a Connected status.

When its status changes to “Connected”, the “Add pipeline” button is enabled. The following steps explain how to create a new pipeline and assign this agent to it.

Step 5: Prepare the Source Database

Most CDC services require source database logs to be enabled in order to capture real-time changes, and Matillion CDC is no exception. Configuring this also requires a user with elevated permissions: simple read/write access isn’t enough.

What’s more, you’ll need to configure the source database server. For PostgreSQL, the WAL (Write-Ahead Log) settings should be configured to meet CDC requirements: the key parameters wal_level, max_wal_senders, and max_replication_slots should be set according to the documentation. For on-demand snapshotting you’ll also have to configure a signal table for Matillion CDC. In our case, we didn’t need on-demand snapshotting, so we skipped these settings.
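On PostgreSQL, these settings can be applied with ALTER SYSTEM. The values and the role name below are illustrative, not Matillion's prescribed ones, and changing wal_level requires a server restart:

```sql
-- Illustrative values; check the Matillion documentation for the recommended settings.
ALTER SYSTEM SET wal_level = 'logical';      -- required for logical decoding
ALTER SYSTEM SET max_wal_senders = 4;        -- concurrent WAL sender processes
ALTER SYSTEM SET max_replication_slots = 4;  -- one slot per CDC pipeline, plus headroom

-- A dedicated user with replication privileges (name and password are placeholders):
CREATE ROLE cdc_agent WITH LOGIN REPLICATION PASSWORD '<password>';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_agent;
```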

Step 6: Prepare the Target Database

The final step in setting up the CDC agent involves preparing the target database – in our case, Snowflake. Once access to the Snowflake account with the necessary permissions has been confirmed (refer to the documentation for more detail), you will need to create a STAGE using TYPE=AVRO:
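For example, an internal stage with the required file format can be created like this (the object names are placeholders):

```sql
-- Placeholder object names; TYPE = AVRO is the important part.
CREATE STAGE IF NOT EXISTS cdc_db.cdc_schema.cdc_stage
  FILE_FORMAT = (TYPE = AVRO);
```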

A regular stage is created with CSV, which may lead to errors during setup; specifying TYPE=AVRO ensures compatibility for CDC data ingestion. For additional information on configuring Snowflake stages, please refer to the Snowflake documentation.

Step 7: Create the Data Loader Pipeline

Once the agent setup has been completed successfully, and the source and target databases configured, you can create the pipeline to manage the data load. In the Matillion CDC console, use the standard wizard to:

  • Select the corresponding source database: Choose PostgreSQL as your source database.
  • Select the Data Ingestion mode:

Matillion offers two modes: CDC and Batch Loading. While batch loading doesn’t require a continuously running agent (as it doesn’t capture real-time changes), CDC does. For batch loading, you only need to specify the connection settings and set the frequency of data ingestion. However, for our use case, we’ll select CDC to capture ongoing changes in real time.

  • Choose the Agent: This step is optional if you have already clicked on the Add pipeline button on the Agent Containers page.
  • Set the Source:
    • Define the server details, user credentials, and database info to replicate.
    • Define the schemas and tables you want to monitor for data changes.
    • Store the password securely in Azure Key Vault and define the secret name there.
  • Choose the Schemas to track for changes.
  • Choose the Tables to track for changes.
  • Set up the Destination:
    • Select the cloud provider; we’ll be using Snowflake for this example.
    • Provide the Snowflake account details.
    • Define the password in Azure Key Vault and input the corresponding secret name there.
    • Configure additional details, including the Role, Warehouse, Database, Stage, and Table that the agent should use; add prefixes for tables created during data capture.
  • Configure the Pipeline: Name your new pipeline, define the snapshot settings and transformation types, and select a strategy for date and time processing.
  • Confirm the Summary: Before finalising, review the summary.

Once the pipeline has been created, it progresses through various phases and should stabilise with a Streaming status. You can monitor the status on the Matillion Data Loader > Pipelines page:

Real-time Data Replication 

To test Matillion CDC’s real-time capabilities, we’ll run a proof of concept by applying three different DML statements (INSERT, UPDATE, and DELETE) and observing how accurately and quickly Matillion CDC replicates these changes. Before running any SQL statements, let’s examine the source and target tables:

PostgreSQL:

Snowflake:

Next, we’ll apply the following DML statements to the source table in PostgreSQL:

  • UPDATE the customer name and product name in OrderID 7.
  • DELETE OrderID 8.
  • INSERT a new row with OrderID 9.
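Assuming a hypothetical orders table, the three statements might look like this (table and column names are illustrative):

```sql
-- Hypothetical table and column names for illustration.
UPDATE orders SET customer_name = 'Jane Doe', product_name = 'Laptop Pro' WHERE order_id = 7;
DELETE FROM orders WHERE order_id = 8;
INSERT INTO orders (order_id, customer_name, product_name) VALUES (9, 'John Smith', 'Monitor');
```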

On checking the target DWH table in Snowflake, we can confirm that all changes have been captured instantaneously and accurately:

Returning to the Matillion CDC service UI, we can see these updates reflected on the dashboard:

Conclusion

Matillion Change Data Capture plays a critical role in ensuring that data is always up to date across systems, enabling businesses to leverage real-time data for faster and better-informed decision-making. In this article, we’ve explored how to configure Matillion CDC using Azure and Terraform, focusing on replicating changes from a PostgreSQL database to Snowflake. With CDC, businesses avoid direct access to transactional systems, reduce system strain, and ensure that BI platforms are always operating with the most current data.

After conducting further tests, we’ve shown that the tool is not only reliable, but fast and easy to use as well. Even when handling larger source tables with thousands or millions of changes to be captured, performance remained impressive. We observed rapid response times and seamless synchronisation, even under the pressure of high-volume data changes, reinforcing Matillion CDC’s role as a robust solution for real-time data replication needs.

As Matillion continues to consolidate its cloud SaaS offering, with CDC evolving into an even more powerful streaming service, exciting updates are on the horizon. Stay tuned for more developments! Here at ClearPeaks, our team of experts is dedicated to crafting solutions that align perfectly with your business needs. With our extensive and proven experience in the latest data technologies, we can guarantee that your data remains accurate, reliable, and primed to drive smart decision-making. If you’re looking to optimise your data management strategy, don’t hesitate to reach out – our specialist consultants are here to help!