Implementing Change Data Capture (CDC) in Snowflake



A data warehouse is a relational database designed more for queries, data analysis, decision-making and Business Intelligence-type activities than for processing transactions or other traditional database uses.

The information stored in a data warehouse is historical and provides an overview of the different transactions that have taken place over time. Redundant data is often included in data warehouses to provide users with multiple views of information.

There are many data warehouse vendors in the market, but Snowflake is a modern data warehouse that has become an industry-leading cloud-based SaaS (Software-as-a-Service) data platform.

Nowadays, data frequently changes in operational systems, but the data warehouse often doesn’t mirror this real-time updating, as the process of loading up-to-date information occurs at specified intervals rather than instantaneously.

Traditionally, businesses used batch-based approaches to move data once or several times a day, but this leads to latency and reduces the operational value of the data to the organisation.

Change Data Capture (CDC) has emerged as an ideal solution for near real-time data movement from relational databases (like SQL Server or Oracle) to data warehouses, data lakes or other databases, and can also work as batch processing (depending on how the job scheduling is configured).

In this article we will talk about Snowflake CDC and see how it keeps you updated with current data. But before getting started, let’s take a quick look at Snowflake and its most salient features.

What is Snowflake?

When it launched in October 2014, Snowflake was the first data warehouse solution designed to be delivered in the cloud, and it was called Snowflake Elastic Data Warehouse.

The idea was to offer users a solution, in the cloud, which would bring together all the data and task processes in a single data warehouse, whilst guaranteeing good performance in data processing, flexibility in storage, and ease of use.

Snowflake started with the following value propositions:

  • Data warehousing as a service: Thanks to the cloud, Snowflake avoids typical infrastructure administration and database management problems. As with DBaaS, users can focus on processing and storing data, and by getting rid of the physical infrastructure, costs go down; and of course, it can always be adapted to the customer’s needs.
  • Multidimensional elasticity: Unlike other products on the market at the time, Snowflake could scale up storage space and computing power independently for each user, making it possible to load data while running requests without having to sacrifice performance, as resources are dynamically allocated according to the needs of the moment.
  • Single storage destination for all data: Snowflake allows all the company’s structured and semi-structured data to be stored centrally, so those wishing to work with this data can access it in a single system, without the need for prior processing before they start their analytical tasks.

Snowflake’s Unique Architecture

A Hybrid Architecture of Shared Disk and Shared Nothing

Snowflake makes large-scale data storage and analysis possible thanks to its innovative architecture. Exclusively a cloud product, Snowflake is based on virtual computing instances, such as AWS Elastic Compute Cloud (EC2) for calculation and analysis operations, in addition to a storage service like AWS Simple Storage Service (S3) to store the data.

As with any database, a Snowflake cluster has storage resources (or disk memory), RAM, and CPU computing power. Snowflake is based on a hybrid architecture, blending a shared-disk model with a shared-nothing architecture.

On the one hand, all the data stored in the Snowflake data warehouse is consolidated into a single central repository, as in shared-disk architectures, and is accessible to all the compute nodes in the cluster.

On the other hand, queries made on Snowflake are processed by MPP (Massively Parallel Processing) compute clusters, in which each node stores only a portion of the data present in the data warehouse.

Based on these two approaches, Snowflake offers the simplicity of data management that comes with a centralised data space, combined with the performance of a shared-nothing architecture for queries on the data that the warehouse contains.

Snowflake’s Three Layers

The Snowflake data warehouse comprises three layers:

  • Data storage
  • Request processing
  • Cloud services

Figure 1: The Snowflake data warehouse’s three layers: cloud services, request processing and data storage.

When data is loaded into your Snowflake warehouse, it is compressed, reorganised into a columnar format, and enriched with metadata and statistics. Raw data will no longer be accessible directly, but only via queries (SQL, R or Python) made through Snowflake.

Snowflake also incorporates a processing layer to handle queries on the data. Data queries are executed on virtual warehouses (or virtual data warehouses). Each virtual warehouse is an MPP cluster operating on a shared-nothing architecture, with multiple nodes, each storing a specific portion of the data warehouse’s overall data.

Each virtual warehouse can process a multitude of simultaneous requests, and the computing cluster can be scaled up or down depending on the workload. Importantly, the different virtual warehouses do not share any resources, be it computing power, memory, or storage: this design prevents resource conflicts and mitigates competition among requests for the same data.

Finally, cloud services form the top layer of the Snowflake infrastructure, facilitating seamless coordination across the data warehouse framework. These services allow users to authenticate, launch or optimise data queries, administer clusters, and leverage many other features.

Data Protection in Snowflake

Snowflake ensures the integrity of its hosted data through two key functionalities: Time Travel and Fail-safe.

Time Travel enables the preservation of data integrity by maintaining its state for the entire specified period, even after modifications have been made. Limited to a single day of history in the standard version, Time Travel can be configured for up to 90 days with the Snowflake Enterprise licence, and allows the restoration of previous states for tables, schemas, or entire databases.
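As an illustration, and assuming a hypothetical table called my_table, Time Travel is typically configured and used with statements along these lines:

ALTER TABLE my_table SET DATA_RETENTION_TIME_IN_DAYS = 90;  -- retention beyond 1 day requires Enterprise edition or above

SELECT * FROM my_table AT (OFFSET => -60*60);               -- query the table as it was one hour ago

UNDROP TABLE my_table;                                      -- restore a dropped table while still within the retention period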

The Fail-safe feature offers a 7-day backup once the Time Travel period has ended, in order to recover data that might have been corrupted by errors during operations.

Both features help to maintain data integrity, and both contribute to the billed storage capacity of the Snowflake cluster.

What is Change Data Capture (CDC)?

graphic showing how Change Data Capture works

CDC is an ideal solution for capturing near real-time data movement in databases. It encompasses a set of software design patterns used to detect and track data changes. It triggers an event connected with the data, causing a specific action to be executed for each CDC occurrence. In the context of data operations, the acquisition of real-time data streams is imperative for the efficient execution of analytics by any organisation, and CDC provides near real-time movement of data, processing it as soon as new database events occur.

Events are captured and streamed in real time using CDC, helping to achieve reliable, scalable, and low-latency data replication within high-velocity data environments. It eliminates the need for bulk data loading by implementing an incremental data loading procedure, so data warehouses or databases remain active in order to execute specific actions as soon as a CDC event occurs. What’s more, companies can send fresh data updates to BI tools and team members almost instantly with CDC!

What is Snowflake CDC and How Does It Work?

In today’s data-driven economy the information in your systems changes frequently, and it would be a complex task to load it fully into Snowflake every time it does, using up resources and money. This is where Snowflake Change Data Capture comes into play: you can implement CDC in Snowflake effectively with just a few commands, and this is made possible by the concept of streams in Snowflake.

A Snowflake stream object essentially tracks all DML changes made to rows in a source table and stores the metadata of each change. This metadata, describing the changes that occur between two transactional points in time on the table, is used later to retrieve the changed data.

What is a Snowflake Stream?

A stream in Snowflake, also known as a table stream, is an object that records DML changes made to a source object. It uses the metadata associated with those changes to enable operations on the changed data. A stream provides the set of changes from its current offset to the present version of the table. When queried, a stream returns the historic data with the structure and nomenclature of the source object, alongside additional columns that provide more information about the type of change.

Let’s look at the additional columns returned as part of the result of a query to a stream.

METADATA$ACTION: The value of this column indicates the kind of DML operation that was recorded; the values are INSERT or DELETE. Note that an UPDATE statement is represented by two row entries, one for INSERT and one for DELETE.

METADATA$ISUPDATE: This column indicates whether the row entry was part of an UPDATE statement. The value is TRUE for entries that were part of an UPDATE, and FALSE otherwise.

METADATA$ROW_ID: This specifies the unique and immutable ID for a row and can be used to track changes to a particular row over time.
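As a quick illustration (the stream name and the id column are hypothetical), these metadata columns can be selected directly when querying a stream:

SELECT
    id,                  -- a column from the source table (hypothetical)
    METADATA$ACTION,     -- INSERT or DELETE
    METADATA$ISUPDATE,   -- TRUE when the entry belongs to an UPDATE
    METADATA$ROW_ID      -- immutable identifier for the row
FROM my_stream;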

What are the Types of Streams Available in Snowflake?

There are three stream types in Snowflake: standard, append-only, and insert-only; each serves a different purpose depending on your use case.

Standard: The standard stream type is supported for streams on tables, directory tables, and views. It monitors all DML changes in the source object (inserts, updates, and deletes), and also tracks table truncations. This stream type returns the net change in a row and does this by performing a join on the inserted and deleted rows within a change set. What this means, for example, is that if a row is inserted and then deleted between two transactional time points, a standard stream will not return it as part of the query. Because of this, a standard stream is also called a “delta stream”, as it returns the net outcome of the transactions executed on the data.

CREATE OR REPLACE STREAM my_stream ON TABLE my_table;

Append-only: Append-only streams exclusively track row inserts. Updates and deletes, including table truncations, are ignored by an append-only stream. For example, if 5 rows are inserted into a source object and 2 rows are then deleted, a query to an append-only stream will return all 5 rows. Append-only streams are more performant than standard streams because they only track inserts and therefore consume fewer resources. The append-only type is applicable to streams on standard tables, directory tables, and views.

CREATE OR REPLACE STREAM my_stream ON TABLE my_table
APPEND_ONLY = TRUE;

Insert-only: Insert-only streams track row inserts only, like append-only streams; however, they are only supported on external tables. Files that are referenced by external tables in cloud storage locations are overwritten if an old file is deleted and replaced by a new one. Please note that the automatic refresh of external table metadata may not occur in all cases.

CREATE OR REPLACE STREAM my_stream ON EXTERNAL TABLE my_table
INSERT_ONLY = TRUE;

What are Snowflake Tasks?

A task is also an object type in the Snowflake environment; it defines a recurring schedule for activities. Tasks are recommended for executing SQL statements, including statements that call stored procedures. Moreover, tasks can be chained together and managed continuously and concurrently, which is considered best practice for more complex, periodic processing.

Data pipelines are generally continuous, so tasks are typically combined with streams, which offer an enhanced approach to the continuous processing of new or modified data. Moreover, a task can verify whether a stream contains changed data for a table and simply skip the current run when there is nothing to process, which is why tasks are so widely used to trigger stream-based processing in Snowflake.
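As a minimal sketch of this pattern (the warehouse, stream, target table and column names are placeholders), a task can be conditioned on a stream containing data by using the SYSTEM$STREAM_HAS_DATA function:

CREATE OR REPLACE TASK load_changes_task
    WAREHOUSE = my_wh                          -- hypothetical virtual warehouse
    SCHEDULE  = '5 MINUTE'                     -- illustrative schedule
    WHEN SYSTEM$STREAM_HAS_DATA('MY_STREAM')   -- skip the scheduled run if the stream holds no changes
AS
    INSERT INTO my_target (id, name)
    SELECT id, name FROM my_stream;            -- hypothetical columns copied from the stream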

Data Streaming via Snowflake Connector for Kafka and via Snowpipe, to process the streamed data into the database

How to Set Up Snowflake CDC with Streams

Let’s see how streams work through an example: consider a table named EMPLOYEES_RAW where the raw data is staged, and which ultimately needs to be loaded into the EMPLOYEES table.

Firstly, you’ll need to create a database in Snowflake; you can do so using the following commands:

create a database in Snowflake
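The original screenshot is not reproduced here, but the commands would look something like the following (the database name is illustrative):

CREATE OR REPLACE DATABASE employees_db;
USE DATABASE employees_db;
USE SCHEMA public;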

The table named EMPLOYEES_RAW is created and three records are inserted:

The table named EMPLOYEES_RAW is created and three records are inserted
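A possible equivalent of the commands shown in the screenshot; the column definitions and sample values are invented for illustration, but the IDs line up with those referenced later in the walkthrough:

CREATE OR REPLACE TABLE employees_raw (
    id      NUMBER,
    name    STRING,
    salary  NUMBER
);

INSERT INTO employees_raw VALUES
    (101, 'Alice', 50000),
    (102, 'Bob',   60000),
    (103, 'Carol', 70000);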

Now we’ll create the EMPLOYEES table, and for the first load we’ll copy the data from EMPLOYEES_RAW directly using an INSERT statement:

create the EMPLOYEES table, and copy data from EMPLOYEES_RAW
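Again as a sketch of what the screenshot likely shows, using the hypothetical columns defined above:

CREATE OR REPLACE TABLE employees LIKE employees_raw;

INSERT INTO employees
SELECT * FROM employees_raw;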

The data in the EMPLOYEES_RAW and EMPLOYEES tables is now in sync. Let’s make a few changes to the data in the raw table and track these changes through a stream:

Create MY_Stream for table EMPLOYEES_RAW
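The stream itself is created with a single statement, following the same syntax introduced earlier:

CREATE OR REPLACE STREAM my_stream ON TABLE employees_raw;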

At the outset, when querying the stream, it returns no records, as no DML operations have been performed on the raw table yet:

Result for Select from MY_STREAM
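Querying the stream at this point simply produces an empty result set:

SELECT * FROM my_stream;   -- returns zero rows until DML is performed on employees_raw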

Let’s insert two records and update two records in the raw table, and then verify the contents of the stream called MY_STREAM:

insert two records and update two records in the raw table. View results via select.
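A sketch of the DML that the screenshot likely captures, reusing the hypothetical columns from above (IDs 104 and 105 are inserted, 102 and 103 are updated):

INSERT INTO employees_raw VALUES
    (104, 'Dave', 55000),
    (105, 'Erin', 65000);

UPDATE employees_raw SET salary = salary + 1000 WHERE id IN (102, 103);

SELECT * FROM my_stream;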

In the stream we can observe that:

  • Employee records with IDs 104 and 105 have been inserted.
    The METADATA$ACTION for these records is set to INSERT and METADATA$ISUPDATE is set to FALSE.
  • The employee records with IDs 102 and 103, which have been updated, each have two entries: one with METADATA$ACTION set to INSERT and the other set to DELETE.
    The field METADATA$ISUPDATE is set to TRUE for both records, indicating that they are part of an UPDATE operation.

Before consuming the data from the stream, let’s perform another DML operation on the already modified data and see how the stream updates its data:

Delete row with ID 102 from EMPLOYEE_RAW and show results in MY_STREAM.
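In terms of the hypothetical schema used here, this step corresponds to:

DELETE FROM employees_raw WHERE id = 102;

SELECT * FROM my_stream;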

Note that streams record the differences between two offsets. If a row is updated and then deleted in the current offset, the delta change is a deleted row.

The same can be observed for the employee record with ID = 102, where the data is first updated and then the row is deleted. The stream only captures the delta change, which is a deletion.

In the above example, although we have inserted two records, updated two records and deleted one record, the final changes required to be captured from the raw table are two inserts, one update and one deletion.

The following SELECT statements on the stream give the details of the records which need to be inserted, updated and deleted in the target table:

MY_STREAM entries corresponding to Insert action
MY_STREAM entries corresponding to Update action
MY_STREAM entries corresponding to Delete action
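The exact statements in the screenshots are not reproduced, but filters along the following lines isolate each kind of change:

-- Rows to insert into the target (brand-new rows):
SELECT * FROM my_stream
WHERE METADATA$ACTION = 'INSERT' AND METADATA$ISUPDATE = FALSE;

-- Rows to update in the target (new values of updated rows):
SELECT * FROM my_stream
WHERE METADATA$ACTION = 'INSERT' AND METADATA$ISUPDATE = TRUE;

-- Rows to delete from the target:
SELECT * FROM my_stream
WHERE METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE = FALSE;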

Finally, we can use a MERGE statement with the stream, using these filters, to perform the insert, update and delete operations on the target table, as shown below:

Performing insert, update and delete by using a MERGE statement
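A possible formulation of such a MERGE, using the hypothetical schema from this example, is shown below; it is a sketch rather than a reproduction of the statement in the screenshot. The subquery removes the DELETE half of each update pair so that every target row matches at most one stream row:

MERGE INTO employees AS tgt
USING (
    SELECT * FROM my_stream
    WHERE NOT (METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE = TRUE)
) AS src
    ON tgt.id = src.id
WHEN MATCHED AND src.METADATA$ACTION = 'DELETE'
    THEN DELETE
WHEN MATCHED AND src.METADATA$ACTION = 'INSERT' AND src.METADATA$ISUPDATE = TRUE
    THEN UPDATE SET tgt.name = src.name, tgt.salary = src.salary
WHEN NOT MATCHED AND src.METADATA$ACTION = 'INSERT'
    THEN INSERT (id, name, salary) VALUES (src.id, src.name, src.salary);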

The image below shows that the merge operation inserted two records, updated one and deleted another, as expected.

merge operation inserted two records, updated one and deleted another

Now that we have consumed the stream in a DML transaction, it no longer returns any records and has advanced to a new offset. So, if you need to consume the same changes in multiple downstream systems, the solution is to create multiple streams on the table, one for each consumer.

Stream Staleness

A stream turns stale when its offset falls outside the data retention period for its source table. When a stream becomes stale, access to the historical data for the source table is lost, and this includes any unconsumed change records.

To view the current staleness status of a stream, execute the DESCRIBE STREAM or SHOW STREAMS command. The STALE_AFTER column timestamp indicates when the stream is currently predicted to become stale:

The STALE_AFTER column timestamp indicates when the stream is currently predicted to become stale
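For the stream used in this example, that would be:

DESCRIBE STREAM my_stream;

SHOW STREAMS LIKE 'MY_STREAM';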

To avoid streams going stale, it is highly recommended to consume the changed data regularly, before its STALE_AFTER timestamp.

If the data retention period for a table is less than 14 days, and a stream has not been consumed, Snowflake temporarily extends this period to prevent it from going stale. The period is extended to the stream’s offset, with a default maximum of 14 days, regardless of the Snowflake edition associated with your account.

How to Schedule a Snowflake Stream with Tasks

Using Snowflake tasks, you can schedule a MERGE statement and run it as a recurring operation. In this section, using the same example as before, we will execute the MERGE command using a task.

Run the SQL command given below to create a task:

SQL command to create a task for the MERGE command
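A sketch of such a task, reusing the hypothetical warehouse name and the MERGE statement from earlier (the task name and schedule are illustrative):

CREATE OR REPLACE TASK employees_merge_task
    WAREHOUSE = my_wh                          -- hypothetical virtual warehouse
    SCHEDULE  = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('MY_STREAM')   -- only run when the stream has unconsumed changes
AS
    MERGE INTO employees AS tgt
    USING (
        SELECT * FROM my_stream
        WHERE NOT (METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE = TRUE)
    ) AS src
        ON tgt.id = src.id
    WHEN MATCHED AND src.METADATA$ACTION = 'DELETE'
        THEN DELETE
    WHEN MATCHED AND src.METADATA$ACTION = 'INSERT' AND src.METADATA$ISUPDATE = TRUE
        THEN UPDATE SET tgt.name = src.name, tgt.salary = src.salary
    WHEN NOT MATCHED AND src.METADATA$ACTION = 'INSERT'
        THEN INSERT (id, name, salary) VALUES (src.id, src.name, src.salary);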

Check the status of the task using this command:

show tasks command
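Assuming the task name used in the sketch above:

SHOW TASKS LIKE 'EMPLOYEES_MERGE_TASK';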

By default, the task remains suspended after creation, as we can see below:

the task remains suspended after creation

To start the task, run the following command:

After starting the task, it is shown as started in the task list.
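With the hypothetical task name used here, that is:

ALTER TASK employees_merge_task RESUME;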

Now insert the CDC data into EMPLOYEES_RAW; this dataset will subsequently appear in the stream:

insert the CDC data into EMPLOYEE_RAW
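For instance, with the illustrative schema used throughout this example:

INSERT INTO employees_raw VALUES (106, 'Frank', 72000);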

The query below will show you when the task will next run:

SQL command to show when the task will run next
table showing results after running the task
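One way to see this (not necessarily the query in the screenshot) is to use the TASK_HISTORY table function, which exposes a NEXT_SCHEDULED_TIME column:

SELECT name, state, scheduled_time, next_scheduled_time
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'EMPLOYEES_MERGE_TASK'))
ORDER BY scheduled_time DESC;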

Conclusion

Snowflake Change Data Capture has largely replaced the traditional methods of implementing CDC. Although Snowflake is already an industry-leading cloud-based data platform known for its speed and flexible warehousing options, Snowflake CDC significantly enhances its value and utility.

Snowflake CDC really shines in cases where millions of records undergo transactions on a daily basis, but you only want to update the modified ones. Doing a full load will eat up your resources, so harness the power of Snowflake CDC and use the MERGE command to update the destination.

If you’d like further information or some assistance in leveraging Snowflake CDC, or if you have any other inquiries about cloud-based data platforms, don’t hesitate to get in touch with us! Our dedicated team of certified experts is ready to address your questions and provide any support that you might need.