
Change Data Capture using Snowflake Dynamic Tables


Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in near real-time to a downstream process or system. Dynamic tables in Snowflake can implement CDC, leveraging their flexibility and near real-time capabilities to capture and process data changes seamlessly.

Snowflake Dynamic Tables

When we talk about dynamic tables in Snowflake, we mean tables whose contents are defined by a query: Snowflake materializes the query's results and keeps them up to date automatically, whether changes arrive programmatically or in response to runtime conditions. Because Snowflake supports dynamic SQL and handles schema changes and data transformations well, it is strongly tied to the idea of dynamic tables.

Dynamic tables in Snowflake provide a flexible way to manage schema evolution and adapt to changing data structures without extensive schema modifications. This flexibility is important for implementing CDC because it lets tables evolve with the needs of the business while still capturing and processing data efficiently.

Dynamic tables can join and aggregate across multiple source objects and incrementally update their results as the sources change. TARGET_LAG specifies how far the dynamic table's contents are allowed to lag behind the base tables that form its foundation. Snowflake not only optimizes queries automatically, reducing the need for manual tuning, but also keeps compute resources and storage separate, so compute can scale to process data without affecting the scalability of storage.

With the help of all these characteristics, businesses can adopt a “Zero ETL” strategy.
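
As a minimal sketch of what this looks like in practice, a single dynamic table can maintain a joined, aggregated result that stays at most five minutes behind its sources. The source table names below are hypothetical, and demo_wh is assumed to be an existing warehouse:

CREATE OR REPLACE DYNAMIC TABLE customer_order_summary
TARGET_LAG = '5 minutes'
WAREHOUSE = demo_wh
AS
SELECT c.c_custkey,
       c.c_name,
       COUNT(o.o_orderkey) AS order_count,   -- aggregated across each customer's orders
       SUM(o.o_totalprice) AS total_spend
FROM customers c
JOIN orders_src o ON o.o_custkey = c.c_custkey
GROUP BY c.c_custkey, c.c_name;

Snowflake keeps order_count and total_spend current incrementally as rows change in either source table.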

Key Features of Dynamic Tables for CDC

Schema Evolution:

Schema evolution is vital to preserving continuity and data integrity in CDC scenarios. In Snowflake dynamic tables, schema evolution makes it easy to upgrade and change schemas quickly, enabling agile development and data modeling. It also reduces error-prone manual schema updates, which helps make better use of resources.
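
As a hedged illustration, with hypothetical table and column names: when a new column appears in the source, the dynamic table can simply be redefined in place, and Snowflake rebuilds the materialized results and resumes refreshes without any manual migration of a target table:

-- A new column appears in the source table (names are hypothetical)
ALTER TABLE events_raw ADD COLUMN channel VARCHAR;

-- Redefine the dynamic table to pick up the new column
CREATE OR REPLACE DYNAMIC TABLE events
TARGET_LAG = '1 minute'
WAREHOUSE = demo_wh
AS SELECT * FROM events_raw;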

Real-time Data Capture:

Historically, streaming data has required keeping streaming and batch architectures separate, which means running two systems at once, adding operational overhead and more opportunities for failure. Stitching batch and streaming data together across separate pipelines also adds latency and complexity.

Dynamic tables let users build stream processing pipelines in ordinary SQL, without needing to know Spark, Flink, or other streaming systems. They also eliminate the extra logic usually required for incremental updates by applying incremental updates automatically to both batch and streaming data.
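
A minimal sketch, assuming a landing table that is continuously loaded (for example, by Snowpipe); the table and column names are hypothetical. The same plain SQL serves both batch and streaming arrivals, and Snowflake maintains the aggregation incrementally:

CREATE OR REPLACE DYNAMIC TABLE hourly_click_counts
TARGET_LAG = '1 minute'
WAREHOUSE = demo_wh
AS
SELECT user_id,
       DATE_TRUNC('hour', event_ts) AS event_hour,
       COUNT(*) AS events   -- maintained incrementally as new rows land
FROM clickstream_landing
GROUP BY user_id, DATE_TRUNC('hour', event_ts);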

Data Vault:

Data vault modeling is a hybrid approach that combines traditional relational data warehouse models with newer big data architectures to build a data warehouse for enterprise-scale analytics. Dynamic tables function as materialized views, adapting dynamically to the data they support, making them an ideal fit for the information mart layer in a Data Vault Architecture.
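
For example, an information mart can be expressed as a dynamic table over Data Vault hub and satellite tables. This is a sketch with hypothetical names, assuming a standard hub/satellite layout with a load_end_date column on the satellite:

CREATE OR REPLACE DYNAMIC TABLE customer_mart
TARGET_LAG = '10 minutes'
WAREHOUSE = demo_wh
AS
SELECT h.customer_hk,
       s.customer_name,
       s.customer_segment
FROM hub_customer h
JOIN sat_customer s
  ON s.customer_hk = h.customer_hk
WHERE s.load_end_date IS NULL;   -- only the current satellite record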

Implementing CDC using Dynamic Tables

A dynamic table lets us specify a query whose results are materialized. It tracks changes made to the data that the query references and incrementally updates the materialized results, offering a declarative approach to data transformation.

One key feature that differentiates dynamic tables from streams and tasks is that they eliminate the additional step of identifying and merging changes from the base table; that entire process is performed automatically within the dynamic table.

So, with this feature, there is no longer a need to write code to transform and update the data in a separate target table. Dynamic tables support both incremental changes and Slowly Changing Dimensions (SCD) with row versioning.
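
For contrast, here is a hedged sketch of the stream-and-task pattern that a dynamic table replaces; the stream, task, and target table names are hypothetical. A stream tracks the delta on the base table, and a scheduled task merges it into the target:

-- Stream captures row-level changes on the base table
CREATE OR REPLACE STREAM orders_stream ON TABLE orders_raw;

-- Scheduled task merges the captured changes into a separate target table
-- (the task must still be started with ALTER TASK ... RESUME)
CREATE OR REPLACE TASK merge_orders
  WAREHOUSE = demo_wh
  SCHEDULE = '1 minute'
AS
MERGE INTO orders_target t
USING orders_stream s
  ON t.o_orderkey = s.o_orderkey
WHEN MATCHED THEN UPDATE SET t.o_totalprice = s.o_totalprice
WHEN NOT MATCHED THEN INSERT (o_orderkey, o_totalprice)
  VALUES (s.o_orderkey, s.o_totalprice);

With a dynamic table, all of this collapses into a single CREATE DYNAMIC TABLE statement, as the steps below show.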

Implementation of incremental changes:

Step 1—Setup Warehouse, DB, Schema
USE ROLE accountadmin;

CREATE OR REPLACE WAREHOUSE demo_wh WAREHOUSE_SIZE = 'XSMALL';

USE WAREHOUSE demo_wh;

CREATE OR REPLACE DATABASE demo_db;

CREATE OR REPLACE SCHEMA demo_schema;

 

Step 2—Create raw table
CREATE OR REPLACE TABLE demo_db.demo_schema.orders_raw
AS SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS LIMIT 100;

 

Step 3—Create a dynamic table
CREATE OR REPLACE DYNAMIC TABLE demo_db.demo_schema.orders
TARGET_LAG = '1 minute'
WAREHOUSE = demo_wh
AS SELECT * FROM demo_db.demo_schema.orders_raw;

 

Step 4—Modify raw table
-- Inspect the current value in the dynamic table
SELECT * FROM demo_db.demo_schema.orders WHERE O_CUSTKEY = 106660;

-- Modify the raw (base) table; the dynamic table picks this change up automatically
UPDATE demo_db.demo_schema.orders_raw
SET O_TOTALPRICE = 135445.43
WHERE O_CUSTKEY = 106660;

 

Step 5—Check the CDC data reflection in the dynamic table
SELECT * FROM demo_db.demo_schema.orders WHERE O_CUSTKEY = 106660;
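
Because TARGET_LAG is set to '1 minute', the updated price can take up to a minute to appear in the dynamic table. Rather than waiting for the lag window, a refresh can also be triggered manually:

ALTER DYNAMIC TABLE demo_db.demo_schema.orders REFRESH;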

 

 

Implementation of SCD2 with row versioning:

Step 1—Create raw table
CREATE OR REPLACE TABLE demo_db.demo_schema.inventory_raw
AS
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.INVENTORY LIMIT 100;

 

Step 2—Create a stage table
CREATE OR REPLACE TABLE demo_db.demo_schema.inventory_stg
AS
SELECT * FROM demo_db.demo_schema.inventory_raw;

 

Step 3—Create a dynamic table
CREATE OR REPLACE DYNAMIC TABLE demo_db.demo_schema.inventory_main
TARGET_LAG = '1 minute'
WAREHOUSE = demo_wh
AS
SELECT
    INV_DATE_SK, INV_ITEM_SK, INV_WAREHOUSE_SK, INV_QUANTITY_ON_HAND,
    -- Rank each item's rows from newest to oldest inventory date
    ROW_NUMBER() OVER (PARTITION BY INV_ITEM_SK ORDER BY INV_DATE_SK DESC) rnm,
    -- The newest row version is the active record
    CASE WHEN rnm = 1 THEN 'Y' ELSE 'N' END ACTION_CD,
    -- Older row versions are closed out with an end date
    CASE WHEN rnm = 1 THEN NULL ELSE CURRENT_DATE() END LOAD_END_DATE
FROM demo_db.demo_schema.inventory_stg;

 

Step 4—Modify Stage Table
-- Inspect the existing row(s) for one item
SELECT * FROM demo_db.demo_schema.inventory_stg
WHERE INV_ITEM_SK = 386545;

-- Insert a new version of the row for that item
INSERT INTO demo_db.demo_schema.inventory_stg
VALUES (2451060, 386545, 2, 999);

SELECT * FROM demo_db.demo_schema.inventory_stg
WHERE INV_ITEM_SK = 386545;

Step 5—Refresh Dynamic Table
-- Trigger an immediate refresh instead of waiting for the target lag
ALTER DYNAMIC TABLE demo_db.demo_schema.inventory_main REFRESH;

 

Step 6—Check SCD reflection in the dynamic table
SELECT * FROM demo_db.demo_schema.inventory_main
WHERE INV_ITEM_SK = 386545;
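
Assuming the inserted INV_DATE_SK (2451060) is the most recent date for item 386545, the dynamic table should now show two row versions: the newly inserted row with ACTION_CD = 'Y' as the current record, and the earlier row flagged ACTION_CD = 'N' with its LOAD_END_DATE populated.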

 

Managing Dynamic Table Refreshes

Incremental refresh: 

When possible, the automated refresh process performs an incremental refresh: it analyzes the query for the dynamic table, computes the changes to the query results since the dynamic table was last refreshed, and merges those changes into the dynamic table.

Full refresh:

If the automated process is unable to determine how to perform an incremental refresh, it performs a full refresh: it executes the query for the dynamic table and materializes the results, completely replacing the dynamic table's current materialized results.
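
To check which refresh behavior Snowflake has been applying to a dynamic table, the refresh history can be inspected. A sketch using the INFORMATION_SCHEMA table function DYNAMIC_TABLE_REFRESH_HISTORY; the exact columns returned may vary by Snowflake version:

-- Recent refreshes, newest first; the refresh action column indicates
-- whether each refresh was incremental or full
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY())
ORDER BY refresh_start_time DESC;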

Conclusion

Dynamic tables solve a wide range of use cases and improve operational efficiency, availability, and CDC management. The goal of this blog was to provide an overview of dynamic tables along with several use case examples.

Vishali Sakthivel
