Building Real-Time Analytics: Streaming CDC from MySQL to Delta Lake with AWS Tools

In modern data-driven applications, businesses need a way to track and process database changes in real time. Traditional batch processing methods are often inefficient and introduce latency. Change Data Capture (CDC) provides a solution by capturing incremental changes in a database and streaming them for further processing. This setup enables real-time analytics, data synchronization across multiple systems, and efficient event-driven architectures. 

In this guide, we’ll walk through the steps to set up CDC for a MySQL database hosted on AWS RDS, use AWS Database Migration Service (DMS) to stream changes to Amazon Kinesis, process the data with AWS Glue, and analyze it using Amazon Athena. 

Kappa Architecture

Prerequisites 

Before diving into the setup, ensure you have: 

  • An AWS account. 
  • A MySQL RDS instance with appropriate permissions. 
  • Amazon Kinesis, AWS DMS, and AWS Glue access. 
  • A Delta Lake-compatible storage location (e.g., S3). 

Step 1: Enable Binary Logging on RDS 

To capture database changes in real time, we need to enable binary logging on MySQL RDS. Binary logs record all changes to the database and make them available for replication. Without binary logging, AWS DMS cannot capture and stream changes efficiently. 

Before we begin streaming changes, we must configure the RDS instance to log all modifications at the row level. This ensures that every update, insert, or delete operation is accurately captured. 

Configure Binary Log Parameters: In the RDS parameter group, set the following parameters: 
  • binlog_format = ROW (mandatory) 
  • binlog_row_image = FULL (optional but recommended for full row data) 

Enabling binlog_row_image = FULL ensures that every event generated by CDC contains the row image both before and after the change. For example, if a row has the column table.a = 10 and an update sets table.a = 20, the binary log and the Kinesis event will contain table.a = 10 as the before image and table.a = 20 as the after image. 
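
If you prefer to script this step, the same parameters can also be set with boto3 — a minimal sketch, assuming a custom parameter group named cdc-mysql-params (a placeholder) in us-east-1: 

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Set the binlog parameters on the custom parameter group (name is a placeholder).
# "pending-reboot" applies them at the reboot performed in the next step.
rds.modify_db_parameter_group(
    DBParameterGroupName="cdc-mysql-params",
    Parameters=[
        {"ParameterName": "binlog_format", "ParameterValue": "ROW", "ApplyMethod": "pending-reboot"},
        {"ParameterName": "binlog_row_image", "ParameterValue": "FULL", "ApplyMethod": "pending-reboot"},
    ],
)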

Apply Parameter Group and Reboot:
  • Associate the parameter group with the RDS instance. 
  • Reboot the instance for changes to take effect. 
  • You can verify these parameters from the database using a MySQL client: 
SELECT @@binlog_row_image AS binlog_row_image;
+------------------+
| binlog_row_image |
+------------------+
| FULL             |
+------------------+
SELECT @@binlog_format AS binlog_format;
+---------------+
| binlog_format |
+---------------+
| ROW           |
+---------------+ 

Create a Database User for Replication: Run the following SQL commands to create a replication user and grant the necessary privileges: 

CREATE USER 'repl'@'%' IDENTIFIED BY 'slavepass';
GRANT REPLICATION SLAVE, REPLICATION CLIENT, SELECT ON *.* TO 'repl'@'%';

Set Binary Log Retention: Verify and configure binary log retention to ensure sufficient time for CDC processing: 

CALL mysql.rds_show_configuration;
CALL mysql.rds_set_configuration("binlog retention hours", 24);

Step 2: Create a Kinesis Stream 

With binary logging enabled, the next step is to create an Amazon Kinesis stream to capture and process the changes in real time. Kinesis acts as a highly scalable data pipeline that allows us to stream and analyze data efficiently. 

Based on your requirements, create a Kinesis Data Stream: 

  • Use on-demand mode for automatic scaling. 
  • Use provisioned mode for precise resource control.  

By integrating Kinesis, we ensure that every database change is captured and made available for downstream processing without delay. 
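
As a quick illustration, the stream can also be created with boto3 — a minimal sketch; the name cdc-poc-stream matches the stream referenced in the Glue script later, and us-east-1 is assumed as the region: 

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# On-demand mode scales automatically; switch to PROVISIONED with a ShardCount
# if you need precise control over throughput and cost.
kinesis.create_stream(
    StreamName="cdc-poc-stream",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)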

Step 3: AWS DMS Setup 

Now that we have our MySQL source and Kinesis stream ready, AWS Database Migration Service (DMS) is required to handle real-time replication. DMS continuously extracts changes from the MySQL database and streams them to Kinesis. 

To allow AWS DMS to connect to our data sources, we need to define endpoints: 

Create Endpoints 
  • Source Endpoint: Specify the MySQL engine and credentials (use AWS Secrets Manager for secure credentials storage). 

 

  • Target Endpoint: Specify Kinesis as the target and provide the ARN of the stream. Ensure the endpoint role has appropriate policies to access Kinesis. 
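
For readers who script their infrastructure, the two endpoints could be defined with boto3 roughly as follows — a sketch only; the endpoint identifiers, secret ARN, stream ARN, role ARNs, and account number are placeholders: 

import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Source endpoint: MySQL on RDS, credentials pulled from Secrets Manager (ARNs are placeholders).
dms.create_endpoint(
    EndpointIdentifier="mysql-source",
    EndpointType="source",
    EngineName="mysql",
    MySQLSettings={
        "SecretsManagerSecretId": "arn:aws:secretsmanager:us-east-1:111122223333:secret:rds-cdc",
        "SecretsManagerAccessRoleArn": "arn:aws:iam::111122223333:role/dms-secrets-access",
    },
)

# Target endpoint: the Kinesis stream, with change records written as JSON.
dms.create_endpoint(
    EndpointIdentifier="kinesis-target",
    EndpointType="target",
    EngineName="kinesis",
    KinesisSettings={
        "StreamArn": "arn:aws:kinesis:us-east-1:111122223333:stream/cdc-poc-stream",
        "MessageFormat": "json",
        "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-kinesis-access",
    },
)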

 

 

Step 4: Serverless Replication (Preferable for CDC) 

For a fully serverless setup, AWS DMS offers serverless replication with built-in CDC capabilities, simplifying infrastructure management while ensuring scalability. 

Prerequisites 

Create VPC endpoints for both the RDS source and Kinesis target. 

  • Example service names: 
  • For RDS: com.amazonaws.<region>.rds 
  • For Kinesis Data Streams: com.amazonaws.<region>.kinesis-streams 

Attach a security group that permits traffic within the VPC CIDR block, and allow that security group in the RDS instance’s security group as well. 
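
As an illustration, a VPC endpoint can be created with boto3 along these lines — a sketch; the VPC, subnet, and security group IDs are placeholders, and the service name follows the Kinesis Data Streams example above: 

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Interface endpoint for Kinesis Data Streams inside the DMS/RDS VPC
# (VPC, subnet, and security group IDs are placeholders).
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.kinesis-streams",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)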

Serverless Replication Instance Setup 
  • Select source and target endpoints. 
  • Choose replication type as CDC. 
  • Configure instance settings (e.g., capacity and multi-AZ). 
  • Create a new subnet group under DMS, attach it, and assign the endpoint’s security group. 
  • Verify that all components (RDS, Kinesis) are within the same VPC or connected. 
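
If you prefer to define the serverless replication programmatically, DMS also exposes a replication-config API in boto3 — a rough sketch under the assumption that the endpoints and subnet group above already exist; the identifiers, ARNs, capacity values, and schema/table names are placeholders: 

import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Serverless CDC replication; capacity is expressed in DMS capacity units (values are placeholders).
dms.create_replication_config(
    ReplicationConfigIdentifier="mysql-to-kinesis-serverless-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationType="cdc",
    ComputeConfig={
        "MinCapacityUnits": 1,
        "MaxCapacityUnits": 4,
        "MultiAZ": False,
        "ReplicationSubnetGroupId": "dms-cdc-subnet-group",
        "VpcSecurityGroupIds": ["sg-0123456789abcdef0"],
    },
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-employee",
            "object-locator": {"schema-name": "hr", "table-name": "employee"},
            "rule-action": "include",
        }]
    }),
)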

 

 

Optional: Provisioned Replication Instance 

While serverless replication is recommended, some use cases may require a manually provisioned replication instance. This method provides greater control over resources and performance tuning. 

Set Up a Replication Instance 
  • Choose an appropriate instance class (single or multi-AZ). 
  • Configure storage allocation based on your workload. 
  • Create a subnet group for DMS using the subnets in your VPC. 
  • Configure security groups to allow access to RDS and Kinesis. 
  • Test the connection between the source and target endpoints. 
Create a Migration Task 
  • Use the replication instance, source endpoint, and target endpoint created earlier. 
  • Choose the migration type “Migrate existing data and replicate ongoing changes”. 
  • Adjust task settings as needed or leave them as default. 
  • Specify the database and tables whose changes should be captured in the table mappings (see the sketch below). 
  • Disable premigration assessment. 
  • Enable automatic start and create the task. 

AWS DMS will begin streaming changes from RDS to Kinesis in real time. 
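
For the provisioned path, the table mappings and the task creation might look roughly like this in boto3 — a sketch; the schema and table names, ARNs, and account number are placeholders: 

import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Table mappings: capture changes for the employee table only
# (schema and table names are placeholders for this example).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-employee",
            "object-locator": {"schema-name": "hr", "table-name": "employee"},
            "rule-action": "include",
        }
    ]
}

# Full load plus ongoing replication ("migrate and replicate"); ARNs are placeholders.
dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-kinesis-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)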

Step 5: Process Data Using AWS Glue and Load It as a Delta Table 

With changes now flowing into Kinesis, efficiently processing and storing this real-time data is crucial for analytics and decision-making. 

AWS Glue Streaming provides a scalable, serverless solution to process streaming data from Kinesis in real time, enabling seamless ingestion, transformation, cleaning, and storage in a Delta Lake table. 

This ensures structured, queryable, and ACID-compliant data, optimized for real-time analytics. 

AWS Glue Setup 

Create a Glue Streaming Job: 
  • Use either Spark scripts or Visual ETL. 
  • Specify the Kinesis stream ARN as the source. 
Job Configuration: 
  • Choose Spark streaming with the required configurations. 
  • Provide a role for Glue with access to Kinesis, S3, and CloudWatch Logs. 
  • Use the Delta Lake connector for Glue (ensure the Delta Lake library is included in the job configuration). 
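
One way to satisfy the last point is the --datalake-formats job parameter, which tells Glue 4.0 to load the Delta Lake libraries. The boto3 sketch below is illustrative only; the job name, role, script location, and worker settings are placeholders: 

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Streaming job definition; role, script path, and sizing are placeholders.
glue.create_job(
    Name="cdc-delta-streaming-job",
    Role="GlueCdcJobRole",
    GlueVersion="4.0",
    Command={
        "Name": "gluestreaming",
        "ScriptLocation": "s3://<your-bucket>/scripts/cdc_to_delta.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--datalake-formats": "delta",   # pulls in the Delta Lake connector libraries
        "--job-language": "python",
    },
    WorkerType="G.1X",
    NumberOfWorkers=2,
)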

 

Prerequisite for AWS Glue (Initialize Delta Table if Necessary) 

Ensure the Delta table exists at the desired location for streaming data. If not, initialize it with the following logic: 

# Assumes `spark` and `delta_table_path` are defined as in the streaming job below.
from delta.tables import DeltaTable
from pyspark.sql.functions import lit, current_timestamp
from pyspark.sql.types import TimestampType


# Function to initialize the Delta table if it doesn't exist
def initialize_delta_table():
    if not DeltaTable.isDeltaTable(spark, delta_table_path):
        # Create a dummy placeholder row to initialize the table schema
        dummy_data = [(0, "FirstName", "LastName", "first.last@example.com", "2024-01-01", 0.0)]
        columns = ["employee_id", "first_name", "last_name", "email", "hire_date", "salary"]
        dummy_df = (
            spark.createDataFrame(dummy_data, columns)
            .withColumn("effective_start", lit(current_timestamp()).cast(TimestampType()))
            .withColumn("effective_end", lit(current_timestamp()).cast(TimestampType()))
            .withColumn("is_current", lit(False))
            .withColumn("is_deleted", lit(False))
        )

        # Write as a Delta table
        # (choose the partition columns based on your needs)
        dummy_df.write.format("delta").partitionBy("hire_date", "is_current").save(delta_table_path)


# Call the function to initialize the table before streaming starts
initialize_delta_table()

 

SCD Type 2 Implementation: Write Data to Delta Lake 

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from delta.tables import DeltaTable
import time
from pyspark.sql.functions import (
    col,
    from_json,
    struct,
    to_json,
    lit,
    current_timestamp,
    expr,
)
from pyspark.sql.types import (
    StructType,
    StructField,
    StringType,
    IntegerType,
    FloatType,
    TimestampType,
)


## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Define the schema for the incoming data
data_schema = StructType(
    [
        StructField(
            "data",
            StructType(
                [
                    StructField("employee_id", IntegerType(), True),
                    StructField("first_name", StringType(), True),
                    StructField("last_name", StringType(), True),
                    StructField("email", StringType(), True),
                    StructField("hire_date", StringType(), True),
                    StructField("salary", FloatType(), True),
                ]
            ),
            True,
        ),
        StructField(
            "metadata",
            StructType(
                [
                    StructField("record-type", StringType(), True),
                    StructField("transaction-id", StringType(), True),
                    StructField("schema-name", StringType(), True),
                    StructField("partition-key-type", StringType(), True),
                    StructField("table-name", StringType(), True),
                    StructField("operation", StringType(), True),
                    StructField("timestamp", TimestampType(), True),
                ]
            ),
            True,
        ),
    ]
)

stream_name = "cdc-poc-stream"
region = "us-east-1"
checkpoint_location = "s3://<your-bucket>/checkpoints/"  # replace <your-bucket> with your S3 bucket
delta_table_path = "s3://<your-bucket>/employee"  # must match the path used when initializing the Delta table

# Read data from Kinesis
df = (
    spark.readStream.format("kinesis")
    .option("streamName", stream_name)
    .option("region", region)
    .option("startingPosition", "LATEST")
    .load()
    .select(from_json(col("data").cast("string"), data_schema).alias("parsed_data"))
)

# Extract data and metadata columns
data_df = df.select(
    col("parsed_data.data.*"),
    col("parsed_data.metadata.operation"),
    col("parsed_data.metadata.timestamp"),
)


# Function to handle SCD Type 2 operations with is_current and is_deleted
def process_scd_type_2(batch_df, batch_id):
    delta_table = DeltaTable.forPath(spark, delta_table_path)

    # Mark existing rows as expired and set is_current to False
    updates_df = batch_df.filter(col("operation").isin(["insert", "update"])).drop(
        "operation"
    )
    expired_ids = updates_df.select("employee_id").distinct()
    delta_table.alias("target").merge(
        expired_ids.alias("updates"),
        "target.employee_id = updates.employee_id AND target.effective_end IS NULL AND target.is_current = true",
    ).whenMatchedUpdate(
        set={"effective_end": current_timestamp(), "is_current": lit(False)}
    ).execute()

    # Insert new rows with is_current = True
    new_rows_df = updates_df.withColumn(
        "effective_start", lit(current_timestamp()).cast(TimestampType())
    )
    new_rows_df = new_rows_df.withColumn(
        "effective_end", lit(None).cast(TimestampType())
    )
    new_rows_df = new_rows_df.withColumn("is_current", lit(True)).withColumn(
        "is_deleted", lit(False)
    )
    delta_table.alias("target").merge(
        new_rows_df.alias("source"),
        "target.employee_id = source.employee_id AND target.is_current = true",
    ).whenNotMatchedInsertAll().execute()

    # Handle delete operations by setting is_deleted to True and is_current to False
    delete_df = batch_df.filter(col("operation") == "delete").drop("operation")
    if not delete_df.isEmpty():
        delete_ids = delete_df.select("employee_id").distinct()
        delta_table.alias("target").merge(
            delete_ids.alias("deletes"),
            "target.employee_id = deletes.employee_id AND target.is_current = true",
        ).whenMatchedUpdate(
            set={
                "effective_end": current_timestamp(),
                "is_current": lit(False),
                "is_deleted": lit(True),
            }
        ).execute()


# Write data to Delta table
query = (
    data_df.writeStream.foreachBatch(
        process_scd_type_2
    )  # ForeachBatch handles the streaming operations
    .option("checkpointLocation", checkpoint_location)
    .start(delta_table_path)
)

query.awaitTermination()
job.commit()

 

Step 6: Register the Delta Table in the AWS Glue Catalog and Query using Athena 

After creating the Delta table and streaming data into it, registering the table in the AWS Glue Catalog allows for easy management and discovery of its metadata. 

This enables seamless querying through Amazon Athena, allowing you to run SQL queries on the real-time data without managing infrastructure, providing an efficient and scalable solution for analytics. 

Register the Delta Table in AWS Glue Catalog 

Create the table manually using Athena: 
  • Open the Amazon Athena Console. 
  • Navigate to the query editor and select the appropriate Glue database. 
  • Use the following SQL statement to create an external table that points to your Delta table’s location in S3: 
CREATE EXTERNAL TABLE deltadatabase.employee
LOCATION 'delta-table-location'
TBLPROPERTIES (
    'table_type' = 'DELTA'
);

 

Verify Table Creation: 
  • After executing the query, verify that the table is listed under the specified database in the Athena console. 
Run SQL Queries: 
  • Use the Athena query editor to query the table. For example: 
SELECT * FROM deltadatabase.employee;

-- Fetch Active Records
SELECT * FROM deltadatabase.employee WHERE is_current = true;

 

Optional: Connect with Amazon QuickSight 

You can connect the Athena table to Amazon QuickSight for data visualization. Simply add Athena as a data source in QuickSight, select the table, and build interactive dashboards for real-time insights. 

Conclusion 

Change Data Capture (CDC) is a crucial technique for capturing and tracking real-time changes in data, and it underpins modern data pipelines. This guide outlined the steps to set up CDC for a MySQL database hosted on AWS RDS, using AWS Database Migration Service (DMS) to stream changes to Amazon Kinesis. The process involves enabling binary logging on RDS, setting up a Kinesis stream, configuring AWS DMS endpoints, and creating a replication task. 
For further processing, AWS Glue lets you transform and store this data in a Delta Lake table, which can then be registered in the AWS Glue Catalog and queried using Amazon Athena. Optionally, you can visualize the data with Amazon QuickSight for deeper insights. This setup delivers efficient real-time data processing, enabling powerful analytics and decision-making. 

Sai Purushoth Ganesan
