
RDS to RDS Migration: The Story of Migrating 2 Billion Records


Migrating billions of records

We had an interesting use case from a leading global tracking-solutions provider: migrating billions of records of GPS tracking data.

This GPS tracking data is critical for the customer because it drives important business analytics. As of March 2021, they had around 2 billion records spread across hundreds of MySQL databases and nearly 100,000 tables, plus Amazon DynamoDB tables holding more than 1.5 TB of items. From this data they prepare reports, run analytics, and feed their applications and software. The data was scattered across multiple stores: on-premises data centers, Amazon RDS (Relational Database Service), Amazon DynamoDB, and more. Pulling proper reports from so many sources was becoming increasingly time-consuming and costly, and as the business scaled worldwide it hurt productivity in a big way.

To solve this problem, the data from these multiple sources had to be consolidated into a single master source that could serve reporting, analytics, and the applications.

How did we do it?

To consolidate the data, we performed several stages of data processing and migration, moving data from multiple sources into a single master database in Amazon Aurora MySQL (RDS). We took a step-by-step approach involving the following migration scenarios:

  • RDS to RDS Migration
  • DynamoDB to RDS Migration
  • Target Table Schema Changes

In this article, we discuss in detail our approach, the challenges, and the solution implemented for the RDS-to-RDS migration.

RDS-to-RDS Migration

The RDS data was spread across multiple databases and tables, which made it difficult to prepare reports and manipulate data: detailed application reports required joining tables across different databases.

Applying Data Transformation

Upon analysis, we found that only certain attributes from specific tables and databases were needed for reporting. We had to transform the data and then migrate it into a single table in the new target database.

While migrating the historical data, we would simultaneously push new live data into the new database.

Output-table Schema:

There were nearly one hundred thousand tables in the source databases. From these, we picked forty to fifty columns, filtered and manipulated the values, and assembled them into a single table as the final output. The output table schema has only forty-three columns, the ones most needed for application reports and analysis.

Some Factual Numbers:

Total No. of Tables: 94,562
Total No. of Records Migrated: 927,271,108

 

Our Approach:

Initially, we planned to fetch the required data, transform it, and export it in chunks as CSV files stored in Amazon S3, then import them into a single table in the target database.
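For context, a minimal sketch of that first plan, assuming PySpark for the extract/export step and Aurora MySQL's LOAD DATA FROM S3 for the import; the bucket, tables, and connection details are illustrative, not the actual setup:

# Sketch of the initial (later abandoned) plan: dump extracted chunks to S3 as CSV,
# then bulk-import them into the single target table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-export-sketch").getOrCreate()

src = (spark.read.format("jdbc")                      # illustrative source read
       .option("url", "jdbc:mysql://source-reader.example:3306/source_db")
       .option("dbtable", "tracking_data")
       .option("user", "reader")
       .option("password", "******")
       .load())

# Write the extracted (and, in the real plan, transformed) data as CSV chunks in S3.
src.write.mode("append").csv("s3://example-migration-bucket/export/")

# The import side would then run something along the lines of Aurora MySQL's
#   LOAD DATA FROM S3 PREFIX 's3://example-migration-bucket/export/' INTO TABLE target_table ...
# which is where the table-lock problem described below shows up.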

The Challenge:

The target table stays locked for the entire import period, which makes it impossible to feed live data into it during that time.

We therefore took an alternative approach: extract the data from the source tables, transform it, and load it directly into the target. This is done with Spark on Amazon EMR as batch processing, and it also overcomes the table-lock issue.

Further Developments:

We decided to use Amazon EMR (Elastic MapReduce) Service to achieve our goal here.

As shown in the architecture diagram above, we used AWS Lambda, an Amazon CloudWatch Events rule, and Amazon EMR to carry out this migration successfully.

AWS Services – Functionalities:

Each of these AWS services has a specific role to play in the overall migration process.

  • Amazon EMR: Extracts the data from the source tables, performs the transformation, and then loads the transformed data into the target database table.
  • Amazon CloudWatch Events rule: Triggers the Lambda function once every two minutes.
  • AWS Lambda function: Drives the migration process with the help of a temporary tracking table (a minimal sketch follows this list).
    • It reads the table/account that needs to be migrated from the temporary table and feeds that as input to the EMR Spark job.
    • Once the given data has been migrated by EMR, the Lambda function updates the status for that entry in the temporary table.
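For illustration, a condensed sketch of what such a driver Lambda function might look like, assuming a boto3 EMR client, a pymysql connection to the tracking table, and purely illustrative names (a migration_tracking table, environment variables for credentials, an existing EMR cluster ID); none of these identifiers come from the actual setup:

import os
import boto3
import pymysql

emr = boto3.client("emr")

def handler(event, context):
    # Connect to the database that holds the temporary tracking table
    # (pymysql is assumed to be bundled with the deployment package).
    conn = pymysql.connect(
        host=os.environ["TARGET_DB_HOST"],
        user=os.environ["TARGET_DB_USER"],
        password=os.environ["TARGET_DB_PASSWORD"],
        database="migration_meta",
    )
    with conn.cursor() as cur:
        # Pick one account that has not been triggered yet (triggered = 0).
        cur.execute("SELECT accountId FROM migration_tracking WHERE triggered = 0 LIMIT 1")
        row = cur.fetchone()
        if row is None:
            return {"status": "nothing to migrate"}
        account_id = row[0]

        # Submit a Spark step to the long-running EMR cluster for this account.
        emr.add_job_flow_steps(
            JobFlowId=os.environ["EMR_CLUSTER_ID"],
            Steps=[{
                "Name": f"migrate-account-{account_id}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://example-bucket/jobs/migrate.py",
                             "--account-id", str(account_id)],
                },
            }],
        )

        # Mark the account as triggered so the next invocation skips it.
        cur.execute(
            "UPDATE migration_tracking SET triggered = 1, start_time = NOW() "
            "WHERE accountId = %s",
            (account_id,),
        )
    conn.commit()
    conn.close()
    return {"status": "triggered", "accountId": account_id}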

Pre-requisites:

The deployment of all the AWS resources in the environment was implemented with a Pulumi script (a minimal Pulumi sketch follows the list below). The important prerequisites to put in place before the migration were:

  • Target RDS database setup:
    • The new target DB must be freshly created, with its own custom KMS key, parameter group, security group, VPC (Virtual Private Cloud), subnets, etc.
    • Prepare the table schema with all the required attributes, data types, indexes, etc., and create the target table.
    • Once the table is ready, repoint the live data streaming to it.
  • Source RDS database setup:
    • Add new reader instances to the existing source RDS cluster to support and handle the heavy volume of read queries during the migration process.
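For illustration, a minimal Pulumi (Python) sketch of the kind of target Aurora MySQL cluster described above; the resource names, instance class, and network references are assumptions, not the actual configuration:

import pulumi_aws as aws

# Freshly created target Aurora MySQL cluster, encrypted with a custom KMS key.
kms_key = aws.kms.Key("target-db-key", description="KMS key for the target Aurora cluster")

cluster = aws.rds.Cluster(
    "target-cluster",
    engine="aurora-mysql",
    database_name="consolidated",
    master_username="admin",
    master_password="change-me",                      # in practice, pull this from a secret store
    storage_encrypted=True,
    kms_key_id=kms_key.arn,
    db_subnet_group_name="target-db-subnets",         # assumed existing subnet group
    vpc_security_group_ids=["sg-0123456789abcdef0"],  # assumed security group
)

# Writer instance for the cluster; reader instances can be added the same way.
writer = aws.rds.ClusterInstance(
    "target-writer",
    cluster_identifier=cluster.id,
    engine="aurora-mysql",
    instance_class="db.r5.large",
)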

Once these were in place, a temporary table was created in the target database to track the status of the data migration. This table holds the account id (fetched from the main source table, since the migration is tracked per account), the starting and ending timestamps, the migration status, error indications, the elapsed time, the number of records migrated so far, and so on.
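A rough sketch of such a tracking table, created here via pymysql; apart from accountId, triggered, totalCount, and insertCount, which this article refers to, the table and column names are illustrative:

import pymysql

DDL = """
CREATE TABLE IF NOT EXISTS migration_tracking (
    accountId     BIGINT       NOT NULL PRIMARY KEY,  -- account whose data is migrated
    triggered     TINYINT      NOT NULL DEFAULT 0,    -- 0 = not started, 1 = triggered
    status        VARCHAR(32)  NULL,                  -- e.g. RUNNING / COMPLETED / FAILED
    error_message TEXT         NULL,                  -- error indication, if any
    start_time    DATETIME     NULL,
    end_time      DATETIME     NULL,
    elapsed_secs  INT          NULL,
    totalCount    BIGINT       NULL,                  -- rows found in the source
    insertCount   BIGINT       NULL                   -- rows actually inserted in the target
)
"""

conn = pymysql.connect(host="target-cluster.example", user="admin",
                       password="******", database="migration_meta")
with conn.cursor() as cur:
    cur.execute(DDL)
conn.commit()
conn.close()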

 

Migration Steps:

  • First, the event rule triggers the Lambda function, which reads the account id and the triggered attribute value from the temporary tracking table. If the triggered value is 0 (meaning the migration has not been started yet), it feeds the accountId as input to the EMR Spark job; otherwise it skips that account id.
  • The Spark application fetches the required columns for the account id that needs to be migrated to the target database.
    • It performs the transformations on those column values and loads them into the target database table as a batch process.
    • Each account id has several thousands of rows in the source databases, so each one takes several hours to complete the migration.
  • Once the migration of a particular account id is completed, the Lambda function updates the remaining columns in the temporary tracking table.

The above process is repeated for all the accountIds until every triggered column value becomes 1.
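A condensed PySpark sketch of this per-account extract-transform-load flow; the databases, tables, columns, join, and JDBC settings shown here are illustrative assumptions rather than the actual job:

import argparse
from pyspark.sql import SparkSession, functions as F

parser = argparse.ArgumentParser()
parser.add_argument("--account-id", required=True)
args = parser.parse_args()

spark = SparkSession.builder.appName(f"migrate-{args.account_id}").getOrCreate()

jdbc_src = "jdbc:mysql://source-reader.example:3306/source_db"
jdbc_dst = "jdbc:mysql://target-cluster.example:3306/consolidated"

def read_table(url, query):
    # Push the per-account filter down to MySQL instead of pulling whole tables.
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("query", query)
            .option("user", "migrator")
            .option("password", "******")
            .load())

positions = read_table(jdbc_src,
    f"SELECT accountId, deviceId, latitude, longitude, recordedAt "
    f"FROM positions WHERE accountId = {int(args.account_id)}")
devices = read_table(jdbc_src,
    f"SELECT deviceId, deviceName FROM devices WHERE accountId = {int(args.account_id)}")

# Example transformation: join, derive, and keep only the columns the target schema needs.
out = (positions.join(devices, "deviceId", "left")
       .withColumn("recordedDate", F.to_date("recordedAt"))
       .select("accountId", "deviceId", "deviceName",
               "latitude", "longitude", "recordedAt", "recordedDate"))

# Batch-write into the single consolidated target table.
(out.write.format("jdbc")
    .option("url", jdbc_dst)
    .option("dbtable", "tracking_consolidated")
    .option("user", "migrator")
    .option("password", "******")
    .option("batchsize", "10000")
    .mode("append")
    .save())

spark.stop()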

Challenges Involved:

  • Charset Issue
    • The target table's charset/collation was latin1, which cannot store certain characters. In one of the source tables, some "address" values contained Korean/Chinese text (a data issue caused by Google-translating latitude/longitude into addresses).
    • As a result, such records either could not be inserted into the target table at all or ended up as gibberish in the address column.
    • The challenge is that a normal ALTER query locks the whole table until the conversion completes, and with millions of records that would mean at least two hours of downtime.
    • To handle this, Percona Toolkit's pt-online-schema-change (https://www.percona.com/doc/percona-toolkit/3.0/pt-online-schema-change.html) was used to alter the charset without blocking queries or locking the table:

pt-online-schema-change --alter "CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci" ...

  • Query-Bottleneck Issue
    • A JOIN SQL statement between two different tables was written to extract the data.
    • The query itself did not look problematic, and it worked fine for the table with the large data set; but when a table smaller than the parent table was joined, it became slow.
    • To get rid of this problem, we disabled the block_nested_loop optimizer switch in the RDS parameter group (see the sketch after this list).
  • CPU Utilization Issue
    • Aurora MySQL 5.6.22 could not handle a heavy load of batch write requests, which reduced the performance of the running queries.
    • This is a known issue in MySQL 5.6 that was resolved in a later version (ref).
    • We worked around it by decreasing the number of threads and instances processing the data, which was somewhat time-consuming.
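A hedged sketch of how block_nested_loop can be switched off through the DB parameter group using boto3; the parameter group name is an assumption, and in MySQL 5.6 this flag lives inside the optimizer_switch parameter:

import boto3

rds = boto3.client("rds")

# Disable the block-nested-loop join strategy for all instances that use this
# (assumed) parameter group; dynamic parameters like this apply without a reboot.
rds.modify_db_parameter_group(
    DBParameterGroupName="source-aurora-mysql56-params",   # assumed name
    Parameters=[{
        "ParameterName": "optimizer_switch",
        "ParameterValue": "block_nested_loop=off",
        "ApplyMethod": "immediate",
    }],
)

# The effective value can be verified per session with:  SELECT @@optimizer_switch;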

 

How did we ensure Zero Data loss and Data Integrity?

To ensure zero data loss, we compared the sum of totalCount with the sum of insertCount across all accounts; if the two were equal, the migration was considered to have zero data loss.
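A small sketch of that check, run against the illustrative tracking table from earlier:

import pymysql

conn = pymysql.connect(host="target-cluster.example", user="admin",
                       password="******", database="migration_meta")
with conn.cursor() as cur:
    # Compare rows found in the source against rows inserted into the target.
    cur.execute("SELECT SUM(totalCount), SUM(insertCount) FROM migration_tracking")
    total, inserted = cur.fetchone()
conn.close()

print(f"source rows: {total}, inserted rows: {inserted}")
assert total == inserted, "data loss detected: counts do not match"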

For data integrity, we generated a few sample reports for a particular period of time using both the old database tables and the newly migrated tables, ran a few API calls against each, and compared the results manually to make sure there were no differences between them.

Conclusion

This entire effort helped the client reduce their analysis time, consolidate multiple data sources, and generate meaningful reports for better business outcomes. It took nearly a month of effort to build the solution and almost 80 hours of runtime to migrate more than 1.25 TB of data, close to 1 billion records. It was a challenging task, considering the importance of the data.

In this article, we discussed the RDS-to-RDS migration in brief; in our upcoming articles, we will cover how we went about the DynamoDB-to-RDS migration and the target table schema changes.

 

 

 

 

Chandraleka Ambi

Hariharan Krishnamurthi

Shankar Dhandapani