The Amazon Timehub team built a data replication framework using AWS Database Migration Service (AWS DMS) to replicate data from an Oracle database to Amazon Aurora PostgreSQL-Compatible Edition. In this post, we explain our approach to addressing the resilience of ongoing replication. We created a reliability benchmark testing and data validation framework that exercises resilience mechanisms such as failover, scaling, and monitoring, to make sure the solution can handle disruptions and produce accurate alerts.
As part of our vision to build a resilient data replication framework, we focused on resilient architecture, monitoring, and data validation. Resiliency had to be addressed across the source system, AWS DMS, the network, and Aurora PostgreSQL.
The failure scenarios we tested were failures at the source, failures in AWS DMS processing, and failures at the target. We wanted to verify how AWS DMS reacts to these scenarios and whether it can handle them gracefully without requiring manual intervention.
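One way to exercise the AWS DMS failure scenario in a test harness is to force a Multi-AZ failover of the replication instance and observe how the replication task recovers. The sketch below is a minimal, hypothetical example (the ARN is a placeholder, and the actual API call is left commented out); it uses the `RebootReplicationInstance` operation with `ForceFailover`, which is the documented way to trigger a planned failover:

```python
def failover_reboot_params(replication_instance_arn: str) -> dict:
    """Build parameters for rebooting a Multi-AZ DMS replication
    instance with a forced failover, to simulate an AZ-level failure."""
    return {
        "ReplicationInstanceArn": replication_instance_arn,
        "ForceFailover": True,  # promote the standby instead of a simple reboot
    }


if __name__ == "__main__":
    # Placeholder ARN for illustration only; substitute a real one.
    params = failover_reboot_params(
        "arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE"
    )
    # With boto3 installed and credentials configured, the test would run:
    # import boto3
    # dms = boto3.client("dms")
    # dms.reboot_replication_instance(**params)
```

After triggering the failover, the benchmark can compare source and target row counts once the task resumes, confirming that no changes were lost during the disruption.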
AWS DMS publishes metrics to Amazon CloudWatch to help measure propagation delay from source to target. In addition, Aurora offers monitoring of system metrics; we set alarms on 12 such metrics to monitor CPU utilization, read and write IOPS, and disk queue depth. AWS DMS performance degrades when records are fetched from disk rather than from memory, so monitoring is key.
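As an illustration of the alerting described above, the sketch below defines a CloudWatch alarm on the `CDCLatencyTarget` metric that AWS DMS publishes in the `AWS/DMS` namespace. The alarm name, threshold, and identifiers are assumptions for illustration; the actual `put_metric_alarm` call is left commented out:

```python
def latency_alarm_params(task_id: str, instance_id: str,
                         threshold_seconds: float = 300.0) -> dict:
    """Build a CloudWatch alarm definition that fires when DMS target
    latency stays above the threshold for 5 consecutive minutes."""
    return {
        "AlarmName": f"dms-{task_id}-target-latency",  # assumed naming scheme
        "Namespace": "AWS/DMS",
        "MetricName": "CDCLatencyTarget",
        "Dimensions": [
            {"Name": "ReplicationInstanceIdentifier", "Value": instance_id},
            {"Name": "ReplicationTaskIdentifier", "Value": task_id},
        ],
        "Statistic": "Average",
        "Period": 60,              # evaluate one-minute averages
        "EvaluationPeriods": 5,    # require 5 breaching periods before alarming
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# With boto3 installed and credentials configured:
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_alarm(**latency_alarm_params("my-task", "my-instance"))
```

A companion alarm on `CDCLatencySource` distinguishes delays in reading from Oracle from delays in applying changes to Aurora PostgreSQL.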
We built a custom monitoring framework to publish custom metrics to CloudWatch that aren't available out of the box. This helps engineers analyze and identify the underlying causes of replication lag and related issues.
In conclusion, with the fault-resilient framework we built for data replication using AWS DMS and Aurora PostgreSQL-Compatible, we can avoid data integrity issues and impacts to downstream systems. The key metrics we monitor let us detect issues early and react to them in a controlled manner.