The article describes how the Amazon TimeHub team handled a disruption in the AWS DMS CDC task caused by an Oracle RESETLOGS scenario.
An Oracle RESETLOGS operation resets the log sequence number (LSN) to 1, so AWS DMS fails when it looks for the next LSN it expects.
The article details three options for recovering the failed task, each with its own limitations, and explains how the Amazon TimeHub team chose the third option to minimize the risk of data loss.
The article then describes how the team built an operational framework to detect the RESETLOGS operation and validate data discrepancies caused by failover scenarios.
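
The summary does not say how the team detects a RESETLOGS operation. As a minimal sketch of one possible approach, a monitor could poll Oracle's `V$DATABASE` view (which exposes `RESETLOGS_CHANGE#` and `RESETLOGS_TIME`) and compare the result with the last value it recorded. The `Incarnation` class, the `resetlogs_detected` function, and the polling design are assumptions for illustration, not the team's implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Query a monitor could run against the Oracle source. V$DATABASE is a real
# Oracle view; polling it like this is an assumption, not the article's method.
RESETLOGS_QUERY = "SELECT resetlogs_change#, resetlogs_time FROM v$database"


@dataclass(frozen=True)
class Incarnation:
    """Hypothetical record of the values returned by RESETLOGS_QUERY."""
    resetlogs_change: int
    resetlogs_time: datetime


def resetlogs_detected(last_seen: Optional[Incarnation],
                       current: Incarnation) -> bool:
    """Return True when the source reports a different RESETLOGS marker
    than the one recorded on the previous poll."""
    if last_seen is None:
        # First poll: nothing to compare against, so only record a baseline.
        return False
    return current != last_seen
```

A caller would persist the most recent `Incarnation` between polls and raise an alert (or start validation) when the function returns `True`.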
The RESETLOGS data validation runs independently of AWS DMS, using a custom validation framework that does not rely on redo logs.
The framework queries data from both the source and target environments based on audit columns, adding a buffer around the window of the source failure.
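
To illustrate the audit-column comparison, the following sketch widens the failure window by a buffer and then compares rows fetched from each side. It is an assumption-laden example: the table and column names in the query template, the 30-minute default buffer, and the function names are all hypothetical and do not come from the article.

```python
from datetime import datetime, timedelta

# Hypothetical query template; `id` and `last_updated` stand in for a primary
# key and an audit column, and {table} is filled in by the caller.
WINDOW_QUERY = (
    "SELECT id, last_updated FROM {table} "
    "WHERE last_updated BETWEEN :start_ts AND :end_ts"
)


def validation_window(failure_start: datetime,
                      failure_end: datetime,
                      buffer: timedelta = timedelta(minutes=30)):
    """Widen the source failure window on both sides by `buffer`."""
    return failure_start - buffer, failure_end + buffer


def find_discrepancies(source_rows: dict, target_rows: dict) -> dict:
    """Compare {primary_key: audit_value} mappings from source and target."""
    missing = sorted(k for k in source_rows if k not in target_rows)
    mismatched = sorted(
        k for k in source_rows
        if k in target_rows and source_rows[k] != target_rows[k]
    )
    extra = sorted(k for k in target_rows if k not in source_rows)
    return {
        "missing_in_target": missing,
        "mismatched": mismatched,
        "extra_in_target": extra,
    }
```

Running the same windowed query against both environments and passing the two result sets to `find_discrepancies` yields the keys that need to be re-synced or investigated.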
The article includes a high-level operational workflow diagram that covers the steps to take both when a RESETLOGS failure is detected and when it is not.
The next part of the series will discuss how the team developed a data validation framework to recover from disaster and disruption scenarios while maintaining data integrity between source and target.
The article concludes by listing the authors and their roles in the Amazon TimeHub team.