techminis

A naukri.com initiative

Image Credit: Amazon

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

  • Managing high-quality data while vetting the reliability of continuously arriving data is a challenge for data-driven organizations. AWS Glue manages data quality through the Data Quality Definition Language (DQDL), and Apache Iceberg is an open table format for data lake management. The dead-letter queue (DLQ) and Write-Audit-Publish (WAP) strategies are both used for vetting data quality in streaming environments, each with its own advantages. The DLQ approach efficiently segregates high-quality data by redirecting problematic entries to a separate dead-letter queue as they arrive. The WAP pattern, on the other hand, uses Iceberg's branching feature to isolate problematic entries so that only clean data is published to the main branch. Because WAP is a multistep process, it introduces latency for downstream consumers, and its implementation-dependent workflow requires more sophisticated orchestration than the DLQ approach.
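The DLQ approach described above can be sketched in a few lines of plain Python. The record shape and the validity rule are illustrative assumptions, not an AWS API: clean records flow on to the main sink, while problematic entries are diverted to a separate dead-letter queue for later inspection.

```python
# Minimal sketch of DLQ routing: validate each incoming record immediately
# and divert failures to a dead-letter queue (illustrative, not an AWS API).

def route_records(records, is_valid):
    """Split a batch into clean records and dead-letter entries."""
    clean, dlq = [], []
    for rec in records:
        (clean if is_valid(rec) else dlq).append(rec)
    return clean, dlq

# Example: temperature readings must fall within a plausible range.
readings = [
    {"room": "kitchen", "temp": 22.5},
    {"room": "attic", "temp": 99.0},   # out of range, goes to the DLQ
]
clean, dlq = route_records(readings, lambda r: -10 <= r["temp"] <= 50)
```

Because validation happens inline, clean data incurs no extra publishing step, which is the latency advantage the article attributes to DLQ over WAP.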
  • The WAP pattern implements a three-stage process: write, audit, and publish. Data is first written to a staging branch, data quality checks are performed on that branch, and validated data is then merged into the main branch for consumption. Iceberg's branching feature is particularly well suited to implementing WAP efficiently: each branch can be referenced and updated independently, while ACID transactions and schema evolution help handle multiple concurrent writers and data with varying schemas.
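The three stages above can be sketched with Iceberg's Spark SQL extensions. This is a minimal sketch, assuming an Iceberg-enabled Spark session; the catalog, table, and branch names are illustrative:

```sql
-- Write: stage incoming data on an audit branch instead of main.
ALTER TABLE glue_catalog.db.room_readings CREATE BRANCH audit;
INSERT INTO glue_catalog.db.room_readings.branch_audit
SELECT * FROM incoming_readings;

-- Audit: run quality checks against the staging branch only.
SELECT count(*) AS bad_rows
FROM glue_catalog.db.room_readings.branch_audit
WHERE temperature NOT BETWEEN -10 AND 50;

-- Publish: fast-forward main to the audited branch's state.
CALL glue_catalog.system.fast_forward('db.room_readings', 'main', 'audit');
```

Readers of the main branch see nothing until the final `fast_forward`, which is what keeps unvetted data invisible to downstream consumers.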
  • An example use case demonstrating Iceberg branching and AWS Glue Data Quality is a home monitoring system that tracks room temperature and humidity. Incoming readings are evaluated for quality before being visualized, so that only qualifying room data is used for further analysis. The quality checks use AWS Glue Data Quality, and the room data is evaluated against a rule set specifying a normal temperature range of -10 to 50 degrees. During the audit phase, a dedicated audit branch containing only valid room data is created. Finally, the publish phase fast-forwards the validated data to the main branch, after which it is ready for use by downstream applications.
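The temperature check described above could be expressed as a DQDL rule set along these lines (a sketch assuming the article's -10 to 50 degree range; the column name is an assumption):

```
Rules = [
    ColumnValues "temperature" between -10 and 50
]
```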
