Building and operating data pipelines at scale using CI/CD, Amazon MWAA and Apache Spark on Amazon EMR by Wipro

  • Wipro addresses the challenges businesses face in managing data pipelines by developing a programmatic data processing framework.
  • The framework integrates the Amazon EMR runtime for Apache Spark with AWS managed services for scalability and automation.
  • It streamlines ETL processes by orchestrating job processing, data validation, transformation, and loading into specified targets (see the PySpark sketch after this list).
  • Components include Amazon MWAA, Amazon EMR on Amazon EC2, Amazon CloudWatch, Amazon S3, and an Amazon EC2 instance hosting the Jenkins build server.
  • CI/CD pipelines automate deployment: a push to the Git repository triggers a Jenkins build that packages artifacts for use on Amazon EMR (a deployment sketch follows this list).
  • Amazon MWAA handles data pipeline orchestration, scheduling, and execution using Apache Airflow (a sample DAG also follows this list).
  • Fault tolerance is enhanced by the ability to recover data after an Amazon EMR cluster terminates, ensuring job continuity.
  • The solution offers scalability, flexibility for customization, support for various file formats, concurrent execution, and proactive error notification.
  • Average DAG completion time is 15–20 minutes, handling 18 ETL processes concurrently with large record volumes.
  • The framework by Wipro leverages AWS services to provide cost-effective, scalable, and automated data processing solutions.
  • Users are encouraged to use Amazon MWAA to run ETL jobs on the Amazon EMR runtime for Apache Spark.
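
To make the validate-transform-load flow described in the list above concrete, here is a minimal PySpark sketch of one such ETL step. The bucket names, column names, and validation rule are hypothetical placeholders for illustration; the article does not publish Wipro's actual job code.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical S3 locations; substitute your own buckets and prefixes.
    SOURCE_PATH = "s3://example-raw-bucket/orders/"
    TARGET_PATH = "s3://example-curated-bucket/orders/"

    spark = SparkSession.builder.appName("example-etl-step").getOrCreate()

    # Extract: read raw input (CSV here; the framework reportedly supports several file formats).
    raw_df = spark.read.option("header", "true").csv(SOURCE_PATH)

    # Validate: keep records that have a primary key and park the rejects for inspection.
    valid_df = raw_df.filter(F.col("order_id").isNotNull())
    rejects_df = raw_df.filter(F.col("order_id").isNull())
    rejects_df.write.mode("overwrite").parquet(TARGET_PATH + "rejected/")

    # Transform: cast types and derive a partition column.
    transformed_df = (
        valid_df
        .withColumn("order_amount", F.col("order_amount").cast("double"))
        .withColumn("order_date", F.to_date("order_timestamp"))
    )

    # Load: write the curated output, partitioned by date, to the specified target.
    transformed_df.write.mode("overwrite").partitionBy("order_date").parquet(TARGET_PATH + "curated/")

    spark.stop()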
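The CI/CD bullet can be pictured with a small deployment step of the kind a Jenkins build stage might invoke after a Git push, copying the packaged artifacts to Amazon S3 so that Amazon EMR and Amazon MWAA can pick them up. The local paths and bucket names below are assumptions for illustration only.

    import boto3

    # Hypothetical build outputs mapped to their S3 destinations; a Jenkins stage
    # might run this after packaging the Spark job and its Airflow DAG from Git.
    ARTIFACTS = {
        "dist/etl_job.py": ("example-artifacts-bucket", "jobs/etl_job.py"),
        "dags/example_emr_etl.py": ("example-mwaa-bucket", "dags/example_emr_etl.py"),
    }

    s3 = boto3.client("s3")

    for local_path, (bucket, key) in ARTIFACTS.items():
        # Upload each built artifact so the next DAG run uses the new code.
        s3.upload_file(local_path, bucket, key)
        print(f"Deployed {local_path} -> s3://{bucket}/{key}")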
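Finally, the orchestration bullet: a minimal Airflow DAG of the kind typically deployed to Amazon MWAA, assuming the Amazon provider package is installed. It submits a Spark step to an existing EMR cluster and waits for it to finish; the cluster ID, script path, and schedule are illustrative, not the configuration described in the article.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    # Hypothetical EMR cluster ID and job artifact location; replace with your own.
    EMR_CLUSTER_ID = "j-XXXXXXXXXXXXX"
    SPARK_STEPS = [
        {
            "Name": "example-etl-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-artifacts-bucket/jobs/etl_job.py",
                ],
            },
        }
    ]

    with DAG(
        dag_id="example_emr_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Submit the Spark job as a step on the running EMR cluster.
        add_step = EmrAddStepsOperator(
            task_id="add_spark_step",
            job_flow_id=EMR_CLUSTER_ID,
            steps=SPARK_STEPS,
            aws_conn_id="aws_default",
        )

        # Block until the step completes so failures surface in the DAG run.
        watch_step = EmrStepSensor(
            task_id="watch_spark_step",
            job_flow_id=EMR_CLUSTER_ID,
            step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step', key='return_value')[0] }}",
            aws_conn_id="aws_default",
        )

        add_step >> watch_step

Task-level retries and keeping intermediate data on Amazon S3 rather than on cluster-local storage are the usual ways a DAG like this stays recoverable if the EMR cluster terminates mid-run, in line with the fault-tolerance point above.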
