Building and operating data pipelines at scale using CI/CD, Amazon MWAA and Apache Spark on Amazon EMR by Wipro

  • Wipro addresses the challenges businesses face in managing data pipelines by developing a programmatic data processing framework.
  • The framework integrates the Amazon EMR runtime for Apache Spark with AWS managed services for scalability and automation.
  • It streamlines ETL processes by orchestrating job processing, data validation, transformation, and loading into specified targets (see the PySpark sketch after this list).
  • Components include Amazon MWAA, Amazon EMR on Amazon EC2, Amazon CloudWatch, Amazon S3, and an Amazon EC2 instance hosting the Jenkins build server.
  • CI/CD pipelines automate deployment: a push to the Git repository triggers a Jenkins build that packages artifacts for use on Amazon EMR (a deployment sketch follows this list).
  • Amazon MWAA handles data pipeline orchestration, scheduling, and execution using Apache Airflow (a sample DAG also follows this list).
  • Fault tolerance is enhanced by the ability to recover data after an Amazon EMR cluster terminates, ensuring job continuity.
  • The solution offers scalability, flexibility for customization, support for various file formats, concurrent execution, and proactive error notification.
  • Average DAG completion time is 15–20 minutes, handling 18 ETL processes concurrently with large record volumes.
  • The framework by Wipro leverages AWS services to provide cost-effective, scalable, and automated data processing solutions.
  • Users are encouraged to use Amazon MWAA to run ETL jobs on the Amazon EMR runtime for Apache Spark.
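
To make the validate-transform-load flow described in the list above concrete, here is a minimal PySpark sketch of one such ETL step. The bucket names, column names, and validation rule are hypothetical placeholders for illustration; the article does not publish Wipro's actual job code.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical S3 locations; substitute your own buckets and prefixes.
    SOURCE_PATH = "s3://example-raw-bucket/orders/"
    TARGET_PATH = "s3://example-curated-bucket/orders/"

    spark = SparkSession.builder.appName("example-etl-step").getOrCreate()

    # Extract: read raw input (CSV here; the framework reportedly supports several file formats).
    raw_df = spark.read.option("header", "true").csv(SOURCE_PATH)

    # Validate: keep records that have a primary key and park the rejects for inspection.
    valid_df = raw_df.filter(F.col("order_id").isNotNull())
    rejects_df = raw_df.filter(F.col("order_id").isNull())
    rejects_df.write.mode("overwrite").parquet(TARGET_PATH + "rejected/")

    # Transform: cast types and derive a partition column.
    transformed_df = (
        valid_df
        .withColumn("order_amount", F.col("order_amount").cast("double"))
        .withColumn("order_date", F.to_date("order_timestamp"))
    )

    # Load: write the curated output, partitioned by date, to the specified target.
    transformed_df.write.mode("overwrite").partitionBy("order_date").parquet(TARGET_PATH + "curated/")

    spark.stop()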
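The CI/CD bullet can be pictured with a small deployment step of the kind a Jenkins build stage might invoke after a Git push, copying the packaged artifacts to Amazon S3 so that Amazon EMR and Amazon MWAA can pick them up. The local paths and bucket names below are assumptions for illustration only.

    import boto3

    # Hypothetical build outputs mapped to their S3 destinations; a Jenkins stage
    # might run this after packaging the Spark job and its Airflow DAG from Git.
    ARTIFACTS = {
        "dist/etl_job.py": ("example-artifacts-bucket", "jobs/etl_job.py"),
        "dags/example_emr_etl.py": ("example-mwaa-bucket", "dags/example_emr_etl.py"),
    }

    s3 = boto3.client("s3")

    for local_path, (bucket, key) in ARTIFACTS.items():
        # Upload each built artifact so the next DAG run uses the new code.
        s3.upload_file(local_path, bucket, key)
        print(f"Deployed {local_path} -> s3://{bucket}/{key}")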
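Finally, the orchestration bullet: a minimal Airflow DAG of the kind typically deployed to Amazon MWAA, assuming the Amazon provider package is installed. It submits a Spark step to an existing EMR cluster and waits for it to finish; the cluster ID, script path, and schedule are illustrative, not the configuration described in the article.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    # Hypothetical EMR cluster ID and job artifact location; replace with your own.
    EMR_CLUSTER_ID = "j-XXXXXXXXXXXXX"
    SPARK_STEPS = [
        {
            "Name": "example-etl-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-artifacts-bucket/jobs/etl_job.py",
                ],
            },
        }
    ]

    with DAG(
        dag_id="example_emr_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Submit the Spark job as a step on the running EMR cluster.
        add_step = EmrAddStepsOperator(
            task_id="add_spark_step",
            job_flow_id=EMR_CLUSTER_ID,
            steps=SPARK_STEPS,
            aws_conn_id="aws_default",
        )

        # Block until the step completes so failures surface in the DAG run.
        watch_step = EmrStepSensor(
            task_id="watch_spark_step",
            job_flow_id=EMR_CLUSTER_ID,
            step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step', key='return_value')[0] }}",
            aws_conn_id="aws_default",
        )

        add_step >> watch_step

Task-level retries and keeping intermediate data on Amazon S3 rather than on cluster-local storage are the usual ways a DAG like this stays recoverable if the EMR cluster terminates mid-run, in line with the fault-tolerance point above.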
