Customers often seek architecture guidance on building streaming extract, transform, load (ETL) pipelines to destination targets such as Amazon Redshift.
This post outlines the architecture pattern for creating a streaming data pipeline using Amazon Managed Streaming for Apache Kafka (Amazon MSK).
Streaming ETL and batch ETL are both approaches to data integration, processing, and analysis: streaming ETL processes events continuously as they arrive, whereas batch ETL processes accumulated data on a schedule.
We begin with an architectural overview to understand the various components involved in the streaming data pipeline.
A streaming pipeline begins with raw, asynchronous event streams and ends with structured, query-optimized tables that are ready for analysis.
The example architecture includes the following components:

- A transactional database running on Amazon RDS for SQL Server
- A Debezium connector running on Amazon MSK Connect infrastructure to ingest the data stream
- An Amazon MSK cluster as stream storage
- AWS Glue to handle data transformation and processing between the MSK cluster and the analytics target
- An Amazon Redshift cluster as the final data store for running analytics queries
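To illustrate the capture step, the following is a minimal sketch of a Debezium SQL Server connector configuration that could be supplied to MSK Connect, expressed as a Python dictionary of connector properties. The property names assume Debezium 2.x, and the endpoints, credentials, database, table, and topic names are placeholders rather than values from this post.

```python
# Minimal sketch of a Debezium SQL Server source connector configuration for
# MSK Connect. All endpoints, credentials, and names below are placeholders;
# property names assume Debezium 2.x.
debezium_connector_config = {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "tasks.max": "1",
    "database.hostname": "my-sqlserver.abc123.us-east-1.rds.amazonaws.com",  # RDS endpoint (placeholder)
    "database.port": "1433",
    "database.user": "admin",
    "database.password": "REPLACE_ME",
    "database.names": "salesdb",              # database to capture (placeholder)
    "table.include.list": "dbo.orders",       # tables to stream (placeholder)
    "topic.prefix": "salesdb-cdc",            # prefix for the Kafka topic names
    # Debezium stores schema history in a Kafka topic on the MSK cluster.
    "schema.history.internal.kafka.bootstrap.servers": "b-1.mymskcluster.kafka.us-east-1.amazonaws.com:9098",
    "schema.history.internal.kafka.topic": "schema-history.salesdb",
    # JSON converters without schemas keep downstream parsing simple.
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
}
```

This dictionary corresponds to the connector configuration that MSK Connect accepts when you create the connector, for example through the console or the CreateConnector API.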
Apache Kafka's flexibility makes it effective for handling real-time data across a wide range of industry use cases. Furthermore, Amazon Redshift integrates with Amazon MSK, facilitating low-latency, high-speed ingestion of streaming data directly into an Amazon Redshift materialized view.
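As a hedged sketch of that direct-ingestion option (an alternative to routing every record through AWS Glue), the following Python snippet uses the Amazon Redshift Data API to create an external schema backed by the MSK cluster and a materialized view over the CDC topic. The cluster identifier, database, user, IAM role, cluster ARN, and topic name are assumptions for illustration only.

```python
import time

import boto3

# Sketch: set up Amazon Redshift streaming ingestion from Amazon MSK through
# the Redshift Data API. All identifiers, ARNs, and names are placeholders.
redshift_data = boto3.client("redshift-data", region_name="us-east-1")


def run(sql):
    """Submit a statement and wait for it to finish (the Data API is asynchronous)."""
    stmt = redshift_data.execute_statement(
        ClusterIdentifier="my-redshift-cluster",  # placeholder
        Database="dev",                           # placeholder
        DbUser="awsuser",                         # placeholder
        Sql=sql,
    )
    while True:
        desc = redshift_data.describe_statement(Id=stmt["Id"])
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            return desc
        time.sleep(1)


# External schema that maps to the MSK cluster (IAM authentication assumed).
run("""
CREATE EXTERNAL SCHEMA msk_sales
FROM MSK
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-msk-role'
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:us-east-1:123456789012:cluster/my-msk-cluster/1111-2222';
""")

# Materialized view that ingests the CDC topic; kafka_value holds the record payload.
run("""
CREATE MATERIALIZED VIEW orders_stream AUTO REFRESH YES AS
SELECT kafka_partition, kafka_offset, kafka_timestamp,
       JSON_PARSE(kafka_value) AS payload
FROM msk_sales."salesdb-cdc.dbo.orders"
WHERE CAN_JSON_PARSE(kafka_value);
""")
```

With AUTO REFRESH enabled, Amazon Redshift keeps the materialized view up to date as new records arrive on the topic.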
To implement this solution, you should clearly define your streaming ETL requirements, including the data sources, transformation logic, and the destination data store.
Organizations use both streaming ETL and batch ETL, depending on the nature of their data processing needs; this solution focuses on the streaming approach.
In this example architecture, we construct a pipeline that reads from an Amazon RDS for SQL Server database as the data source and writes the transformed data into an Amazon Redshift cluster.
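To make the transformation stage concrete, the following is a minimal sketch of an AWS Glue streaming job for this pattern, assuming one Glue connection to the MSK cluster and another to the Redshift cluster. The connection names, topic, column names, S3 paths, and target table are hypothetical placeholders, and the transformation shown is only an example.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the Debezium change stream from the MSK topic as a streaming DataFrame.
cdc_stream = glue_context.create_data_frame.from_options(
    connection_type="kafka",
    connection_options={
        "connectionName": "msk-connection",     # Glue connection to the MSK cluster (placeholder)
        "topicName": "salesdb-cdc.dbo.orders",  # CDC topic produced by Debezium (placeholder)
        "startingOffsets": "earliest",
        "classification": "json",
        "inferSchema": "true",
    },
    transformation_ctx="cdc_stream",
)


def process_batch(data_frame, batch_id):
    """Transform each micro-batch and load it into Amazon Redshift."""
    if data_frame.count() == 0:
        return
    # Example transformation: keep only the Debezium "after" image columns the
    # analytics table needs (column names are hypothetical).
    transformed = data_frame.selectExpr("after.order_id", "after.amount", "after.order_ts")
    dynamic_frame = DynamicFrame.fromDF(transformed, glue_context, "orders_batch")
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=dynamic_frame,
        catalog_connection="redshift-connection",               # Glue connection to Redshift (placeholder)
        connection_options={"dbtable": "public.orders", "database": "dev"},
        redshift_tmp_dir="s3://my-glue-temp-bucket/redshift/",  # staging location (placeholder)
    )


# Process the stream in micro-batches.
glue_context.forEachBatch(
    frame=cdc_stream,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-glue-temp-bucket/checkpoints/orders/",
    },
)
job.commit()
```

The job reads the CDC topic as a streaming DataFrame, applies the transformation to each micro-batch, and loads the result into Amazon Redshift through the Glue connection, using the S3 path as a staging area for the load.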