Engineering teams are increasingly replacing batch data processing pipelines with real-time streaming, and building data lakes that store their data in open formats such as Apache Parquet and Apache Iceberg.
This trend spans many industries: online media and gaming companies, factories monitoring equipment for maintenance and failure, and theme parks providing wait times for popular attractions.
Apache Iceberg is becoming popular among customers storing their data in Amazon S3 data lakes because it allows them to read and write data concurrently using different frameworks.
Amazon Data Firehose simplifies streaming delivery: you configure a delivery stream, select a data source, and set Iceberg tables as the destination.
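As a minimal illustration of that setup, the following boto3 sketch creates a Direct PUT Firehose stream that delivers to a single Iceberg table. The stream name, IAM role and bucket ARNs, and the database and table names are placeholders, and the configuration is pared down to a few fields, so verify the parameters against the current Firehose API reference before using it.

```python
import boto3

firehose = boto3.client("firehose")

# Minimal sketch: a Direct PUT stream delivering to one Iceberg table.
# All names and ARNs below are placeholders for illustration.
firehose.create_delivery_stream(
    DeliveryStreamName="iceberg-events-stream",
    DeliveryStreamType="DirectPut",
    IcebergDestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-iceberg-role",
        # AWS Glue Data Catalog in the same account and Region
        "CatalogConfiguration": {
            "CatalogARN": "arn:aws:glue:us-east-1:111122223333:catalog"
        },
        "DestinationTableConfigurationList": [
            {
                "DestinationDatabaseName": "sales_db",
                "DestinationTableName": "orders",
            }
        ],
        # S3 bucket backing the Iceberg table data and error output
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-iceberg-role",
            "BucketARN": "arn:aws:s3:::my-iceberg-data-lake",
        },
    },
)
```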
Firehose is integrated with over 20 AWS services and supports routing data to different Iceberg tables for data isolation or better query performance.
This post describes how to set up Firehose to deliver data streams into Iceberg tables on Amazon S3 and walks through different scenarios for routing data into Iceberg tables.
For instance, you can route records to different tables based on the content of the incoming data by specifying a JSON Query expression in the 'Database expression' and 'Table expression' fields.
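To make that concrete, here is a hypothetical record together with the kind of expressions you might enter in those fields; the record's field names and the database and table names are invented for illustration, so adapt them to your own schema.

```python
import json

# Hypothetical incoming record (field names are illustrative only).
record = {"event_type": "order", "customer_id": 42, "amount": 19.99}

# In the Firehose console you might then set, for example:
#   Database expression: "sales_db"     (a static, quoted string)
#   Table expression:    .event_type    (evaluated per record -> "order")
# so this record would land in the sales_db.order Iceberg table.
print(json.dumps(record))
```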
Alternatively, you can achieve the same content-based routing with an AWS Lambda function, as described in use case 4.
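As a sketch of that approach, a transformation Lambda function decodes each record, inspects its content, and returns routing metadata alongside the (unchanged) payload. The otfMetadata structure follows the Firehose documentation for routing to Iceberg tables; the inspected field (event_type) and the database and table names are assumptions for illustration.

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose transformation sketch: choose a destination Iceberg table
    from each record's content. Field and table names are illustrative."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Pick a destination table from a field in the record (assumed schema).
        table = "orders" if payload.get("event_type") == "order" else "other_events"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            # Return the payload unchanged, re-encoded as base64.
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
            # Routing metadata read by Firehose for Iceberg destinations.
            "metadata": {
                "otfMetadata": {
                    "destinationDatabaseName": "sales_db",
                    "destinationTableName": table,
                    "operation": "insert",
                }
            },
        })
    return {"records": output}
```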
All of the AWS services used in these examples are serverless, so no infrastructure management is required.
You can query the data you’ve written to Iceberg tables using processing engines such as Apache Spark, Apache Flink, or Trino, or with Amazon Athena.
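For example, one quick way to check delivered records from Python is to run an Athena query against the table; the database, table, and query-results location below are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Placeholder database, table, and query-results location.
response = athena.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```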