Build a data lakehouse in a hybrid environment using Amazon EMR Serverless, Apache DolphinScheduler, and TiDB

Amazon

Building a serverless data lakehouse on AWS combines Amazon EMR Serverless, Amazon Athena, and Amazon S3 with the open-source tools Apache DolphinScheduler and TiDB.
The solution uses TiDB as the on-premises enterprise data warehouse; Amazon EMR Serverless jobs process its data to implement the lakehouse tiering logic.
Tiers such as ODS (operational data store) and ADS (analytical data store) are kept apart in Amazon S3, either in separate buckets or under separate prefixes of the same bucket.
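
One way to picture the tiering layout is a small helper that maps each tier to its own S3 location. This is a minimal sketch, not code from the article: the bucket names, the `TIER_BUCKETS` mapping, the `dt=` partition naming, and the `tier_uri` helper are all hypothetical.

```python
# Hypothetical bucket names; the source does not name the real ones.
TIER_BUCKETS = {
    "ods": "example-lakehouse-ods",  # operational data store tier
    "ads": "example-lakehouse-ads",  # analytical data store tier
}


def tier_uri(tier: str, table: str, partition_date: str) -> str:
    """Return the S3 prefix holding one day's data for a table in a tier."""
    try:
        bucket = TIER_BUCKETS[tier.lower()]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None
    # The dt=YYYY-MM-DD partition folder is an assumed convention.
    return f"s3://{bucket}/{table}/dt={partition_date}/"


if __name__ == "__main__":
    print(tier_uri("ODS", "orders", "2024-01-01"))
```

Keeping the mapping in one place means a job only needs the tier name to decide where to read or write, which would also make it easy to switch from separate buckets to separate prefixes later.
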
Apache DolphinScheduler aids in job orchestration, offering benefits like scalability, task-level controls, and multi-tenancy capabilities.
Running DolphinScheduler requires strong DevOps capabilities, because the team has to set it up and maintain it.
Prerequisites include creating an AWS account, setting up an IAM user, installing DolphinScheduler, configuring IAM for the EMR Serverless job, and provisioning tables in TiDB Cloud.
To synchronize data between the on-premises TiDB database and AWS, TiDB Dumpling exports both historical and incremental data to Amazon S3.
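
The export step could be scripted roughly as follows. This is a sketch under stated assumptions rather than the article's own commands: the host, user, table, bucket, and the `updated_at` column are hypothetical, and selecting incremental rows with Dumpling's `--where` filter is one possible approach. The password flag (`-p`) is left out on purpose and would have to be supplied separately.

```python
import shlex
from typing import List, Optional


def dumpling_command(host: str, port: int, user: str, table: str,
                     s3_uri: str, region: str,
                     where: Optional[str] = None) -> List[str]:
    """Build a `tiup dumpling` command that exports one table to S3 as CSV."""
    cmd = [
        "tiup", "dumpling",
        "-h", host, "-P", str(port), "-u", user,
        "--filetype", "csv",
        "-T", table,            # database.table to export
        "-o", s3_uri,           # S3 output location
        "--s3.region", region,
    ]
    if where:
        # Restrict the export to matching rows (used here for incremental sync).
        cmd += ["--where", where]
    return cmd


if __name__ == "__main__":
    # Historical (full) export: no row filter.
    full = dumpling_command("tidb.example.internal", 4000, "dumper",
                            "sales.orders",
                            "s3://example-bucket/ods/orders/full/", "us-east-1")
    # Incremental export: only rows changed since a given date (assumed column).
    inc = dumpling_command("tidb.example.internal", 4000, "dumper",
                           "sales.orders",
                           "s3://example-bucket/ods/orders/inc/", "us-east-1",
                           where="updated_at >= '2024-01-01'")
    print(shlex.join(full))
    print(shlex.join(inc))
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues; `shlex.join` is only used for display.
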
EMR Serverless jobs are used to sync data between AWS Glue tables and on-premises databases like TiDB.
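
Submitting such a job through the AWS SDK for Python might look like the sketch below. It assumes boto3 is installed and credentials are configured; the application ID, role ARN, script path, job name, and arguments are placeholders, not values from the article. The request is built by a separate function so it can be inspected without calling AWS.

```python
from typing import Any, Dict, List

# Spark setting that points the job at the AWS Glue Data Catalog as its metastore.
GLUE_CATALOG_CONF = (
    "--conf spark.hadoop.hive.metastore.client.factory.class="
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def build_job_run_request(application_id: str, execution_role_arn: str,
                          entry_point: str, args: List[str]) -> Dict[str, Any]:
    """Build the keyword arguments for the EMR Serverless StartJobRun call."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "name": "glue-tidb-sync",  # hypothetical job name
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,            # PySpark script in S3
                "entryPointArguments": list(args),
                "sparkSubmitParameters": GLUE_CATALOG_CONF,
            }
        },
    }


def submit(request: Dict[str, Any]) -> str:
    """Start the job run and return its ID (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("emr-serverless")
    return client.start_job_run(**request)["jobRunId"]
```

A scheduler task would call `submit(build_job_run_request(...))` and keep the returned job run ID so that a later step can poll the run's status.
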
Integrating with DolphinScheduler involves switching the DolphinScheduler Resource Center storage from HDFS to Amazon S3, which improves job status checking and orchestration.
After finishing, clean up by using the AWS APIs to delete the EC2 instances, RDS instances, and EMR Serverless applications created for the solution.
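
The cleanup could be scripted along these lines. This is a minimal sketch assuming boto3 and suitable credentials; every resource ID is a placeholder. The plan is built first as plain data so it can be reviewed before anything is deleted.

```python
from typing import Any, Dict, List, Tuple

# (boto3 service name, client method name, keyword arguments)
Step = Tuple[str, str, Dict[str, Any]]


def cleanup_plan(ec2_instance_ids: List[str], rds_instance_ids: List[str],
                 emr_application_ids: List[str]) -> List[Step]:
    """List the API calls needed to remove the given resources."""
    steps: List[Step] = []
    if ec2_instance_ids:
        steps.append(("ec2", "terminate_instances",
                      {"InstanceIds": list(ec2_instance_ids)}))
    for db_id in rds_instance_ids:
        steps.append(("rds", "delete_db_instance",
                      {"DBInstanceIdentifier": db_id, "SkipFinalSnapshot": True}))
    for app_id in emr_application_ids:
        # An application has to be stopped before it can be deleted.
        steps.append(("emr-serverless", "stop_application",
                      {"applicationId": app_id}))
        steps.append(("emr-serverless", "delete_application",
                      {"applicationId": app_id}))
    return steps


def run_cleanup(steps: List[Step]) -> None:
    """Execute the plan (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the plan can be built without boto3

    for service, method, kwargs in steps:
        # Stopping an application is asynchronous; a real script would wait
        # for the STOPPED state before calling delete_application.
        getattr(boto3.client(service), method)(**kwargs)
```

Printing the result of `cleanup_plan(...)` before passing it to `run_cleanup` gives a simple dry run.
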
