
Amazon

Build a data lakehouse in a hybrid environment using Amazon EMR Serverless, Apache DolphinScheduler, and TiDB

  • Building a serverless data lakehouse on AWS Cloud involves using services like Amazon EMR Serverless, Amazon Athena, Amazon S3, Apache DolphinScheduler, and TiDB.
  • The solution uses TiDB as the on-premises enterprise data warehouse; Amazon EMR Serverless jobs process its data to implement the data lakehouse tiering logic.
  • Tiers such as the ODS (operational data store) and ADS (analytical data store) are stored in separate Amazon S3 buckets.
  • Apache DolphinScheduler aids in job orchestration, offering benefits like scalability, task-level controls, and multi-tenancy capabilities.
  • Configuring DolphinScheduler requires strong DevOps capabilities because of the setup and maintenance effort involved.
  • Prerequisites include an AWS account, an IAM user, a DolphinScheduler installation, IAM configuration for EMR Serverless jobs, and tables provisioned in TiDB Cloud.
  • Data synchronization between on-premises TiDB and AWS involves using TiDB Dumpling to sync historical and incremental data to Amazon S3.
  • EMR Serverless jobs are used to sync data between AWS Glue tables and on-premises databases like TiDB.
  • Integration with DolphinScheduler involves switching DolphinScheduler Resource Center storage from HDFS to Amazon S3 for improved job status checking and orchestration.
  • After implementation, cleaning up is recommended: use AWS APIs to delete the EC2 instances, RDS instances, and EMR Serverless applications.
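
The Dumpling export mentioned above can be sketched as a small command builder. This is only an illustration, not the article's actual script: the host, table, bucket, and region values are hypothetical, and using a `--where` filter on a timestamp column for incremental exports is an assumption. Credentials (the database password and AWS access) are deliberately left out and would need to be supplied separately.

```python
import shlex


def build_dumpling_command(host, port, user, table, s3_uri, region, where=None):
    """Return an argv list that exports one TiDB table to Amazon S3 with Dumpling.

    All argument values are placeholders chosen for illustration.
    """
    argv = [
        "tiup", "dumpling",
        "-h", host,            # TiDB host
        "-P", str(port),       # TiDB port
        "-u", user,            # database user (password not included here)
        "--filetype", "csv",   # export as CSV rather than SQL
        "-T", table,           # table to export, in db.table form
        "-o", s3_uri,          # S3 destination
        "--s3.region", region,
    ]
    if where:
        # Assumption: incremental sync is approximated with a row filter.
        argv += ["--where", where]
    return argv


if __name__ == "__main__":
    cmd = build_dumpling_command(
        "tidb.example.internal", 4000, "root",
        "sales.orders", "s3://example-bucket/ods/orders/", "us-east-1",
        where="updated_at >= '2024-01-01'",
    )
    print(shlex.join(cmd))
```

Returning an argv list rather than a single string keeps the filter expression intact when the command is handed to `subprocess.run`.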
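
As a rough sketch of how an orchestrator task might submit one of the EMR Serverless jobs described above and check its status, the following uses the boto3 `emr-serverless` client. The application ID, role ARN, script location, and arguments are hypothetical placeholders, and the polling loop is one simple possibility rather than the approach the article necessarily takes.

```python
import time

# Job-run states after which polling can stop.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def build_job_run_request(application_id, role_arn, script_uri, args, name):
    """Build the keyword arguments for a Spark job run on EMR Serverless."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "name": name,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,           # PySpark script stored in S3
                "entryPointArguments": list(args),  # passed through to the script
            }
        },
    }


def run_job(request, region, poll_seconds=30):
    """Submit the job and block until it reaches a terminal state.

    Requires boto3 and AWS credentials; imported lazily so the request
    builder can be used without either.
    """
    import boto3

    client = boto3.client("emr-serverless", region_name=region)
    run_id = client.start_job_run(**request)["jobRunId"]
    while True:
        job = client.get_job_run(
            applicationId=request["applicationId"], jobRunId=run_id
        )
        state = job["jobRun"]["state"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    request = build_job_run_request(
        "00example-app-id",
        "arn:aws:iam::111122223333:role/example-emr-serverless-role",
        "s3://example-bucket/scripts/ods_to_ads.py",
        ["--run-date", "2024-01-01"],
        "ods-to-ads",
    )
    print(request)
```

Separating the request builder from the call that talks to AWS makes the job definition easy to inspect or unit test before anything is submitted.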
