Open table formats (OTFs) are critical to transactional data lakes and offer features such as partitioning, schema evolution, time-travel capabilities, and ACID transactions, addressing traditional problems in data lakes.
Apache XTable facilitates seamless conversions between OTFs, eliminating many of the challenges associated with table format conversions.
This post explores how Apache XTable, combined with the AWS Glue Data Catalog, enables background conversions between OTFs on AWS in a scalable and cost-effective way, with minimal or no changes to existing pipelines.
XTable works by translating table metadata using the existing APIs of OTFs, enabling interoperability through commonalities among Hudi, Iceberg, and Delta Lake.
XTable provides two metadata translation methods: Full Sync, which translates all commits, and Incremental Sync, which translates only new, unsynced commits for greater efficiency with large tables.
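The difference between the two methods can be illustrated with a minimal Python sketch. It is not XTable's implementation: the function name `commits_to_translate`, the mode strings, and the fallback to a full sync when the last synced commit is unknown are all assumptions made for illustration.

```python
from typing import List, Optional


def commits_to_translate(
    source_commits: List[str],
    last_synced: Optional[str],
    mode: str = "INCREMENTAL",
) -> List[str]:
    """Return the commits a sync run would translate.

    source_commits is ordered oldest to newest. All names here are
    hypothetical and only illustrate the idea behind the two methods.
    """
    # Full Sync: translate every commit in the source table.
    if mode == "FULL":
        return list(source_commits)
    # Assumption: with no usable sync marker, fall back to a full sync.
    if last_synced is None or last_synced not in source_commits:
        return list(source_commits)
    # Incremental Sync: translate only commits newer than the marker.
    marker_index = source_commits.index(last_synced)
    return source_commits[marker_index + 1:]
```

For a table with commits `c1`, `c2`, and `c3` where `c1` is already synced, the incremental call returns only `c2` and `c3`, whereas a full sync returns all three.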
A Lambda function detects tables for conversion by scanning the Data Catalog for candidates.
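A detector of this kind might look like the following Python sketch, which uses the boto3 Glue paginators `get_databases` and `get_tables`. How candidate tables are marked is not stated here, so the table parameter names `xtable_table_type` and `xtable_target_formats` are hypothetical, as are the function names and the shape of the returned records.

```python
from typing import Any, Dict, List

# Hypothetical table parameters used to mark a table as a conversion candidate.
SOURCE_FORMAT_KEY = "xtable_table_type"
TARGET_FORMATS_KEY = "xtable_target_formats"


def is_candidate(table: Dict[str, Any]) -> bool:
    """A table qualifies when both marker parameters are present and non-empty."""
    params = table.get("Parameters") or {}
    return bool(params.get(SOURCE_FORMAT_KEY)) and bool(params.get(TARGET_FORMATS_KEY))


def find_candidates(glue: Any) -> List[Dict[str, Any]]:
    """Walk every database and table in the Data Catalog and collect candidates."""
    candidates = []
    for db_page in glue.get_paginator("get_databases").paginate():
        for database in db_page["DatabaseList"]:
            table_pages = glue.get_paginator("get_tables").paginate(
                DatabaseName=database["Name"]
            )
            for table_page in table_pages:
                for table in table_page["TableList"]:
                    if not is_candidate(table):
                        continue
                    params = table["Parameters"]
                    targets = params[TARGET_FORMATS_KEY].split(",")
                    candidates.append({
                        "database": database["Name"],
                        "table": table["Name"],
                        "location": table.get("StorageDescriptor", {}).get("Location"),
                        "source_format": params[SOURCE_FORMAT_KEY],
                        "target_formats": [t.strip() for t in targets if t.strip()],
                    })
    return candidates


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without AWS access.
    import boto3

    return {"candidates": find_candidates(boto3.client("glue"))}
```

Passing the Glue client into `find_candidates` keeps the scan logic separate from AWS credentials, so it can be tested with a stub client.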
XTable is focused on achieving feature parity with OTFs' built-in features, including adding capabilities such as support for Merge-on-Read tables, and on syncing table formats across multiple catalogs such as the AWS Glue Data Catalog, Hive, and Unity Catalog.
In practice, XTable can be used in a broad range of analytical workloads, including business intelligence and machine learning. Because the data lake and its OTF tables are stored in Amazon S3, you can take advantage of AWS-native services such as Amazon EMR to process data, Amazon Athena to analyze data, and Amazon SageMaker to build machine learning models.
In this post, the authors demonstrated how to build a background conversion job for OTFs, using XTable and the Data Catalog, that is independent of data pipelines and transformation jobs.
This Lambda-based XTable deployment can be reused in other solutions to allow near real-time conversion of OTFs, invoked by Amazon S3 object events that result from changes to OTF metadata.
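An event-driven variant could start from a handler like the Python sketch below, which reads the standard S3 event notification structure and infers which table changed from the metadata directory in the object key (`.hoodie/` for Hudi, `_delta_log/` for Delta Lake, `metadata/` for Iceberg). The function names are hypothetical, matching on `/metadata/` is a heuristic that assumes the default Iceberg layout, and the step that actually runs the XTable sync is left out.

```python
from typing import Any, Dict
from urllib.parse import unquote_plus

# Metadata directory markers per format. Order matters: Hudi is checked first
# because a Hudi table may also contain a ".hoodie/metadata/" directory.
METADATA_MARKERS = {
    "/.hoodie/": "HUDI",
    "/_delta_log/": "DELTA",
    "/metadata/": "ICEBERG",  # heuristic; assumes the default Iceberg layout
}


def tables_changed(event: Dict[str, Any]) -> Dict[str, str]:
    """Map each changed table's base path to its inferred source format."""
    changed: Dict[str, str] = {}
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = "/" + unquote_plus(record["s3"]["object"]["key"])
        for marker, source_format in METADATA_MARKERS.items():
            position = key.find(marker)
            if position != -1:
                # Everything before the metadata directory is the table base path.
                changed[f"s3://{bucket}{key[:position]}"] = source_format
                break
    return changed


def lambda_handler(event, context):
    changed = tables_changed(event)
    # A real deployment would hand each table to the XTable sync step here.
    return {"tables": changed}
```

Collecting results in a dictionary keyed by base path means several metadata files written in one commit trigger a single conversion per table.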