The AWS Glue Data Catalog supports automatic table optimization of Apache Iceberg tables, including compaction, snapshot retention, and orphan file deletion.
The data compaction optimizer continuously monitors table partitions and starts the compaction process when the thresholds for the number of files and file sizes are exceeded.
Compaction starts, and keeps running, as long as the table or any of its partitions has more than the configured number of files (five by default), each smaller than 75% of the target file size.
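The compaction trigger described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not the optimizer's actual implementation: the function names, the 512 MB target file size, and the reading of the rule as "the count of small files exceeds the threshold" are assumptions.

```python
from typing import Dict, Iterable, List

# Assumed target file size for illustration; the text does not state a value.
DEFAULT_TARGET_FILE_SIZE = 512 * 1024 * 1024
# "Default five files" and "smaller than 75% of the target file size" come from the text.
DEFAULT_FILE_THRESHOLD = 5
SMALL_FILE_RATIO = 0.75


def needs_compaction(
    file_sizes: Iterable[int],
    target_file_size: int = DEFAULT_TARGET_FILE_SIZE,
    file_threshold: int = DEFAULT_FILE_THRESHOLD,
) -> bool:
    """Return True when more than `file_threshold` files are below the small-file cutoff."""
    small_cutoff = SMALL_FILE_RATIO * target_file_size
    small_files = sum(1 for size in file_sizes if size < small_cutoff)
    return small_files > file_threshold


def partitions_to_compact(
    partition_files: Dict[str, List[int]],
    target_file_size: int = DEFAULT_TARGET_FILE_SIZE,
    file_threshold: int = DEFAULT_FILE_THRESHOLD,
) -> List[str]:
    """Return the partitions (hypothetical keys) whose files meet the trigger condition."""
    return [
        partition
        for partition, sizes in partition_files.items()
        if needs_compaction(sizes, target_file_size, file_threshold)
    ]
```

With these defaults, a partition holding six 10 MB files would qualify, while one holding five such files, or six 400 MB files, would not.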
The snapshot retention process runs periodically (daily by default) to identify and remove snapshots that are older than the retention period configured in the table properties.
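As a rough sketch of the retention check, the following Python selects snapshots older than a retention period. The `Snapshot` class, the function name, and the rule that the current snapshot is never expired are assumptions made for illustration; the real process works on Iceberg table metadata and may apply additional rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, simplified view of an Iceberg snapshot."""
    snapshot_id: int
    committed_at: datetime


def expired_snapshots(
    snapshots: Iterable[Snapshot],
    retention: timedelta,
    now: datetime,
    current_snapshot_id: int,
) -> List[Snapshot]:
    """Return snapshots committed before `now - retention`.

    The table's current snapshot is always kept (an assumption of this
    sketch), because the table would be unreadable without it.
    """
    cutoff = now - retention
    return [
        snap
        for snap in snapshots
        if snap.committed_at < cutoff and snap.snapshot_id != current_snapshot_id
    ]
```

A daily run would call `expired_snapshots` with the current time and then remove the returned snapshots from the table.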
Similarly, the orphan file deletion process scans the table metadata and the actual data files, identifies the unreferenced files, and deletes them to reclaim storage space.
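Conceptually, orphan detection is a set difference between the files present in storage and the files referenced by table metadata. The Python below is a minimal sketch under that assumption; the function name, the example paths, and the `min_age` guard (which skips recently written files that an in-flight commit may not have referenced yet) are illustrative and not taken from the service.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def find_orphan_files(
    files_in_storage: Dict[str, datetime],
    referenced_files: Set[str],
    min_age: timedelta,
    now: datetime,
) -> List[str]:
    """Return storage paths that no table metadata references.

    `files_in_storage` maps each path to its last-modified time. Files
    younger than `min_age` are skipped (an assumption of this sketch) so
    that data written by a commit still in progress is not deleted.
    """
    return sorted(
        path
        for path, modified in files_in_storage.items()
        if path not in referenced_files and now - modified >= min_age
    )
```

The returned paths are the candidates for deletion; everything referenced by a live snapshot stays in place.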
To help meet such requirements, the Data Catalog can now run Iceberg table optimization in a VPC that you specify.
By default, a table optimizer is not associated with any of your VPCs and subnets.
With this new support for data access from VPCs, you can associate a table optimizer with an AWS Glue network connection so that it runs in a specific VPC, subnet, and security group.
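The association can also be made programmatically. The sketch below assumes that the boto3 AWS Glue client's `create_table_optimizer` call accepts a `vpcConfiguration` entry with a `glueConnectionName` inside `TableOptimizerConfiguration`; verify the field names against the current API reference before relying on them. The helper names and every argument value are hypothetical placeholders.

```python
from typing import Any, Dict

# Optimizer types as understood from the AWS Glue API; treat as an assumption.
VALID_OPTIMIZER_TYPES = {"compaction", "retention", "orphan_file_deletion"}


def build_optimizer_request(
    catalog_id: str,
    database_name: str,
    table_name: str,
    role_arn: str,
    glue_connection_name: str,
    optimizer_type: str = "compaction",
) -> Dict[str, Any]:
    """Build keyword arguments for glue.create_table_optimizer (hypothetical helper)."""
    if optimizer_type not in VALID_OPTIMIZER_TYPES:
        raise ValueError(f"unsupported optimizer type: {optimizer_type}")
    return {
        "CatalogId": catalog_id,
        "DatabaseName": database_name,
        "TableName": table_name,
        "Type": optimizer_type,
        "TableOptimizerConfiguration": {
            "roleArn": role_arn,
            "enabled": True,
            # The Glue network connection supplies the VPC, subnet, and security group.
            "vpcConfiguration": {"glueConnectionName": glue_connection_name},
        },
    }


def create_optimizer(request: Dict[str, Any]) -> Dict[str, Any]:
    """Send the request with boto3; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    glue = boto3.client("glue")
    return glue.create_table_optimizer(**request)
```

For example, `create_optimizer(build_optimizer_request("123456789012", "my_database", "my_table", "arn:aws:iam::123456789012:role/MyOptimizerRole", "my-network-connection"))` would request a compaction optimizer that runs through the named connection, with all of those values replaced by your own.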
This feature is available today in all AWS Regions where AWS Glue is supported.
The post includes a sample AWS CloudFormation template that enables a quick setup of the solution resources.