AWS Glue Data Catalog now automatically generates statistics for new tables, and these statistics are integrated with the cost-based optimizer (CBO) in Amazon Redshift Spectrum and Amazon Athena.
Table statistics are essential for optimizing queries on large datasets, particularly queries that join multiple datasets.
Data Catalog previously supported collecting table statistics for file formats such as Parquet, ORC, JSON, ION, CSV, and XML, as well as for Apache Iceberg tables.
The latest update lets administrators configure weekly statistics collection across all databases and tables in the catalog, which helps improve cost-efficiency.
The feature also provides flexible per-table controls, so individual data owners can manage table statistics according to their own requirements.
Catalog-level statistics collection can be enabled via the Lake Formation console or the AWS CLI.
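As a rough illustration of the programmatic route, the sketch below builds a catalog-level request with boto3 instead of the console. It rests on several assumptions that should be checked against the current AWS Glue documentation: that the setting is applied through the Glue `update_catalog` call, and that the custom property keys are `ColumnStatistics.RoleArn` and `ColumnStatistics.Enabled`. The function names, account ID, and role ARN are hypothetical.

```python
def build_catalog_stats_request(account_id: str, role_arn: str, enabled: bool = True) -> dict:
    """Build keyword arguments for a catalog-level statistics setting.

    Assumption: the custom property keys below are the ones the service
    expects; verify them in the AWS Glue documentation before use.
    """
    return {
        "CatalogId": account_id,
        "CatalogInput": {
            "Description": "Enable automatic column statistics collection",
            "CatalogProperties": {
                "CustomProperties": {
                    "ColumnStatistics.RoleArn": role_arn,
                    "ColumnStatistics.Enabled": "true" if enabled else "false",
                }
            },
        },
    }


def enable_catalog_stats(account_id: str, role_arn: str) -> dict:
    """Send the request; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works offline

    glue = boto3.client("glue")
    return glue.update_catalog(**build_catalog_stats_request(account_id, role_arn))


if __name__ == "__main__":
    # Hypothetical account ID and role ARN, printed only for inspection.
    print(build_catalog_stats_request(
        "123456789012", "arn:aws:iam::123456789012:role/ExampleGlueStatsRole"))
```

Keeping the request builder separate from the network call makes it possible to review the exact payload before anything is sent.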
With this feature, AWS Glue automatically updates statistics for all columns in each table, sampling 20% of the records to calculate them.
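To make the sampling behavior concrete, here is a minimal sketch of an on-demand run for a single table using the boto3 Glue call `start_column_statistics_task_run`, with `SampleSize` set to 20 to mirror the 20% sample described above. This is not the automated mechanism itself, only a manual equivalent. The helper names, database, table, and role ARN are hypothetical.

```python
from typing import Optional, Sequence


def build_stats_run_request(
    database: str,
    table: str,
    role_arn: str,
    sample_percent: float = 20.0,
    columns: Optional[Sequence[str]] = None,
) -> dict:
    """Build keyword arguments for glue.start_column_statistics_task_run."""
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in the range (0, 100]")
    request = {
        "DatabaseName": database,
        "TableName": table,
        "Role": role_arn,
        "SampleSize": float(sample_percent),  # percentage of rows to sample
    }
    if columns:
        # Omitting ColumnNameList asks for statistics on every column.
        request["ColumnNameList"] = list(columns)
    return request


def start_stats_run(database: str, table: str, role_arn: str) -> str:
    """Start the run and return its ID; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works offline

    glue = boto3.client("glue")
    response = glue.start_column_statistics_task_run(
        **build_stats_run_request(database, table, role_arn))
    return response["ColumnStatisticsTaskRunId"]
```

Sampling trades some accuracy in the statistics for a shorter, cheaper run; a larger `sample_percent` can be passed where more precise statistics matter.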
Individual data owners can set up scheduled collection at the table level and customize the settings for individual tables.
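A table-level schedule could look roughly like the following sketch, which uses the boto3 Glue call `create_column_statistics_task_settings`. The weekly cron expression, sample size, and all names (helpers, database, table, role ARN) are illustrative assumptions rather than values from the original post.

```python
from typing import Optional, Sequence


def build_table_schedule_request(
    database: str,
    table: str,
    role_arn: str,
    schedule: str = "cron(0 2 ? * SUN *)",  # assumed example: Sundays at 02:00 UTC
    sample_percent: Optional[float] = None,
    columns: Optional[Sequence[str]] = None,
) -> dict:
    """Build keyword arguments for glue.create_column_statistics_task_settings."""
    request = {
        "DatabaseName": database,
        "TableName": table,
        "Role": role_arn,
        "Schedule": schedule,
    }
    if sample_percent is not None:
        request["SampleSize"] = float(sample_percent)
    if columns:
        request["ColumnNameList"] = list(columns)
    return request


def schedule_table_stats(database: str, table: str, role_arn: str, **overrides) -> None:
    """Create the table-level settings; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works offline

    glue = boto3.client("glue")
    glue.create_column_statistics_task_settings(
        **build_table_schedule_request(database, table, role_arn, **overrides))
```

For example, a data owner might call `schedule_table_stats("sales_db", "orders", role_arn, sample_percent=50, columns=["customer_id"])` to collect statistics only on a join key, at a higher sample rate than the default.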
This feature helps keep column-level statistics up to date with little manual effort, which in turn improves query processing and cost-efficiency.
Try this feature for your use case, and share your feedback in the comments.