Amazon SageMaker Lakehouse unifies data from Amazon S3 data lakes and Amazon Redshift data warehouses, allowing organizations to build powerful analytics and machine learning applications on a single copy of data.
The AWS Glue Iceberg REST Catalog lets Apache Spark users access SageMaker Lakehouse capabilities, enabling read and write operations against Apache Iceberg tables stored on Amazon S3.
The article provides a solution overview for creating an AWS Glue database and using Apache Spark to work with the AWS Glue Data Catalog.
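The database creation step can also be scripted. The following is a minimal sketch using boto3, where the database name, S3 location, and Region are placeholder values rather than the article's own:

```python
import boto3

# Placeholder Region; use the Region where your Glue Data Catalog lives.
glue = boto3.client("glue", region_name="us-east-1")

# Create a Glue database to hold the Iceberg tables.
glue.create_database(
    DatabaseInput={
        "Name": "iceberg_demo_db",  # hypothetical database name
        "Description": "Database for Iceberg tables in the Glue Data Catalog",
        "LocationUri": "s3://your-bucket/iceberg-demo/",  # hypothetical bucket
    }
)
```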
AWS Lake Formation is used to centrally manage technical metadata for structured and semi-structured datasets, enabling efficient and secure data governance.
Lake Formation permissions are enabled for third-party access, which allows the user to register an S3 bucket with Lake Formation.
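Registering the bucket can be done from the console or programmatically. The following is a minimal sketch using boto3, assuming Lake Formation administrator credentials and a placeholder bucket ARN:

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")  # placeholder Region

# Register the S3 location so Lake Formation can vend credentials for it.
# A custom IAM role can be supplied via RoleArn instead of the service-linked role.
lf.register_resource(
    ResourceArn="arn:aws:s3:::your-bucket/iceberg-demo",  # hypothetical bucket
    UseServiceLinkedRole=True,
)
```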
To let the open source Spark role access the data, grant resource permissions to the spark_role, providing full table access and data location permission.
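A minimal sketch of those grants using boto3, where the account ID, role name, database, and bucket ARN are placeholders:

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")  # placeholder Region
spark_role_arn = "arn:aws:iam::123456789012:role/spark_role"  # hypothetical role ARN

# Grant full access on all tables in the database to the Spark role.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": spark_role_arn},
    Resource={"Table": {"DatabaseName": "iceberg_demo_db", "TableWildcard": {}}},
    Permissions=["ALL"],
)

# Grant permission on the registered S3 data location.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": spark_role_arn},
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::your-bucket/iceberg-demo"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)
```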
A PySpark script is configured to use an AWS Glue Iceberg REST catalog endpoint.
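A minimal sketch of such a configuration is shown below, assuming the Iceberg Spark runtime and AWS bundle are available; the account ID, Region, catalog name, and library versions are placeholders, not values from the article:

```python
from pyspark.sql import SparkSession

# Placeholder values -- replace with your own account ID and Region.
account_id = "123456789012"
region = "us-east-1"

spark = (
    SparkSession.builder.appName("glue-iceberg-rest-demo")
    # Iceberg Spark runtime and AWS bundle; versions shown are illustrative.
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
        "org.apache.iceberg:iceberg-aws-bundle:1.6.1",
    )
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Define an Iceberg catalog named glue_rest backed by the Glue Iceberg REST endpoint.
    .config("spark.sql.catalog.glue_rest", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_rest.type", "rest")
    .config("spark.sql.catalog.glue_rest.uri", f"https://glue.{region}.amazonaws.com/iceberg")
    .config("spark.sql.catalog.glue_rest.warehouse", account_id)
    # Sign REST requests with SigV4 so Lake Formation can authorize them.
    .config("spark.sql.catalog.glue_rest.rest.sigv4-enabled", "true")
    .config("spark.sql.catalog.glue_rest.rest.signing-name", "glue")
    .config("spark.sql.catalog.glue_rest.rest.signing-region", region)
    .config("spark.sql.defaultCatalog", "glue_rest")
    .getOrCreate()
)
```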
Users can validate read/write operations to the Iceberg table on Amazon S3 and view the table's data in Amazon Athena.
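A minimal sketch of that validation from the same Spark session, with the placeholder catalog, database, and table names used above:

```python
# Create an Iceberg table, write a few rows, and read them back through the
# Glue Iceberg REST catalog.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_rest.iceberg_demo_db.customers (
        id BIGINT,
        name STRING
    ) USING iceberg
""")

spark.sql(
    "INSERT INTO glue_rest.iceberg_demo_db.customers VALUES (1, 'Alice'), (2, 'Bob')"
)

spark.sql("SELECT * FROM glue_rest.iceberg_demo_db.customers").show()
```

Because the table is registered in the Glue Data Catalog, the same rows can then be queried in Athena, for example with `SELECT * FROM iceberg_demo_db.customers`.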
The architecture can be adapted to suit users' needs, whether they're running Spark on bare metal servers in their data center, in a Kubernetes cluster, or in any other environment.
The article is written by Raj Ramasubbu, Sr. Analytics Specialist Solutions Architect, Srividya Parthasarathy, Senior Big Data Architect, and Pratik Das, Senior Product Manager, with AWS Lake Formation.