Generating metadata for data assets is time-consuming, but generative AI can automate it. Amazon Bedrock offers a choice of high-performing foundation models (FMs) from providers such as AI21 Labs, Anthropic, and Cohere that can be used for metadata generation. In this solution, the AWS Glue Data Catalog is enriched with dynamic metadata using FMs on Amazon Bedrock and your data documentation.
This post shows two approaches to generating descriptive metadata for tables in the Data Catalog, in-context learning and Retrieval Augmented Generation (RAG), using two different generative AI models available in Amazon Bedrock.
In the in-context learning approach, a model generates the metadata descriptions without documentation, whereas the RAG approach uses external documentation to generate richer and more accurate metadata.
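The difference between the two approaches comes down to what goes into the prompt. The following is a minimal sketch, not the prompt used in this post: the function name `build_prompt`, the wording of the instructions, and the JSON keys are all illustrative assumptions. Without documentation excerpts it produces an in-context learning prompt; with them it produces a RAG-style prompt.

```python
def build_prompt(table_name, columns, doc_chunks=None):
    """Build a metadata-generation prompt (hypothetical helper).

    columns    -- list of (column_name, data_type) pairs
    doc_chunks -- optional list of documentation excerpts; when given,
                  the prompt becomes a RAG-style prompt
    """
    schema = "\n".join(f"- {name} ({dtype})" for name, dtype in columns)
    parts = [
        "Generate descriptive metadata for the table below.",
        # The JSON keys here are an assumption made for illustration.
        'Return JSON with the keys "table_description" and "columns".',
        f"Table: {table_name}",
        "Columns:",
        schema,
    ]
    if doc_chunks:
        # RAG: append retrieved documentation so the model can ground
        # its descriptions in it.
        parts.append("Documentation excerpts:")
        parts.extend(f"[{i + 1}] {chunk}" for i, chunk in enumerate(doc_chunks))
    return "\n".join(parts)


in_context_prompt = build_prompt("orders", [("id", "int"), ("total", "double")])
rag_prompt = build_prompt(
    "orders",
    [("id", "int"), ("total", "double")],
    doc_chunks=["The total column is the order value in USD, including tax."],
)
```

Keeping both variants in one function makes it easy to compare the model output with and without documentation for the same table.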
After ingesting data from a public Amazon Simple Storage Service (Amazon S3) bucket, the approach was first deployed to a small database.
The metadata generation process involves chunking the content of an HTML page of data documentation, generating and storing vector embeddings for those chunks, instructing the model on which information to generate, sending the prompt to the model, and updating the table metadata in the Data Catalog.
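The steps above can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the implementation from this post: the helper names (`chunk_html`, `top_k`, `to_table_input`), the word-based chunk sizes, and the key whitelist are hypothetical, and the embedding vectors are assumed to come from an embedding model that is not called here.

```python
import copy
import math
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text from an HTML page, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def chunk_html(html, chunk_words=50, overlap=10):
    """Split the page text into overlapping chunks of about chunk_words words."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must be smaller than chunk_words")
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.parts).split()
    chunks = []
    for start in range(0, len(words), chunk_words - overlap):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break
    return chunks


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose stored embeddings are closest to the query."""
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Assumption: a subset of the keys accepted in a Glue TableInput. A table
# returned by get_table also carries read-only fields (for example
# DatabaseName and CreateTime) that must not be sent back.
TABLE_INPUT_KEYS = {"Name", "Description", "Owner", "Retention",
                    "StorageDescriptor", "PartitionKeys", "TableType",
                    "Parameters"}


def to_table_input(table, description, column_comments):
    """Build an updated TableInput dict without mutating the original table."""
    table_input = {key: copy.deepcopy(value) for key, value in table.items()
                   if key in TABLE_INPUT_KEYS}
    table_input["Description"] = description
    for col in table_input.get("StorageDescriptor", {}).get("Columns", []):
        if col["Name"] in column_comments:
            col["Comment"] = column_comments[col["Name"]]
    # With boto3 (not imported here), the result could then be passed to
    # glue_client.update_table(DatabaseName=..., TableInput=table_input).
    return table_input
```

In a real deployment the embeddings would typically live in a vector store rather than in Python lists; the in-memory ranking here only shows the retrieval idea.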
The two approaches demonstrate the flexibility of this solution. Employing generative AI to improve and add metadata to existing data assets helps your organization make more informed decisions, drive data-driven innovation, and unlock the full value of its data.