Amazon Macie is a managed service that uses machine learning (ML) and deterministic pattern matching to help discover sensitive data that’s stored in Amazon Simple Storage Service (Amazon S3) buckets.
In this post, we show you how to gain visibility into sensitive data embedded in images that are stored in your S3 buckets by adding a conversion layer that extracts image-based data into a format that Macie supports. The solution also uses the recommended set of managed data identifiers, together with custom data identifiers supported by Macie, to cover most use cases.
The solution is deployed using AWS Serverless Application Model (AWS SAM), which is an open source framework for building serverless applications.
The resulting JSON file from the Amazon Textract job is stored within the same S3 bucket as the original image.
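To illustrate what that JSON output contains, here is a minimal sketch that pulls the detected lines of text out of a Textract result. It assumes the stored file follows the shape of the Amazon Textract text detection response, that is, a top-level `Blocks` list whose entries carry a `BlockType` and, for lines and words, a `Text` field. The function name `extract_lines` and the sample document are hypothetical and are not part of the solution's code.

```python
import json


def extract_lines(textract_json: str) -> list[str]:
    """Return the text of every LINE block in a Textract result document."""
    result = json.loads(textract_json)
    return [
        block["Text"]
        for block in result.get("Blocks", [])
        # WORD blocks repeat the same text, so keep LINE blocks only.
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Hypothetical sample that mimics the assumed response shape.
sample = json.dumps(
    {
        "Blocks": [
            {"BlockType": "PAGE"},
            {"BlockType": "LINE", "Text": "Invoice 1234"},
            {"BlockType": "WORD", "Text": "Invoice"},
            {"BlockType": "LINE", "Text": "Contact: example@example.com"},
        ]
    }
)

print("\n".join(extract_lines(sample)))
```

Because the text ends up in a JSON object in the bucket, Macie can analyze it like any other supported text-based object.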
Macie then scans the bucket for sensitive data based on managed identifiers and your custom data identifiers.
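As a rough illustration of that step, the following sketch builds a request for a one-time Macie sensitive data discovery job with the AWS SDK for Python (Boto3) `create_classification_job` operation. The job name, the helper names `build_macie_job_request` and `start_macie_job`, and the account and bucket values are assumptions made for this example; the deployed solution may configure its job differently.

```python
import uuid


def build_macie_job_request(account_id, bucket_name, custom_identifier_ids=()):
    """Build keyword arguments for macie2.create_classification_job."""
    request = {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "jobType": "ONE_TIME",
        "name": f"image-text-scan-{bucket_name}",  # hypothetical naming scheme
        # Use the set of managed data identifiers that Macie recommends.
        "managedDataIdentifierSelector": "RECOMMENDED",
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket_name]}
            ]
        },
    }
    if custom_identifier_ids:
        request["customDataIdentifierIds"] = list(custom_identifier_ids)
    return request


def start_macie_job(request):
    """Submit the job. Requires boto3, AWS credentials, and Macie enabled."""
    import boto3  # imported lazily so the builder above has no dependencies

    response = boto3.client("macie2").create_classification_job(**request)
    return response["jobId"]


# Placeholder account ID and bucket name, for illustration only.
request = build_macie_job_request("111122223333", "amzn-s3-demo-bucket")
print(request["name"])
```

Keeping the request builder separate from the API call makes it possible to inspect or unit test the job definition without touching an AWS account.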
Keep the language capabilities of Amazon Textract in mind when you use this solution. Amazon Textract can extract printed text and handwriting written in the standard English alphabet and ASCII symbols.
This solution is designed to enable sensitive data discovery of text in image objects within a single S3 bucket. To expand the scope to multiple S3 buckets, you need additional code and permission changes so that the Lambda functions can access and process objects in multiple existing S3 buckets.
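One possible shape for the code side of a multi-bucket extension is sketched below. It assumes the Lambda function is invoked by Amazon S3 event notifications, which this post does not confirm, and it filters incoming records against an allow-list of bucket names. The function name `objects_to_process` and the bucket names are hypothetical. The permission side (IAM policies and bucket notifications for each bucket) is not shown.

```python
from urllib.parse import unquote_plus


def objects_to_process(event, allowed_buckets):
    """Return (bucket, key) pairs from an S3 event for allowed buckets only."""
    targets = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        if bucket in allowed_buckets:
            targets.append((bucket, key))
    return targets


# Hypothetical event with one allowed and one unexpected bucket.
event = {
    "Records": [
        {"s3": {"bucket": {"name": "images-a"}, "object": {"key": "scans/id+card.png"}}},
        {"s3": {"bucket": {"name": "other"}, "object": {"key": "notes.txt"}}},
    ]
}

print(objects_to_process(event, {"images-a", "images-b"}))
```

An explicit allow-list keeps the function from acting on events from buckets it was never meant to handle, even if a notification is misconfigured.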
If you want to extend the benefits of Amazon Macie to scan your databases for sensitive data, you might find these blog posts useful:
In this post, you learned how to enhance the capabilities of Amazon Macie to conduct sensitive data discovery within image files. With this solution, you can extend the benefits of Amazon Macie beyond structured file formats.
If you have feedback about this post, submit comments in the Comments section.