ORiGAMi: A Machine Learning Architecture for the Document Model

MongoDB

ORiGAMi is a Transformer-based architecture designed for supervised learning on semi-structured data, such as JSON documents in a document model database.
It addresses the challenges the ML community faces when working with semi-structured formats rather than traditional tabular data.
The architecture tokenizes documents into key-value pairs and structural tokens, which makes it possible to predict directly from semi-structured documents.
ORiGAMi can be trained on datasets with as few as 200 labeled samples, combining data efficiency with the flexibility of Transformer models.
The resulting token sequences serve as the model's input: the model predicts the next token in the sequence, in a way that ensures only valid documents are generated.
ORiGAMi reformulates classification to predict any field within a document, eliminating the need for separate models or pipelines.
One example use case is user segmentation based on user profiles that contain nested structures such as device history and subscription details.
With ORiGAMi, users can make predictions on raw documents, preserving nested structures and updating predictions as user behavior changes.
The architecture is open-sourced on GitHub and includes command-line interfaces for training models and making predictions.
ORiGAMi offers a path to document-native machine learning and invites users to explore it, contribute to it, and apply it to real-world problems.
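
To make the tokenization and "predict any field" ideas above more concrete, here is a minimal Python sketch. It is not ORiGAMi's actual code or API: the token names (`<START>`, `<OBJ>`, `KEY:...`, `VAL:...`), the helper `tokenize_document`, and the choice to move the target field to the end of the sequence are all assumptions made for illustration. It flattens a nested user profile into key, value, and structural tokens, so that predicting a field becomes predicting the token that follows its key.

```python
from typing import Any, Dict, List, Optional

# Hypothetical structural tokens; the real ORiGAMi vocabulary may differ.
START, END = "<START>", "<END>"
OBJ_OPEN, OBJ_CLOSE = "<OBJ>", "</OBJ>"
ARR_OPEN, ARR_CLOSE = "<ARR>", "</ARR>"


def _emit(value: Any, out: List[str]) -> None:
    """Append tokens for one value, recursing into nested objects and arrays."""
    if isinstance(value, dict):
        out.append(OBJ_OPEN)
        for key, child in value.items():
            out.append(f"KEY:{key}")
            _emit(child, out)
        out.append(OBJ_CLOSE)
    elif isinstance(value, list):
        out.append(ARR_OPEN)
        for item in value:
            _emit(item, out)
        out.append(ARR_CLOSE)
    else:
        # Primitive values become a single value token.
        out.append(f"VAL:{value!r}")


def tokenize_document(doc: Dict[str, Any], target: Optional[str] = None) -> List[str]:
    """Flatten a document into a token sequence.

    If `target` is given, that top-level field is moved to the end, so a
    next-token predictor sees every other field before the target value.
    (Assumption for illustration, not confirmed by the summary above.)
    """
    if target is not None:
        if target not in doc:
            raise KeyError(f"target field {target!r} not in document")
        reordered = {k: v for k, v in doc.items() if k != target}
        reordered[target] = doc[target]
        doc = reordered
    out: List[str] = [START]
    for key, child in doc.items():
        out.append(f"KEY:{key}")
        _emit(child, out)
    out.append(END)
    return out


# Illustrative user profile with nested device history and subscription details.
profile = {
    "segment": "power_user",
    "devices": [{"os": "ios"}, {"os": "android"}],
    "subscription": {"plan": "pro", "active": True},
}

tokens = tokenize_document(profile, target="segment")
context, label = tokens[:-2], tokens[-2]  # everything up to KEY:segment, then its value
print(context)
print(label)
```

Calling `tokenize_document(profile, target="subscription")` instead would set up the same profile for predicting the subscription field. This reflects the summary's point that one model can target any field without a separate pipeline.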
