Biomedica is a large-scale vision-language dataset derived from open scientific literature, introduced to advance biomedical generalist AI.
The dataset contains over 6 million scientific articles, 24 million image-text pairs, and 27 metadata fields, including expert human annotations.
Scalable streaming and search APIs give direct access to the dataset and simplify integration with AI systems.
The dataset's utility is demonstrated by the embedding models, chat-style models, and retrieval-augmented chat agents built on it, which outperform previous open systems.
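
To make the streaming-and-search access pattern mentioned above concrete, here is a minimal sketch of the general idea: consume image-text records lazily and stop as soon as enough matches are found, without loading the whole dataset. It does not use the Biomedica APIs; `stream_records`, `search_captions`, the sample records, and the field names `image_id`, `caption`, and `license` are all hypothetical stand-ins chosen for illustration.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def stream_records() -> Iterator[Record]:
    """Hypothetical stand-in for a remote stream: yields one record at a time."""
    samples = [
        {"image_id": "fig-001", "caption": "Chest X-ray with lobar opacity", "license": "cc-by"},
        {"image_id": "fig-002", "caption": "H&E stained liver biopsy", "license": "cc-by"},
        {"image_id": "fig-003", "caption": "Lateral X-ray of the knee", "license": "cc-by-nc"},
        {"image_id": "fig-004", "caption": "Dental X-ray, panoramic view", "license": "cc-by"},
    ]
    for record in samples:
        yield record


def search_captions(records: Iterable[Record], keyword: str, limit: int) -> List[Record]:
    """Return up to `limit` records whose caption contains `keyword` (case-insensitive).

    The filter is a generator, so records past the last needed match are never read.
    """
    needle = keyword.lower()
    matches = (r for r in records if needle in r["caption"].lower())
    return list(islice(matches, limit))


hits = search_captions(stream_records(), "x-ray", limit=2)
for hit in hits:
    print(hit["image_id"], hit["caption"])
```

Because both the filter and `islice` are lazy, the fourth sample record is never pulled from the stream here. The same property is what would let a real streaming client work over millions of pairs without downloading them first.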