<ul><li>Biomedica, a large-scale vision-language dataset derived from open scientific literature, has been introduced to advance biomedical generalist AI.</li><li>The dataset contains over 6 million scientific articles, 24 million image-text pairs, and 27 metadata fields, including expert human annotations.</li><li>Scalable streaming and search APIs provide access to the dataset and simplify integration with AI systems.</li><li>The dataset's utility has been demonstrated by using it to build embedding models, chat-style models, and retrieval-augmented chat agents that outperform previous open systems.</li></ul>

A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI
