Marktechpost · 1w read
This AI Paper from Salesforce Introduces VLM2VEC and MMEB: A Contrastive Framework and Benchmark for Universal Multimodal Embeddings

  • Multimodal embeddings combine visual and textual data into a single representational space, enabling systems to understand and relate images and language meaningfully.
  • Researchers from Salesforce Research and the University of Waterloo have introduced a new model, VLM2VEC, and a comprehensive benchmark, MMEB.
  • The VLM2VEC framework allows any vision-language model to handle combinations of text and image inputs while following task instructions, improving its generalization capabilities.
  • VLM2VEC achieved high scores across a variety of tasks, outperforming baseline models and demonstrating the effectiveness of contrastive training combined with task-specific instruction embedding.
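The contrastive training mentioned above typically pairs each instruction-conditioned query embedding with its matching target embedding and treats the rest of the batch as negatives. A minimal sketch of such an in-batch contrastive (InfoNCE-style) loss is shown below; the function name, temperature value, and NumPy implementation are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def contrastive_loss(query_emb, target_emb, temperature=0.05):
    """In-batch contrastive loss: row i of query_emb should match
    row i of target_emb; all other rows serve as negatives."""
    # Normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    # Similarity of every query against every target, scaled by temperature
    logits = q @ t.T / temperature
    # Positive pairs sit on the diagonal of the similarity matrix
    idx = np.arange(len(q))
    # Row-wise log-softmax, then negative log-likelihood of the positives
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, idx].mean()

# Toy batch: 4 query/target embedding pairs of dimension 8
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss = contrastive_loss(q, q.copy())  # aligned pairs give a low loss
```

Training with this objective pulls each query toward its paired target and pushes it away from the other targets in the batch, which is what lets a single embedding space serve many tasks when the query also encodes the task instruction.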
