Multimodal embeddings combine visual and textual data into a single representational space, enabling systems to understand and relate images and language meaningfully.
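To make the idea of a shared image-text space concrete, the snippet below uses CLIP via Hugging Face transformers purely as a familiar, publicly available example; it is not the model introduced in this article, and the checkpoint name and toy inputs are assumptions for illustration only.

```python
# Illustrative only: CLIP as a generic example of a shared image-text
# embedding space (not the model discussed in this article).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="red")   # placeholder image
texts = ["a photo of a dog", "a red square"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Both modalities land in the same vector space, so cosine similarity
# between an image embedding and a text embedding is meaningful.
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)   # shape (1, 2): similarity to each caption
```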
Researchers from Salesforce Research and the University of Waterloo have introduced VLM2VEC, a framework for training multimodal embedding models, together with MMEB (Massive Multimodal Embedding Benchmark), which spans classification, visual question answering, retrieval, and visual grounding tasks.
The VLM2VEC framework converts any vision-language model into an embedding model that can process any combination of text and images while following task instructions, improving its ability to generalize across tasks.
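As a rough sketch of what an instruction-conditioned embedding interface looks like, the code below defines a toy stand-in backbone; the TinyVLM class, the byte-level tokenizer, and the last-token pooling are illustrative assumptions, not the authors' implementation. The point is the interface: an instruction plus text and/or an image goes in, a single task-aware vector comes out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVLM(nn.Module):
    """Toy stand-in for a real vision-language backbone (illustration only)."""
    def __init__(self, vocab_size: int = 256, dim: int = 64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.image_proj = nn.Linear(3 * 32 * 32, dim)

    def forward(self, token_ids, image=None):
        seq = self.token_emb(token_ids)                                # (B, T, D)
        if image is not None:
            img_token = self.image_proj(image.flatten(1)).unsqueeze(1)
            seq = torch.cat([img_token, seq], dim=1)                   # prepend image "token"
        return seq

def embed(model, instruction, text="", image=None):
    """Produce one task-aware vector for an (instruction, text, image) input."""
    # Prepending the instruction lets the same backbone emit different
    # embeddings for different tasks (retrieval, classification, VQA, ...).
    prompt = f"{instruction} {text}".strip()
    token_ids = torch.tensor([[ord(c) % 256 for c in prompt]])         # toy byte-level tokenizer
    hidden = model(token_ids, image)                                   # (1, T', D)
    vec = hidden[:, -1, :]                                             # last-token pooling (one common choice)
    return F.normalize(vec, dim=-1)

model = TinyVLM()
query = embed(model, "Represent this question for image retrieval:", "a red bicycle")
candidate = embed(model, "Represent this image:", image=torch.rand(1, 3, 32, 32))
print((query @ candidate.T).item())                                    # cosine similarity score
```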
On MMEB, VLM2VEC outperforms baseline embedding models across these task categories, demonstrating the effectiveness of contrastive training combined with task-specific instruction embeddings.
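For readers unfamiliar with what "contrastive training" means mechanically, here is a minimal sketch of an InfoNCE objective with in-batch negatives; the temperature value and the random tensors standing in for model outputs are illustrative assumptions, not the paper's training configuration.

```python
# Minimal sketch of a contrastive (InfoNCE) objective with in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce(query_emb, target_emb, temperature=0.05):
    """query_emb, target_emb: (B, D), L2-normalized; row i of each is a positive pair."""
    logits = query_emb @ target_emb.T / temperature           # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    # Each query should rank its own target above the other B-1 in-batch negatives.
    return F.cross_entropy(logits, labels)

# Toy usage with random normalized embeddings standing in for model outputs.
q = F.normalize(torch.randn(8, 64), dim=-1)
t = F.normalize(torch.randn(8, 64), dim=-1)
print(info_nce(q, t).item())
```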