Source: Arxiv

Adding simple structure at inference improves Vision-Language Compositionality

  • Dual-encoder Vision-Language Models (VLMs) such as CLIP struggle with compositionality, which hurts their retrieval performance.
  • Prior work has proposed various training methods to improve the vision-language compositionality of these models.
  • This study instead adds simple structure at inference time to address the compositionality issue, without retraining.
  • The proposed method involves dividing images into smaller crops, extracting text segments describing objects, attributes, and relations, and aligning image crops with text segments using a VLM.
  • The final image-text similarity is computed by aggregating the individual similarities of the matched image crops and text segments.
  • The approach is evaluated on popular dual encoder VLMs across controlled and natural datasets for vision-language compositionality, showing consistent performance improvements without additional training.
  • Significant enhancements are observed in attribute-object binding, particularly in the controlled dataset.
  • Analysis reveals the importance of processing image crops for performance gains and highlights areas for further improvement in inference-time techniques.
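The crop-segment matching and aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: the function name `structured_similarity`, the toy 2-D embeddings, and the max-then-mean aggregation are all assumptions; in practice the embeddings would come from a dual-encoder VLM such as CLIP.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def structured_similarity(crop_embs, segment_embs):
    """Aggregate an image-text score from crop/segment alignments.

    crop_embs: embeddings of image crops (hypothetical, from the VLM's
               image encoder applied to each crop).
    segment_embs: embeddings of text segments describing objects,
                  attributes, and relations (from the text encoder).
    Each segment is matched to its best-aligned crop; the final score
    is the mean of those per-segment maxima (an assumed aggregation).
    """
    per_segment = [
        max(cosine(seg, crop) for crop in crop_embs)
        for seg in segment_embs
    ]
    return sum(per_segment) / len(per_segment)

# Toy example: two orthogonal crop embeddings, two matching segments.
crops = [[1.0, 0.0], [0.0, 1.0]]
segments = [[1.0, 0.0], [0.0, 1.0]]
print(structured_similarity(crops, segments))  # → 1.0
```

Because each text segment independently picks its best crop, a caption whose attribute-object pairs match different image regions can still score highly, which is the intuition behind the reported attribute-object binding gains.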
