Dual-encoder vision-language models (VLMs) such as CLIP struggle with compositionality, which limits their retrieval performance. Various training-based methods have been proposed to improve the vision-language compositionality of these models.
This study instead focuses on adding simple structure at inference time to address the compositionality issue.
The proposed method divides the image into smaller crops, extracts text segments describing objects, attributes, and relations from the caption, and aligns image crops with text segments using the VLM. The final image-text similarity is then computed by aggregating the individual similarities of the matched crop-segment pairs.
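As a rough illustration of this scoring scheme, the sketch below assumes generic `encode_image` / `encode_text` functions from a dual-encoder VLM, together with hypothetical `crop_fn` and `segment_fn` helpers for cropping the image and parsing the caption; the greedy crop-segment alignment (max over crops per segment) and the mean aggregation are simplifying assumptions, not necessarily the exact procedure used in the paper.

```python
import numpy as np

def compositional_similarity(image, caption, encode_image, encode_text,
                             crop_fn, segment_fn):
    """Score an image-caption pair by aligning image crops with text segments.

    encode_image / encode_text are the dual encoder's image and text encoders;
    crop_fn and segment_fn are placeholder routines that return a list of image
    crops and a list of object/attribute/relation text segments, respectively.
    """
    crops = crop_fn(image)          # list of image crops (e.g., plus the full image)
    segments = segment_fn(caption)  # list of text segments from the caption

    # Encode and L2-normalize so dot products equal cosine similarities.
    crop_embs = np.stack([encode_image(c) for c in crops])
    seg_embs = np.stack([encode_text(s) for s in segments])
    crop_embs /= np.linalg.norm(crop_embs, axis=1, keepdims=True)
    seg_embs /= np.linalg.norm(seg_embs, axis=1, keepdims=True)

    # Similarity between every text segment and every image crop.
    sim = seg_embs @ crop_embs.T    # shape: (num_segments, num_crops)

    # Align each segment with its best-matching crop, then aggregate into
    # a single image-text score.
    best_per_segment = sim.max(axis=1)
    return best_per_segment.mean()
```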
The approach is evaluated on popular dual-encoder VLMs across controlled and natural vision-language compositionality datasets, showing consistent performance improvements without any additional training.
The gains are especially pronounced for attribute-object binding, particularly on the controlled dataset.
An analysis shows that processing image crops is key to the performance gains and highlights directions for further improving inference-time techniques.