An ideal text-to-image (T2I) retriever should prioritize the specific visual attributes relevant to each query.
CLIP-style retrievers perform poorly on attribute-focused queries because they emphasize global semantics and main subjects while overlooking finer details.
Recent retrievers based on Multimodal Large Language Models (MLLMs) likewise fall short on attribute-focused queries.
We propose promptable image embeddings, which highlight the attributes a query requires, to boost retrieval performance, together with acceleration strategies that improve real-world applicability; a sketch of the idea follows.
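To make the core idea concrete, here is a minimal, hypothetical PyTorch sketch of attribute-conditioned retrieval. The encoder class, its fusion scheme, and all feature shapes are toy assumptions for illustration only, not the actual architecture: the point is simply that the image embedding depends on an attribute prompt, so the same image is represented differently depending on which attributes the query cares about.

```python
import torch
import torch.nn.functional as F

class PromptableImageEncoder(torch.nn.Module):
    """Toy stand-in for a promptable image encoder (hypothetical).

    A real MLLM-based encoder would feed image tokens and the attribute
    prompt through the model and pool a hidden state; here we fuse
    precomputed toy features with two linear projections.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        self.img_proj = torch.nn.Linear(dim, dim)
        self.prompt_proj = torch.nn.Linear(dim, dim)

    def forward(self, image_feat: torch.Tensor, prompt_feat: torch.Tensor) -> torch.Tensor:
        # Condition the image representation on the attribute prompt,
        # then L2-normalize so dot products equal cosine similarity.
        fused = self.img_proj(image_feat) + self.prompt_proj(prompt_feat)
        return F.normalize(fused, dim=-1)

encoder = PromptableImageEncoder()

image_feat = torch.randn(4, 512)    # 4 candidate images (toy features)
prompt_feat = torch.randn(1, 512)   # prompt, e.g. "focus on color and material"
query_emb = F.normalize(torch.randn(1, 512), dim=-1)  # embedded text query

# Embed every candidate under the same attribute prompt, then rank by
# cosine similarity to the query embedding.
img_emb = encoder(image_feat, prompt_feat.expand(4, -1))
scores = (img_emb @ query_emb.T).squeeze(-1)
print(scores.argsort(descending=True))  # attribute-aware ranking
```

Note the cost implied by this design: unlike CLIP-style retrieval, where image embeddings are computed once offline, prompt-conditioned embeddings depend on the query, which is presumably why the acceleration strategies mentioned above matter for practical deployment.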