<ul><li>Visual understanding is inherently contextual, and what we focus on in an image depends on the task at hand.</li><li>FocalLens is a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed through natural language instructions.</li><li>FocalLens outperforms generic visual encoders by better highlighting the visual features of interest.</li><li>FocalLens improves performance on various downstream tasks, including image-image retrieval, image classification, and image-text retrieval.</li></ul>

FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations

Discover more