Visual understanding is inherently contextual, and what we focus on in an image depends on the task at hand.
FocalLens is a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed through natural language instructions.
FocalLens outperforms generic visual encoders by better highlighting the visual features of interest.
FocalLens improves performance on various downstream tasks, including image-image retrieval, image classification, and image-text retrieval.