Llip (Latent Language Image Pretraining) is introduced to model the diversity of captions that could match a single image.
Llip's vision encoder outputs a set of visual mixture tokens that are combined into a final representation by cross-attending to information derived from the text caption.
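The text-conditioned mixing step can be sketched as attention pooling: each visual mixture token is weighted by its similarity to a text-derived query, and the weighted average forms the contextualized image representation. The sketch below is a minimal illustration under that assumption; the function and variable names are hypothetical, not from the Llip codebase.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def mix_visual_features(visual_tokens, text_query):
    """Text-conditioned pooling of visual mixture tokens (illustrative sketch).

    visual_tokens: (K, d) set of tokens output by the vision encoder
    text_query:    (d,)   query vector derived from the caption's text features
    Returns a (d,) representation contextualized on the caption.
    """
    # Scaled dot-product scores between the text query and each visual token
    scores = visual_tokens @ text_query / np.sqrt(visual_tokens.shape[1])
    weights = softmax(scores)      # attention weights over the K tokens
    return weights @ visual_tokens # weighted mixture: one vector per caption

# Toy usage: K=8 mixture tokens of dimension d=16
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))
query = rng.standard_normal(16)
rep = mix_visual_features(tokens, query)
print(rep.shape)  # (16,)
```

Because the pooled representation depends on the caption, different captions for the same image yield different image embeddings, which is the key departure from CLIP's single fixed image vector.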
Llip outperforms non-contextualized baselines like CLIP and SigLIP on various tasks, including zero-shot classification and retrieval.
Llip achieves a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP model by 1.4%.