techminis

A naukri.com initiative

Image Credit: Arxiv

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

  • Llip (Latent Language Image Pretraining) is introduced to model the diversity of captions that could match a given image.
  • Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text.
  • Llip outperforms non-contextualized baselines like CLIP and SigLIP on various tasks, including zero-shot classification and retrieval.
  • Llip achieves a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming similarly sized CLIP by 1.4%.
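The text-conditioned mixing described in the second bullet can be sketched as a single cross-attention readout: the caption embedding acts as a query over the vision encoder's set of visual tokens, and the attention-weighted sum is the final image representation. This is an illustrative NumPy sketch under that assumption; the function names, shapes, and scaling are not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mix_visual_tokens(visual_tokens, text_query):
    """Condition the image representation on the caption.

    visual_tokens: (K, d) set of visual features from the vision encoder
    text_query:    (d,)   query vector derived from the text encoding
    Returns a (d,) caption-conditioned image embedding.
    """
    d = visual_tokens.shape[1]
    scores = visual_tokens @ text_query / np.sqrt(d)  # (K,) similarity per token
    weights = softmax(scores)                         # attention over the K tokens
    return weights @ visual_tokens                    # weighted mix of visual features

rng = np.random.default_rng(0)
K, d = 8, 16
tokens = rng.standard_normal((K, d))   # one image's visual token set
query = rng.standard_normal(d)         # one caption's query vector
rep = mix_visual_tokens(tokens, query)
```

Different captions produce different query vectors, so the same image yields different final representations, which is how a one-to-many image-to-caption mapping is modeled rather than collapsing each image to a single fixed embedding as in CLIP.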
