<ul><li>The Platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase.</li><li>The study investigates whether vision and language embeddings can be matched in a fully unsupervised manner, that is, without parallel data.</li><li>A novel heuristic is introduced for solving the unsupervised matching problem, and it outperforms previous solvers.</li><li>The analysis shows that vision and language representations can be matched without supervision, which opens the possibility of embedding semantic knowledge into other modalities without annotation.</li></ul>
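
To make the unsupervised matching problem concrete, here is a minimal toy sketch in Python. It is not the heuristic introduced in the paper. It assumes that matching can be posed as finding the permutation that best aligns the pairwise cosine-similarity matrices of the two embedding sets, and it solves that by brute force on six synthetic items. The "language" embeddings are a hypothetical stand-in, produced by applying an orthogonal map to the "vision" embeddings and then shuffling them. All function names are illustrative.

```python
import itertools
import math
import random


def cosine_kernel(vectors):
    """Pairwise cosine similarities between all vectors in the list."""
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def matching_cost(kx, ky, perm):
    """Squared Frobenius distance between kx and ky reordered by perm."""
    n = len(perm)
    return sum(
        (kx[i][j] - ky[perm[i]][perm[j]]) ** 2
        for i in range(n)
        for j in range(n)
    )


def blind_match(kx, ky):
    """Exhaustive search: item i on the X side is matched to perm[i] on the Y side.

    Only feasible for tiny n, since it enumerates all n! permutations.
    """
    n = len(kx)
    return min(
        itertools.permutations(range(n)),
        key=lambda p: matching_cost(kx, ky, p),
    )


def toy_problem(n=6, dim=4, seed=0):
    """Synthetic data (assumption): both spaces share the same geometry."""
    rng = random.Random(seed)
    vision = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    # Orthogonal map: reverse the coordinates and flip alternating signs.
    aligned = [
        [(-1) ** k * x for k, x in enumerate(reversed(v))] for v in vision
    ]
    truth = list(range(n))
    rng.shuffle(truth)
    # language[truth[i]] is the counterpart of vision[i].
    language = [None] * n
    for i, j in enumerate(truth):
        language[j] = aligned[i]
    return vision, language, tuple(truth)


vision, language, truth = toy_problem()
kx, ky = cosine_kernel(vision), cosine_kernel(language)
recovered = blind_match(kx, ky)
print("recovered:", recovered)
print("matches ground truth:", recovered == truth)
```

No paired examples are used during the search. Only the internal similarity structure of each embedding set is compared. Real vision and language embeddings agree only approximately, and exhaustive search does not scale beyond a handful of items, which is why a dedicated solver or heuristic is needed in practice.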

It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data
