The platonic representation hypothesis suggests that vision and language embeddings become increasingly similar in structure as model and dataset sizes grow.
The study investigates whether vision and language embeddings can be matched in an unsupervised manner, i.e., without any parallel (paired) data.
A novel heuristic is introduced to solve the unsupervised matching problem, outperforming previous solvers.
The analysis shows that vision and language representations can be matched without supervision, allowing semantic knowledge to be transferred across modalities without annotation.
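To make the matching problem concrete, here is a minimal sketch (not the paper's solver) of one simple baseline heuristic: if two embedding spaces share the same geometry up to a rotation and an unknown permutation, each point's sorted vector of intra-modal cosine similarities is a permutation-invariant descriptor, and points can be matched by nearest neighbor on these descriptors. The toy data, the orthogonal-transform setup, and the greedy matcher are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 8

# Toy "vision" embeddings; "language" embeddings are an orthogonal
# transform of the same points under an unknown permutation -- a
# stand-in for the shared geometry the hypothesis posits.
X = rng.normal(size=(n, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random rotation
perm = rng.permutation(n)                     # unknown correspondence
Y = X[perm] @ Q

def similarity_profiles(Z):
    """Sorted intra-modal cosine-similarity rows: a permutation-
    invariant descriptor of each point's place in the geometry."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = Zn @ Zn.T
    return np.sort(S, axis=1)

Px, Py = similarity_profiles(X), similarity_profiles(Y)

# Greedy nearest-neighbor matching on profile distances
# (a simple baseline, not the paper's heuristic).
cost = np.linalg.norm(Py[:, None, :] - Px[None, :, :], axis=2)
match = cost.argmin(axis=1)  # predicted X index for each Y row

accuracy = (match == perm).mean()
print(accuracy)
```

In this noise-free toy setting the profiles are preserved exactly (orthogonal maps preserve cosine similarity), so recovery is perfect; with real, only approximately isometric embeddings the problem becomes a hard assignment task, which is what motivates stronger solvers.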