Cross-modal embeddings such as CLIP and BLIP have shown promise in aligning representations across modalities, but they may underperform on modality-specific tasks.
Single-modality embeddings excel within their domains but lack cross-modal alignment capabilities.
RP-KrossFuse is proposed as a method to unify cross-modal and single-modality embeddings by integrating them through a random-projection-based Kronecker product.
RP-KrossFuse aims to achieve competitive modality-specific performance while preserving cross-modal alignment, demonstrated through numerical experiments combining CLIP embeddings with uni-modal image and text embeddings.
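The Kronecker-product fusion described above can be illustrated with a minimal sketch. The key property of the Kronecker product of vectors is that the inner product of two fused vectors factorizes into the product of the component inner products, so a similarity computed in the fused space reflects agreement under both embeddings; random projections keep the fused dimension manageable. The function name `rp_kross_fuse`, the projection dimension, and the use of plain Gaussian projections are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def rp_kross_fuse(e1, e2, dim=64, seed=0):
    """Illustrative sketch: fuse two embeddings via random projection
    followed by a Kronecker product. `e1` could be a cross-modal
    (e.g. CLIP) embedding and `e2` a uni-modal embedding."""
    rng = np.random.default_rng(seed)
    # Gaussian random projections to a common dimension (assumption:
    # simple dense projections, scaled for approximate norm preservation)
    P1 = rng.standard_normal((dim, e1.shape[-1])) / np.sqrt(dim)
    P2 = rng.standard_normal((dim, e2.shape[-1])) / np.sqrt(dim)
    z1, z2 = P1 @ e1, P2 @ e2
    # Kronecker product yields a dim*dim fused vector whose inner
    # products factorize: <a (x) b, c (x) d> = <a, c> * <b, d>
    fused = np.kron(z1, z2)
    return fused / np.linalg.norm(fused)

# Example: fuse a 512-d cross-modal embedding with a 768-d uni-modal one
rng = np.random.default_rng(1)
fused = rp_kross_fuse(rng.standard_normal(512), rng.standard_normal(768))
```

The factorization property means cross-modal alignment encoded by one component and modality-specific structure encoded by the other both survive fusion, which is the intuition behind combining the two embedding types.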