Cross-modal embeddings such as CLIP and BLIP have shown promise in aligning representations across modalities, but they may underperform on modality-specific tasks.
Single-modality embeddings excel within their domains but lack cross-modal alignment capabilities.
RP-KrossFuse is proposed as a method to unify cross-modal and single-modality embeddings by integrating them through a random-projection-based Kronecker product.
RP-KrossFuse aims to achieve competitive modality-specific performance while preserving cross-modal alignment, demonstrated through numerical experiments combining CLIP embeddings with uni-modal image and text embeddings.
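The Kronecker-product fusion described above can be illustrated with a minimal sketch. The key property of the Kronecker product of vectors is that the inner product of two fused vectors factorizes into the product of the component inner products, so a similarity computed in the fused space reflects agreement under both embeddings; random projections keep the fused dimension manageable. The function name `rp_kross_fuse`, the projection dimension, and the use of plain Gaussian projections are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def rp_kross_fuse(e1, e2, dim=64, seed=0):
    """Illustrative sketch: fuse two embeddings via random projection
    followed by a Kronecker product. `e1` could be a cross-modal
    (e.g. CLIP) embedding and `e2` a uni-modal embedding."""
    rng = np.random.default_rng(seed)
    # Gaussian random projections to a common dimension (assumption:
    # simple dense projections, scaled for approximate norm preservation)
    P1 = rng.standard_normal((dim, e1.shape[-1])) / np.sqrt(dim)
    P2 = rng.standard_normal((dim, e2.shape[-1])) / np.sqrt(dim)
    z1, z2 = P1 @ e1, P2 @ e2
    # Kronecker product yields a dim*dim fused vector whose inner
    # products factorize: <a (x) b, c (x) d> = <a, c> * <b, d>
    fused = np.kron(z1, z2)
    return fused / np.linalg.norm(fused)

# Example: fuse a 512-d cross-modal embedding with a 768-d uni-modal one
rng = np.random.default_rng(1)
fused = rp_kross_fuse(rng.standard_normal(512), rng.standard_normal(768))
```

The factorization property means cross-modal alignment encoded by one component and modality-specific structure encoded by the other both survive fusion, which is the intuition behind combining the two embedding types.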