Multi-modal learning has achieved remarkable success by integrating information from multiple modalities, outperforming uni-modal approaches on tasks such as recognition and retrieval.
However, current methods struggle to generalize to novel modalities that are unseen during training.
This paper introduces Modality Generalization (MG), the problem of enabling models to generalize to modalities unseen at training time.
The authors also propose a benchmark for MG and identify key directions for future research toward robust, adaptable multi-modal models.