Multi-modal learning has achieved remarkable success by integrating information from various modalities, surpassing uni-modal approaches in tasks like recognition and retrieval.
In real-world scenarios, however, models often encounter novel modalities that were unseen during training, for example because resource and privacy constraints prevent collecting data for every modality, a challenge that existing methods do not adequately address.
This paper introduces Modality Generalization (MG), which aims to enable models to generalize to modalities unseen during training; it defines two cases, Weak MG and Strong MG, and proposes a benchmark for assessing them.
Experiments reveal the complexity of MG, highlight the limitations of current methods, and suggest key research directions for developing multi-modal models that can adapt to unseen modalities.