Multimodal Large Language Models (MLLMs) are increasingly applied to multimodal graph (MMG) learning, where structured graph information is combined with visual and textual node attributes.
MLLMs can enhance graph neural networks (GNNs) through multimodal feature fusion, and they can align multimodal attributes into a form suitable for LLM-based graph reasoning.
Based on how the MLLM is used, MMG learning methods fall into three paradigms: MLLM-as-Encoder (enhancing GNNs with fused multimodal node features), MLLM-as-Aligner (aligning multimodal attributes for LLM-based graph reasoning), and MLLM-as-Predictor (having the MLLM make predictions over the graph directly).
Graph-MLLM is introduced as a benchmark for multimodal graph learning, evaluating the three paradigms across six datasets.
Jointly considering both the visual and textual attributes of nodes improves graph learning, even when pre-trained image-text alignment models such as CLIP are used as the encoders.
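A minimal sketch of this encoder-style setup, assuming a frozen CLIP model, fusion by simple concatenation, and a two-layer GCN classifier (the model names, feature dimensions, and fusion choice are illustrative assumptions, not the benchmark's exact configuration):

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from torch_geometric.nn import GCNConv

# Frozen CLIP used purely as a multimodal node-feature encoder (assumed checkpoint).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_nodes(images, texts):
    """Return fused (image || text) CLIP features, one row per node."""
    img_inputs = processor(images=images, return_tensors="pt")
    txt_inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    img_feat = clip.get_image_features(**img_inputs)   # [N, 512] for ViT-B/32
    txt_feat = clip.get_text_features(**txt_inputs)    # [N, 512]
    return torch.cat([img_feat, txt_feat], dim=-1)     # [N, 1024]

class FusionGCN(torch.nn.Module):
    """Two-layer GCN over the fused CLIP node features (assumed classifier head)."""
    def __init__(self, in_dim=1024, hidden=256, num_classes=10):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, num_classes)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)
```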
Converting visual attributes into textual descriptions further improves graph learning performance compared with feeding in the visual inputs directly.
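One way to realize this conversion, sketched under the assumption that an off-the-shelf image captioning model is used and the caption is simply appended to the node's original text attribute (the captioning checkpoint and prompt template are illustrative, not the benchmark's exact pipeline):

```python
from transformers import pipeline

# Assumed off-the-shelf image-to-text model for captioning node images.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def visual_to_text(node_text: str, node_image) -> str:
    """Fold a generated caption of the node image into the node's text attribute."""
    caption = captioner(node_image)[0]["generated_text"]
    return f"{node_text}\nImage description: {caption}"
```

The resulting text-only attribute can then be handled by any purely textual graph-learning pipeline.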
Fine-tuning MLLMs on the target multimodal graph data can achieve top-tier results, even without providing explicit graph-structure information.
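A sketch of the predictor-style setup this finding points to: each node becomes a standalone supervised fine-tuning sample (image plus text mapped to its label), with no neighborhood information included. The prompt wording and record schema below are assumptions for illustration only:

```python
def node_to_sft_sample(node_text: str, image_path: str, label: str, classes: list[str]) -> dict:
    """Format one node as an (image, prompt, target) fine-tuning record for an MLLM."""
    prompt = (
        "You are given a node from a multimodal graph.\n"
        f"Text attribute: {node_text}\n"
        "The image attribute is attached.\n"
        f"Classify the node into one of: {', '.join(classes)}."
    )
    return {"image": image_path, "prompt": prompt, "target": label}
```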
The presented benchmark aims to provide a fair evaluation framework for MMG learning and encourage further research in the field.