Researchers investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in document summarization.
Existing document understanding benchmarks typically assess LVLMs with question-answer formats, which do not guarantee coverage of long-range dependencies across a document.
A novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench) is introduced, comprising high-quality arXiv papers paired with interleaved multimodal reference summaries aligned with human preferences.
Leading LVLMs struggle to maintain coherence, integrate information accurately, and distinguish between visually similar images, and they lack robustness over long, interleaved contexts.