techminis (A naukri.com initiative)



Image Credit: Arxiv

M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?

  • Researchers investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in document summarization.
  • Existing document understanding benchmarks typically assess LVLMs with question-answer formats, which do not guarantee coverage of long-range dependencies across a document.
  • A novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench) is introduced, which includes high-quality arXiv papers with interleaved multimodal summaries aligned with human preferences.
  • Experiments show that leading LVLMs struggle to integrate information accurately, confuse visually similar images, and fail to maintain coherence and accuracy within long, interleaved contexts.
