techminis (A naukri.com initiative)



Image Credit: Arxiv

M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?

  • Researchers investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in document summarization.
  • Existing document understanding benchmarks typically assess LVLMs with question-answer formats, which do not guarantee coverage of long-range dependencies across a document.
  • A novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench) is introduced, which includes high-quality arXiv papers with interleaved multimodal summaries aligned with human preferences.
  • Experiments show that leading LVLMs struggle to integrate information accurately, confuse visually similar images, and fail to maintain coherence and accuracy within long, interleaved contexts.
