Recent advances in multimodal slow-thinking systems have demonstrated impressive performance on visual reasoning tasks, yet systematic benchmarks for reasoning over text-rich images remain lacking.
The OCR-Reasoning benchmark has been introduced to evaluate Multimodal Large Language Models (MLLMs) on text-rich image reasoning tasks; it comprises 1,069 human-annotated examples spanning a range of reasoning abilities and task types.
Unlike other benchmarks, OCR-Reasoning annotates not only the final answers but also the reasoning processes behind them, enabling a holistic evaluation of a model's problem-solving ability.
Evaluation of state-of-the-art MLLMs on OCR-Reasoning reveals substantial challenges: no model achieves accuracy above 50%, underscoring the difficulty of text-rich image reasoning and the pressing need to address it.
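Because each example pairs a question with an annotated final answer (and a reasoning trace), final-answer accuracy can be computed with a simple matching script. The following is a minimal sketch, assuming a hypothetical JSON layout with "id" and "answer" fields and a loose exact-match rule; the field names, file names, and scoring rule are illustrative assumptions, not OCR-Reasoning's official evaluation protocol.

```python
# Hypothetical sketch: scoring a model's predictions against OCR-Reasoning-style
# annotations. File names, field names ("id", "answer"), and the exact-match rule
# are assumptions for illustration, not the benchmark's official protocol.
import json


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a loose exact-match comparison."""
    return " ".join(text.lower().split())


def accuracy(examples: list[dict], predictions: dict[str, str]) -> float:
    """Fraction of examples whose predicted final answer matches the annotation."""
    correct = 0
    for ex in examples:
        pred = predictions.get(ex["id"], "")
        if normalize(pred) == normalize(ex["answer"]):
            correct += 1
    return correct / len(examples) if examples else 0.0


if __name__ == "__main__":
    # Each example is assumed to carry an id and an annotated final answer;
    # the benchmark additionally annotates the reasoning process, which a
    # fuller evaluation would also score.
    with open("ocr_reasoning_examples.json") as f:
        examples = json.load(f)
    with open("model_predictions.json") as f:  # {"example_id": "predicted answer", ...}
        predictions = json.load(f)
    print(f"Final-answer accuracy: {accuracy(examples, predictions):.1%}")
```

A complete evaluation would also assess the annotated reasoning processes (for example, via rubric- or model-based judging), which this final-answer-only sketch does not attempt.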