Recent advances in multimodal slow-thinking systems have demonstrated impressive performance on visual reasoning tasks, yet systematic benchmarks for reasoning over text-rich images remain lacking.
The OCR-Reasoning benchmark has been introduced to evaluate Multimodal Large Language Models (MLLMs) on text-rich image reasoning tasks; it comprises 1,069 human-annotated examples spanning a range of reasoning abilities and task types.
Unlike other benchmarks, OCR-Reasoning annotates not only the final answers but also the reasoning processes behind them, enabling a holistic evaluation of a model's problem-solving ability.
Evaluation of state-of-the-art MLLMs on OCR-Reasoning reveals substantial challenges: no model achieves accuracy above 50%, underscoring the difficulty of text-rich image reasoning and the pressing need to address it.
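Because each example pairs a question with an annotated final answer (and a reasoning trace), final-answer accuracy can be computed with a simple matching script. The following is a minimal sketch, assuming a hypothetical JSON layout with "id" and "answer" fields and a loose exact-match rule; the field names, file names, and scoring rule are illustrative assumptions, not OCR-Reasoning's official evaluation protocol.

```python
# Hypothetical sketch: scoring a model's predictions against OCR-Reasoning-style
# annotations. File names, field names ("id", "answer"), and the exact-match rule
# are assumptions for illustration, not the benchmark's official protocol.
import json


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a loose exact-match comparison."""
    return " ".join(text.lower().split())


def accuracy(examples: list[dict], predictions: dict[str, str]) -> float:
    """Fraction of examples whose predicted final answer matches the annotation."""
    correct = 0
    for ex in examples:
        pred = predictions.get(ex["id"], "")
        if normalize(pred) == normalize(ex["answer"]):
            correct += 1
    return correct / len(examples) if examples else 0.0


if __name__ == "__main__":
    # Each example is assumed to carry an id and an annotated final answer;
    # the benchmark additionally annotates the reasoning process, which a
    # fuller evaluation would also score.
    with open("ocr_reasoning_examples.json") as f:
        examples = json.load(f)
    with open("model_predictions.json") as f:  # {"example_id": "predicted answer", ...}
        predictions = json.load(f)
    print(f"Final-answer accuracy: {accuracy(examples, predictions):.1%}")
```

A complete evaluation would also assess the annotated reasoning processes (for example, via rubric- or model-based judging), which this final-answer-only sketch does not attempt.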