Large vision-language models (LVLMs), which integrate vision encoders with language models, rely on chain-of-thought (CoT) prompting for multi-modal reasoning.
However, existing LVLMs often fail to incorporate the contents of their own generated rationales into subsequent reasoning steps, which degrades both visual grounding and answer accuracy.
To address this, researchers propose rationale-enhanced decoding (RED), an inference-time strategy for improving multi-modal CoT reasoning.
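This summary does not spell out RED's mechanics, so the following Python sketch is only one plausible reading: it assumes RED fuses a rationale-conditioned next-token distribution with an image-conditioned one multiplicatively (adding log-probabilities), so that answer tokens must agree with both the image and the generated rationale. The `model` interface (HuggingFace-style `.logits`), the weight `alpha`, the `eos_id`, and the two-pass greedy loop are all illustrative assumptions, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F


def red_next_token_logprobs(logits_with_rationale, logits_image_only, alpha=1.0):
    """Fuse two next-token distributions multiplicatively (add log-probs).

    `alpha` is a hypothetical knob weighting the rationale-conditioned term.
    """
    logp_rat = F.log_softmax(logits_with_rationale, dim=-1)
    logp_img = F.log_softmax(logits_image_only, dim=-1)
    return F.log_softmax(logp_img + alpha * logp_rat, dim=-1)  # renormalize


@torch.no_grad()
def red_greedy_decode(model, image_question_ids, rationale_ids,
                      max_new_tokens=32, eos_id=2, alpha=1.0):
    """Greedy answer decoding (batch size 1) that runs the LVLM twice per
    step: once with the rationale in context, once without, then fuses."""
    ctx_full = torch.cat([image_question_ids, rationale_ids], dim=-1)
    ctx_img = image_question_ids.clone()
    answer = []
    for _ in range(max_new_tokens):
        logits_full = model(ctx_full).logits[:, -1, :]  # p(y_t | image, q, rationale, y_<t)
        logits_img = model(ctx_img).logits[:, -1, :]    # p(y_t | image, q, y_<t)
        logp = red_next_token_logprobs(logits_full, logits_img, alpha)
        tok = logp.argmax(dim=-1, keepdim=True)         # greedy pick from fused dist
        if tok.item() == eos_id:
            break
        answer.append(tok.item())
        ctx_full = torch.cat([ctx_full, tok], dim=-1)   # extend both contexts
        ctx_img = torch.cat([ctx_img, tok], dim=-1)
    return answer
```

Under these assumptions, setting `alpha=0` recovers standard image-only decoding, which makes the rationale's contribution easy to ablate.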
Extensive experiments show that RED significantly improves reasoning over standard CoT prompting and other decoding methods across LVLMs, yielding more faithful use of rationales and higher answer accuracy.