Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data but struggle with complex reasoning.
Reinforcement learning (RL) can boost reasoning in LLMs, but applying it to MLLMs is challenging due to issues such as degraded performance on general tasks and overthinking, where the model produces needlessly long reasoning chains.
A new approach, Asymmetric Policy Optimization (APO), is proposed to enhance the reasoning abilities of MLLMs by tackling three issues: the constraining effect of the KL penalty, overthinking, and overly verbose responses.
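For context, methods in this line typically start from a KL-penalized, group-relative policy-gradient objective. The sketch below is an illustrative assumption of that standard baseline, not the APO objective itself; the tensor names, the k3 KL estimator, and the `beta` coefficient are hypothetical placeholders.

```python
import torch

def kl_penalized_pg_loss(
    logprobs: torch.Tensor,      # (G, T) token log-probs under the current policy
    old_logprobs: torch.Tensor,  # (G, T) token log-probs under the behavior policy
    ref_logprobs: torch.Tensor,  # (G, T) token log-probs under the frozen reference model
    rewards: torch.Tensor,       # (G,) scalar reward per sampled response in the group
    mask: torch.Tensor,          # (G, T) 1 for response tokens, 0 for padding
    beta: float = 0.04,          # weight of the KL penalty toward the reference model (assumed value)
) -> torch.Tensor:
    # Group-relative advantage: standardize rewards within the G sampled responses.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)           # (G,)
    ratio = torch.exp(logprobs - old_logprobs)                          # importance ratio, (G, T)
    pg = -(ratio * adv.unsqueeze(1))                                    # policy-gradient term
    # Per-token KL estimate against the reference policy (k3 estimator),
    # which is the term a method like APO would reshape or apply asymmetrically.
    kl = torch.exp(ref_logprobs - logprobs) - (ref_logprobs - logprobs) - 1.0
    per_token = (pg + beta * kl) * mask
    return per_token.sum() / mask.sum().clamp(min=1.0)
```

In this standard form, the same KL penalty is applied uniformly to every sampled response; the summary above indicates APO instead modifies how this penalty and the response-length behavior are handled, though the exact formulation is not given here.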
Applying APO produced View-R1-3B, which achieved a significant 7% gain in reasoning performance over its base model, outperformed larger MLLMs on various reasoning benchmarks, and maintained stable performance on general tasks.