Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data but struggle with complex reasoning.
Reinforcement learning (RL) can boost reasoning in LLMs, but applying it to MLLMs is challenging due to issues such as degraded performance on general tasks and overthinking, where the model produces needlessly long reasoning chains.
A new approach, Asymmetric Policy Optimization (APO), is proposed to enhance the reasoning abilities of MLLMs by tackling three issues: the constraining effect of the KL penalty, overthinking, and overly verbose responses.
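For context, methods in this line typically start from a KL-penalized, group-relative policy-gradient objective. The sketch below is an illustrative assumption of that standard baseline, not the APO objective itself; the tensor names, the k3 KL estimator, and the `beta` coefficient are hypothetical placeholders.

```python
import torch

def kl_penalized_pg_loss(
    logprobs: torch.Tensor,      # (G, T) token log-probs under the current policy
    old_logprobs: torch.Tensor,  # (G, T) token log-probs under the behavior policy
    ref_logprobs: torch.Tensor,  # (G, T) token log-probs under the frozen reference model
    rewards: torch.Tensor,       # (G,) scalar reward per sampled response in the group
    mask: torch.Tensor,          # (G, T) 1 for response tokens, 0 for padding
    beta: float = 0.04,          # weight of the KL penalty toward the reference model (assumed value)
) -> torch.Tensor:
    # Group-relative advantage: standardize rewards within the G sampled responses.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)           # (G,)
    ratio = torch.exp(logprobs - old_logprobs)                          # importance ratio, (G, T)
    pg = -(ratio * adv.unsqueeze(1))                                    # policy-gradient term
    # Per-token KL estimate against the reference policy (k3 estimator),
    # which is the term a method like APO would reshape or apply asymmetrically.
    kl = torch.exp(ref_logprobs - logprobs) - (ref_logprobs - logprobs) - 1.0
    per_token = (pg + beta * kl) * mask
    return per_token.sum() / mask.sum().clamp(min=1.0)
```

In this standard form, the same KL penalty is applied uniformly to every sampled response; the summary above indicates APO instead modifies how this penalty and the response-length behavior are handled, though the exact formulation is not given here.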
Applying APO produced View-R1-3B, which achieved a significant 7% gain in reasoning performance over its base model, outperformed larger MLLMs on various reasoning benchmarks, and maintained stable performance on general tasks.