A new study introduces Modality-Balancing Preference Optimization (MBPO) to address modality imbalance in Large Multimodal Models (LMMs).
MBPO generates hard negatives to counter biases in Large Language Model (LLM) backbones and incorporates online responses with verified rewards using Group Relative Policy Optimization (GRPO).
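The summary does not spell out MBPO's exact training objective, so the following is only a rough, hypothetical sketch of the two ingredients it names: a preference loss over generated hard negatives (shown here in a DPO-style form) and a GRPO-style group-relative advantage computed from verified rewards on online responses. All function names, hyperparameters, and numbers below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss: prefer the visually grounded response over the
    hard-negative response that follows the LLM backbone's language prior.
    (Illustrative stand-in for MBPO's offline preference term.)"""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize verified rewards within a group of
    online responses sampled for the same prompt."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Toy usage with made-up log-probabilities and binary verified rewards.
offline = preference_loss(torch.tensor([-12.3]), torch.tensor([-11.9]),
                          torch.tensor([-12.8]), torch.tensor([-11.5]))
online_adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(offline.item(), online_adv)
```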
The method aims to improve reasoning capabilities in LMMs and reduce hallucinations by curbing the tendency of language priors to dominate visual inputs.
Experiments show that MBPO enhances performance on vision-language tasks and effectively mitigates modality imbalance in LMMs.