Reinforcement learning (RL) has proven effective at improving large language models (LLMs) on tasks such as math reasoning and code generation.
However, applying RL to visual perception in vision-language models (VLMs) has been challenging due to the lack of difficult yet easily verifiable vision-centric tasks.
ViCrit (Visual Caption Hallucination Critic) is introduced as an RL proxy task for training VLMs to locate subtle visual errors injected into human-written image captions.
The task injects a minor visual description error (e.g., a wrong object, attribute, count, or spatial relation) into a paragraph-length caption and asks the model to identify the corrupted span given the modified caption and the image; a sketch of the corruption step follows.
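As a concrete illustration, the snippet below sketches one possible corruption step. The substitution table and function name are hypothetical stand-ins, not the paper's actual pipeline, which constructs errors from human-written captions at scale.

```python
# Minimal sketch of a ViCrit-style corruption step (all names hypothetical):
# swap one visual detail in a human-written caption and record the span.

# Hypothetical substitution table covering attribute/relation/count/object swaps.
SWAPS = {"red": "blue", "left": "right", "two": "three", "dog": "cat"}

def inject_visual_error(caption: str):
    """Replace the first swappable word, returning the corrupted caption
    and the (original, corrupted) span the model must later recover."""
    words = caption.split()
    for i, w in enumerate(words):
        if w.lower() in SWAPS:
            original, words[i] = w, SWAPS[w.lower()]
            return " ".join(words), (original, words[i])
    return caption, None  # nothing swappable: leave the caption intact

corrupted, span = inject_visual_error("A red kayak drifts left of two herons.")
print(corrupted)  # "A blue kayak drifts left of two herons."
print(span)       # ("red", "blue")
```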
ViCrit thus pairs a perceptually demanding task with a binary, exact-match reward that is cheap to compute and unambiguous.
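The reward itself can be stated in a few lines. The following sketch assumes a simple string-normalization scheme; the paper's implementation may normalize or match spans differently.

```python
# Minimal sketch of the binary exact-match reward (interface assumed):
# 1.0 if the predicted span matches the injected span after light
# normalization, else 0.0.
def vicrit_reward(predicted_span: str, injected_span: str) -> float:
    normalize = lambda s: " ".join(s.lower().strip().split())
    return 1.0 if normalize(predicted_span) == normalize(injected_span) else 0.0

print(vicrit_reward("blue", "Blue"))        # 1.0: matches after normalization
print(vicrit_reward("blue kayak", "blue"))  # 0.0: span must match exactly
```

Because the reward is binary and verifiable, it plugs directly into standard RL pipelines for LLMs without a learned reward model.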
Models trained with the ViCrit task show significant improvements across various vision-language benchmarks, including abstract image reasoning and visual math.
These improvements transfer beyond the natural-image training data, suggesting that the models learn to perceive objects rather than merely memorize them.
ViCrit-Bench is introduced as a companion diagnostic benchmark that evaluates perception errors with balanced coverage across image domains and error types.
Results suggest that detailed hallucination criticism is an effective and transferable objective for enhancing visual perception in VLMs.