Reinforcement learning (RL) has proven effective at improving large language models (LLMs) on tasks such as math reasoning and code generation.
However, applying RL to visual perception in vision-language models (VLMs) has been challenging due to the lack of difficult yet easily verifiable vision-centric tasks.
ViCrit (Visual Caption Hallucination Critic) is introduced as an RL proxy task for training VLMs to locate subtle visual errors injected into human-written image captions.
The task injects a minor visual description error (e.g., a wrong object, attribute, count, or spatial relation) into a paragraph-length caption and asks the model to identify the corrupted span given the modified caption and the image; a sketch of the corruption step follows.
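As a concrete illustration, the snippet below sketches one possible corruption step. The substitution table and function name are hypothetical stand-ins, not the paper's actual pipeline, which constructs errors from human-written captions at scale.

```python
# Minimal sketch of a ViCrit-style corruption step (all names hypothetical):
# swap one visual detail in a human-written caption and record the span.

# Hypothetical substitution table covering attribute/relation/count/object swaps.
SWAPS = {"red": "blue", "left": "right", "two": "three", "dog": "cat"}

def inject_visual_error(caption: str):
    """Replace the first swappable word, returning the corrupted caption
    and the (original, corrupted) span the model must later recover."""
    words = caption.split()
    for i, w in enumerate(words):
        if w.lower() in SWAPS:
            original, words[i] = w, SWAPS[w.lower()]
            return " ".join(words), (original, words[i])
    return caption, None  # nothing swappable: leave the caption intact

corrupted, span = inject_visual_error("A red kayak drifts left of two herons.")
print(corrupted)  # "A blue kayak drifts left of two herons."
print(span)       # ("red", "blue")
```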
ViCrit thus pairs a perceptually demanding task with a binary, exact-match reward that is cheap to compute and unambiguous.
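The reward itself can be stated in a few lines. The following sketch assumes a simple string-normalization scheme; the paper's implementation may normalize or match spans differently.

```python
# Minimal sketch of the binary exact-match reward (interface assumed):
# 1.0 if the predicted span matches the injected span after light
# normalization, else 0.0.
def vicrit_reward(predicted_span: str, injected_span: str) -> float:
    normalize = lambda s: " ".join(s.lower().strip().split())
    return 1.0 if normalize(predicted_span) == normalize(injected_span) else 0.0

print(vicrit_reward("blue", "Blue"))        # 1.0: matches after normalization
print(vicrit_reward("blue kayak", "blue"))  # 0.0: span must match exactly
```

Because the reward is binary and verifiable, it plugs directly into standard RL pipelines for LLMs without a learned reward model.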
Models trained with the ViCrit task show significant improvements across various vision-language benchmarks, including abstract image reasoning and visual math.
These improvements transfer beyond the natural-image training data, suggesting that the models learn to perceive objects rather than merely memorize them.
ViCrit-Bench is introduced as a companion diagnostic benchmark that evaluates perception errors with balanced coverage across image domains and error types.
Results suggest that detailed hallucination criticism is an effective and transferable objective for enhancing visual perception in VLMs.