ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

  • Reinforcement learning (RL) has been successful in improving large language models (LLMs) on tasks such as math reasoning and code generation.
  • However, applying RL to visual perception in vision-language models (VLMs) has been challenging due to the lack of difficult yet easily verifiable vision-centric tasks.
  • ViCrit (Visual Caption Hallucination Critic) is introduced as an RL proxy task for training VLMs to locate subtle visual errors injected into human-written image captions.
  • The task injects a single, subtle visual description error into a paragraph-length human-written caption and asks the model to pinpoint the corrupted span given the image and the modified caption.
  • This formulation preserves the full perceptual difficulty of the caption while providing a binary, exact-match reward that is easy to compute and unambiguous (see the sketch after this list).
  • Models trained with the ViCrit task show significant improvements across various vision-language benchmarks, including abstract image reasoning and visual math.
  • The improvements from ViCrit extend beyond natural-image training data, suggesting the models learn to genuinely perceive rather than merely memorize seen objects.
  • ViCrit-Bench is introduced as a diagnostic benchmark to evaluate perception errors across different image domains and error types in a balanced manner.
  • Results suggest that detailed hallucination criticism is an effective and transferable objective for enhancing visual perception in VLMs.
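To make the reward concrete, here is a minimal Python sketch of how a binary, exact-match reward of the kind described above could be computed. The function name vicrit_reward, the whitespace/case normalization, and the example spans are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a ViCrit-style binary reward: the model earns 1.0
# only if it pinpoints the corrupted span exactly, otherwise 0.0.
# Normalization (lowercasing, whitespace collapsing) is an assumption here.

def vicrit_reward(predicted_span: str, corrupted_span: str) -> float:
    """Return 1.0 if the predicted span exactly matches the injected error span."""
    normalize = lambda s: " ".join(s.lower().split())
    return 1.0 if normalize(predicted_span) == normalize(corrupted_span) else 0.0


# Illustrative example: suppose the injected error changed "red bicycle"
# to "blue bicycle", so the corrupted span is "blue bicycle".
print(vicrit_reward("blue bicycle", "blue bicycle"))  # 1.0 (exact match)
print(vicrit_reward("red bicycle", "blue bicycle"))   # 0.0 (missed the error)
```

Because the reward is a simple string comparison against a known injected span, it is cheap to verify at scale, which is what makes the task usable as an RL training signal.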
