A new benchmark, ViLP, is introduced to investigate how strongly Vision-Language Models (VLMs) rely on visual language priors.
ViLP consists of out-of-distribution images paired with Q&A items that can only be answered through genuine visual reasoning, not by falling back on textual priors.
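As a rough illustration only, the sketch below shows what a single image-grounded Q&A item and a simple accuracy computation might look like; the field names and the exact-match scoring rule are assumptions for this sketch, not the actual ViLP schema or metric.

```python
from dataclasses import dataclass

@dataclass
class VQAExample:
    """One hypothetical benchmark item: an out-of-distribution image with a Q&A pair."""
    image_path: str   # image chosen to contradict common textual priors
    question: str
    answer: str       # answerable only by actually looking at the image

def accuracy(predictions: dict[str, str], examples: list[VQAExample]) -> float:
    """Fraction of items whose predicted answer exactly matches the gold answer."""
    if not examples:
        return 0.0
    correct = sum(
        predictions.get(ex.image_path, "").strip().lower() == ex.answer.strip().lower()
        for ex in examples
    )
    return correct / len(examples)
```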
A self-improving framework is also proposed: the VLM generates new VQA data, and image corruptions are applied to that data so that training pushes the model to ground its answers in the actual visual input rather than in textual priors.
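The sketch below illustrates the corruption idea under stated assumptions; it is not the paper's implementation. The specific corruptions (Gaussian pixel noise, heavy blur), the `model.generate` call, and the chosen/rejected pair format are all hypothetical placeholders.

```python
# Minimal sketch: build training pairs where an answer is preferred when grounded
# in the clean image and rejected when paired with a corrupted version of it.
import random

import numpy as np
from PIL import Image, ImageFilter


def corrupt_image(img: Image.Image) -> Image.Image:
    """Degrade the visual signal so the answer can no longer be read off the image."""
    if random.random() < 0.5:
        # Additive Gaussian pixel noise (assumed corruption choice).
        arr = np.asarray(img).astype(np.float32)
        arr += np.random.normal(0.0, 40.0, arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    # Heavy Gaussian blur (assumed corruption choice).
    return img.filter(ImageFilter.GaussianBlur(radius=8))


def build_preference_pair(model, image: Image.Image, question: str) -> dict:
    """Create a (chosen, rejected) pair from self-generated VQA data.

    The same answer is marked as chosen with the clean image and rejected with
    the corrupted one, so preference training emphasizes the actual visual input.
    `model.generate(image, question)` is an assumed API, not a real library call.
    """
    answer = model.generate(image, question)
    return {
        "question": question,
        "chosen": {"image": image, "answer": answer},
        "rejected": {"image": corrupt_image(image), "answer": answer},
    }
```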