The rise of vision foundation models (VFMs) has created a need for systematic evaluation. A common approach pairs a VFM with a large language model (LLM) and evaluates the combination on Visual Question Answering (VQA) benchmarks, but this setup has blind spots.
To address these gaps, AVA-Bench is introduced as the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs). These AVAs are foundational skills, such as localization, depth estimation, and spatial understanding, that collectively support complex visual reasoning tasks.
By decoupling the AVAs and matching training and test distributions, the benchmark pinpoints exactly where each VFM is strong or weak.
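To make the decoupled, distribution-matched evaluation concrete, here is a minimal sketch: for each AVA, a lightweight probe is trained on that AVA's own training split and scored on a test split drawn from the same distribution. The logistic-regression probe, synthetic features, and AVA names are illustrative assumptions, not the AVA-Bench protocol itself, which pairs VFM features with an LLM decoder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def evaluate_ava(features_train, labels_train, features_test, labels_test):
    """Train a lightweight probe on one AVA's training split and report test accuracy."""
    probe = LogisticRegression(max_iter=1000).fit(features_train, labels_train)
    return probe.score(features_test, labels_test)

# Toy stand-ins for frozen VFM features on two AVAs; train and test splits
# come from the same synthetic distribution, mirroring the matched-distribution idea.
per_ava_scores = {}
for ava in ["localization", "spatial_understanding"]:
    X = rng.normal(size=(200, 16))      # placeholder "VFM features"
    y = (X[:, 0] > 0).astype(int)       # placeholder labels for this AVA
    per_ava_scores[ava] = evaluate_ava(X[:150], y[:150], X[150:], y[150:])

print(per_ava_scores)  # one score per AVA, isolating that single ability
```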
Applied to leading VFMs, AVA-Bench reveals distinct "ability fingerprints," turning VFM selection from guesswork into a more principled, evidence-based choice.
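An "ability fingerprint" can be thought of as the vector of per-AVA accuracies for one model. The sketch below aggregates per-example correctness into such a vector; the `results` record format and the toy numbers are assumptions for illustration, not the official AVA-Bench data schema.

```python
from collections import defaultdict

def ability_fingerprint(results):
    """Aggregate per-example correctness into one accuracy score per AVA.

    `results` is assumed to be a list of dicts like
    {"ava": "depth_estimation", "correct": True}.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["ava"]] += 1
        hits[r["ava"]] += int(r["correct"])
    return {ava: hits[ava] / totals[ava] for ava in totals}

# Example: two VFMs with complementary strengths show different fingerprints.
vfm_a = ability_fingerprint([
    {"ava": "localization", "correct": True},
    {"ava": "localization", "correct": False},
    {"ava": "depth_estimation", "correct": True},
])
vfm_b = ability_fingerprint([
    {"ava": "localization", "correct": True},
    {"ava": "localization", "correct": True},
    {"ava": "depth_estimation", "correct": False},
])
print(vfm_a)
print(vfm_b)
```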
Moreover, a 0.5B LLM yields VFM rankings similar to those of a 7B LLM while cutting GPU hours by roughly 8x, enabling more efficient evaluation.
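One way to verify that the cheaper 0.5B decoder preserves the ordering of VFMs is to compute a rank correlation between the two sets of scores, as in the sketch below. The model names and scores are made-up placeholders, and Spearman correlation via SciPy is an assumed choice of agreement measure rather than the paper's reported metric.

```python
from scipy.stats import spearmanr

# Hypothetical overall AVA-Bench scores for the same VFMs under each LLM decoder.
scores_05b = {"vfm_1": 61.2, "vfm_2": 58.4, "vfm_3": 55.0, "vfm_4": 49.7}
scores_7b = {"vfm_1": 66.8, "vfm_2": 63.1, "vfm_3": 60.2, "vfm_4": 54.5}

vfms = sorted(scores_05b)
rho, p = spearmanr([scores_05b[v] for v in vfms], [scores_7b[v] for v in vfms])
print(f"Spearman rank correlation: {rho:.2f} (p={p:.3f})")
# A rho near 1.0 indicates the 0.5B decoder preserves the VFM ordering.
```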
In this way, AVA-Bench aims to serve as a transparent foundation for evaluating the next generation of VFMs.