<ul><li>VELOCITI is a benchmark for studying Video-LLMs by assessing their compositional reasoning on short videos.</li><li>It disentangles and evaluates comprehension of agents, actions, and their associations across multiple events.</li><li>Current video models such as LLaVA-OneVision and Gemini-1.5-Pro fall far short of human accuracy when classifying positive and negative captions.</li><li>The benchmark highlights problems with ClassicVLE and multiple-choice evaluation, and argues for StrictVLE instead.</li></ul>
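
The contrast between the two entailment-style evaluations mentioned in the last point can be sketched as follows. This is a minimal illustration, not the benchmark's own code. It assumes that ClassicVLE counts a caption pair as correct whenever the positive caption scores higher than the negative one, and that StrictVLE additionally requires each score to fall on the correct side of a fixed threshold (0.5 here). The function names, the threshold, and the scores are all hypothetical.

```python
# Hypothetical sketch: compares a relative (ClassicVLE-style) rule with an
# absolute-threshold (StrictVLE-style) rule on made-up entailment scores.
# The definitions below are assumptions, not taken from the benchmark's code.

def classic_vle_correct(pos_score, neg_score):
    # Assumed ClassicVLE rule: the positive caption only has to outscore
    # the negative caption.
    return pos_score > neg_score


def strict_vle_correct(pos_score, neg_score, threshold=0.5):
    # Assumed StrictVLE rule: the positive caption must be accepted and the
    # negative caption must be rejected, each judged against the threshold.
    return pos_score > threshold and neg_score < threshold


def accuracy(pairs, rule):
    # Fraction of (positive, negative) score pairs the rule counts as correct.
    return sum(rule(pos, neg) for pos, neg in pairs) / len(pairs)


# Made-up entailment scores for four (positive, negative) caption pairs.
pairs = [(0.9, 0.2), (0.8, 0.7), (0.4, 0.3), (0.3, 0.6)]

classic_acc = accuracy(pairs, classic_vle_correct)
strict_acc = accuracy(pairs, strict_vle_correct)
print(f"ClassicVLE-style accuracy: {classic_acc:.2f}")
print(f"StrictVLE-style accuracy:  {strict_acc:.2f}")
```

Under these assumptions, a model that rates both captions as likely (0.8 vs. 0.7) or both as unlikely (0.4 vs. 0.3) still passes the relative rule but fails the strict one, which is one way a relative comparison can overstate how well a model understands the video.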

VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
