AURELIA is a novel actor-critic-based audio-visual reasoning framework that improves the ability of audio-visual large language models (AVLLMs) to process complex multi-modal inputs without additional training.
AVReasonBench is a challenging benchmark of 4,500 audio-visual questions, each paired with detailed step-by-step reasoning, for evaluating the reasoning skills of AVLLMs.
Evaluation of 18 AVLLMs on AVReasonBench reveals limitations in their multi-modal reasoning capabilities.
AVLLMs augmented with AURELIA achieve a relative improvement of up to 100%, highlighting the potential of reasoning-enhanced data generation for advancing AVLLMs.
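
To make the "relative improvement" figure concrete, the sketch below computes it under the usual definition, (improved - baseline) / baseline. The function name and the accuracy values are hypothetical and chosen only for illustration; they are not results reported for any model.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative change from baseline to improved, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical accuracies, for illustration only (not figures from the paper).
example_baseline = 20.0  # assumed score without reasoning guidance
example_improved = 40.0  # assumed score with reasoning guidance

print(relative_improvement(example_baseline, example_improved))  # 100.0
```

Under this definition, a 100% relative improvement means the score doubles; it does not mean the model reaches 100% accuracy.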