<ul><li>AURELIA is a novel actor-critic-based audio-visual reasoning framework that improves the ability of AVLLMs to process complex multi-modal inputs without additional training.</li><li>AVReasonBench is a challenging benchmark of 4500 audio-visual questions with detailed step-by-step reasoning, designed to evaluate the reasoning skills of AVLLMs.</li><li>An evaluation of 18 AVLLMs on AVReasonBench reveals limitations in their multi-modal reasoning capabilities.</li><li>AURELIA achieves a relative improvement of up to 100%, highlighting the potential of reasoning-enhanced data generation for advancing AVLLMs.</li></ul>

Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
