Existing benchmarks primarily assess the passive reasoning abilities of large language models (LLMs), supplying all the information needed to solve a problem up front.
A new benchmark, AR-Bench, is introduced to evaluate LLMs' active reasoning skills: models must interact with external sources to acquire the missing evidence required to solve each task.
AR-Bench comprises task families such as detective cases, situation puzzles, and guessing numbers, each probing a different style of reasoning under incomplete information.
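To make the interaction pattern concrete, below is a minimal, hypothetical sketch of an active-reasoning loop in the spirit of a guessing-numbers task: the model repeatedly proposes a query and receives only partial feedback, rather than being handed all evidence at once. The four-digit secret, the bulls-and-cows-style feedback rule, and the `ask_model` stub are illustrative assumptions, not AR-Bench's actual task interface.

```python
# Hypothetical sketch of an active-reasoning loop for a number-guessing task.
# The secret format, feedback rule, and ask_model() stub are assumptions for
# illustration; the benchmark's real protocol may differ.
import random


def judge_guess(secret: str, guess: str) -> tuple[int, int]:
    """Return (exact, partial): digits correct in place, and correct but misplaced."""
    exact = sum(s == g for s, g in zip(secret, guess))
    partial = sum(min(secret.count(d), guess.count(d)) for d in set(guess)) - exact
    return exact, partial


def ask_model(history: list) -> str:
    """Stand-in for querying an LLM with the interaction history; here, a random guess."""
    return "".join(random.sample("0123456789", 4))


def run_episode(secret: str, max_turns: int = 10) -> bool:
    history = []
    for _ in range(max_turns):
        guess = ask_model(history)             # model actively proposes a query
        feedback = judge_guess(secret, guess)  # environment returns partial evidence
        history.append((guess, feedback))
        if feedback[0] == 4:                   # all four digits in place: solved
            return True
    return False


if __name__ == "__main__":
    print("solved:", run_episode(secret="0427"))
```

The key point the sketch illustrates is that success depends on the quality of the questions the model asks across turns, not on reasoning over a fixed, fully specified context.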
Empirical evaluation on AR-Bench shows that current LLMs struggle with active reasoning, indicating the need for new methods to strengthen this capability.