A new benchmark for large language models (LLMs) has been developed that requires only general knowledge rather than specialized 'PhD-level' knowledge.
The benchmark consists of 594 problems based on the NPR Sunday Puzzle Challenge and is challenging for both humans and models.
OpenAI o1 outperforms other reasoning models on the benchmark, revealing capability gaps that existing, expertise-oriented benchmarks do not surface.
Analysis of reasoning outputs also exposes new failure modes, such as models conceding with 'I give up' before producing an answer they know is incorrect.
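As an illustration only (not the paper's actual evaluation pipeline), a minimal sketch of how such concession failures could be flagged automatically; the phrase list, helper names, and record schema here are hypothetical assumptions:

```python
import re

# Hypothetical concession phrases to search for in reasoning transcripts.
CONCESSION_PATTERNS = [r"\bI give up\b", r"\bI'll just guess\b"]

def flags_concession(transcript: str) -> bool:
    """Return True if the reasoning transcript contains a concession phrase."""
    return any(re.search(p, transcript, flags=re.IGNORECASE) for p in CONCESSION_PATTERNS)

def concede_then_wrong(examples):
    """Yield examples where the model concedes yet still answers incorrectly.

    Each example is assumed to be a dict with 'reasoning', 'answer', and
    'gold' fields; this schema is illustrative, not the benchmark's format.
    """
    for ex in examples:
        wrong = ex["answer"].strip().lower() != ex["gold"].strip().lower()
        if flags_concession(ex["reasoning"]) and wrong:
            yield ex

# Toy usage: one transcript that concedes and then gives a wrong answer.
examples = [
    {"reasoning": "Hmm... none of these work. I give up.", "answer": "mango", "gold": "orange"},
]
print(len(list(concede_then_wrong(examples))))  # -> 1
```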