A new benchmark for large language models (LLMs) has been developed that requires only general knowledge rather than specialized 'PhD-level' knowledge.
The benchmark consists of 594 problems based on the NPR Sunday Puzzle Challenge and is challenging for both humans and models.
OpenAI o1 outperforms other reasoning models on the benchmark, revealing capability gaps that existing, expertise-oriented benchmarks do not surface.
Analysis of reasoning outputs also exposes new failure modes, such as models conceding with 'I give up' before producing an answer they know is incorrect.
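As an illustration only (not the paper's actual evaluation pipeline), a minimal sketch of how such concession failures could be flagged automatically; the phrase list, helper names, and record schema here are hypothetical assumptions:

```python
import re

# Hypothetical concession phrases to search for in reasoning transcripts.
CONCESSION_PATTERNS = [r"\bI give up\b", r"\bI'll just guess\b"]

def flags_concession(transcript: str) -> bool:
    """Return True if the reasoning transcript contains a concession phrase."""
    return any(re.search(p, transcript, flags=re.IGNORECASE) for p in CONCESSION_PATTERNS)

def concede_then_wrong(examples):
    """Yield examples where the model concedes yet still answers incorrectly.

    Each example is assumed to be a dict with 'reasoning', 'answer', and
    'gold' fields; this schema is illustrative, not the benchmark's format.
    """
    for ex in examples:
        wrong = ex["answer"].strip().lower() != ex["gold"].strip().lower()
        if flags_concession(ex["reasoning"]) and wrong:
            yield ex

# Toy usage: one transcript that concedes and then gives a wrong answer.
examples = [
    {"reasoning": "Hmm... none of these work. I give up.", "answer": "mango", "gold": "orange"},
]
print(len(list(concede_then_wrong(examples))))  # -> 1
```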