Shojaee et al. (2025) reported that Large Reasoning Models (LRMs) face 'accuracy collapse' on planning puzzles beyond certain complexity thresholds.
This critique argues that the reported failures stem primarily from experimental design issues rather than from inherent reasoning deficiencies.
Key issues include Tower of Hanoi experiments whose complete move sequences exceed model output token limits, so models are scored as failures even when they explicitly acknowledge truncating their answers to stay within those limits.
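A rough calculation makes the scale of the problem concrete: the optimal Tower of Hanoi solution for N disks contains 2^N - 1 moves, so any fixed output budget is exhausted at fairly modest N. The sketch below illustrates this; the tokens-per-move and output-limit figures are illustrative assumptions, not values taken from either paper.

```python
# Back-of-the-envelope growth of the exhaustive Tower of Hanoi move list.
# TOKENS_PER_MOVE and OUTPUT_LIMIT are illustrative assumptions only.
TOKENS_PER_MOVE = 10     # assumed cost of emitting one move, e.g. "move disk 3 from A to C"
OUTPUT_LIMIT = 64_000    # assumed model output budget in tokens

for n in range(5, 21):
    moves = 2 ** n - 1                   # optimal solution length for n disks
    tokens = moves * TOKENS_PER_MOVE
    flag = "  <-- exceeds assumed output budget" if tokens > OUTPUT_LIMIT else ""
    print(f"N={n:2d}: {moves:>9,d} moves, ~{tokens:>10,d} tokens{flag}")
```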
The automated evaluation pipeline cannot distinguish genuine reasoning failures from practical constraints such as output truncation, so model capabilities are systematically misjudged.
The authors also note that the River Crossing benchmark includes mathematically unsolvable instances: under its boat-capacity constraint, no solution exists for N > 5 actor/agent pairs, yet models are still marked as failures for not solving them.
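The unsolvability claim can be checked by exhaustive search over the puzzle's small state space. The sketch below is my reconstruction of the actor/agent constraint (an actor may never share a bank or the boat with another pair's agent unless their own agent is present) and assumes a boat capacity of three; it is not the benchmark's exact implementation.

```python
# Exhaustive BFS over an actors-and-agents River Crossing puzzle.
# State: the set of people on the left bank plus the boat's side.
# Assumptions: boat capacity 3, safety rule applied to both banks and the boat crew.
from collections import deque
from itertools import combinations

def solvable(n: int, capacity: int) -> bool:
    people = frozenset(range(2 * n))       # person 2i = actor i, person 2i+1 = agent i

    def safe(group) -> bool:
        # An actor is safe if their own agent is present or no foreign agent is present.
        actors = {p // 2 for p in group if p % 2 == 0}
        agents = {p // 2 for p in group if p % 2 == 1}
        return all(a in agents or not (agents - {a}) for a in actors)

    start = (people, 0)                    # everyone on the left bank, boat on the left
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                       # everyone has crossed
            return True
        bank = left if boat == 0 else people - left
        for k in range(1, capacity + 1):
            for crew in combinations(sorted(bank), k):
                crew = frozenset(crew)
                new_left = left - crew if boat == 0 else left | crew
                if safe(crew) and safe(new_left) and safe(people - new_left):
                    state = (new_left, 1 - boat)
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
    return False

for n in range(2, 7):
    print(f"N={n}, boat capacity 3: {'solvable' if solvable(n, 3) else 'unsolvable'}")
```

Under these assumptions the search finds N = 6 unsolvable while N <= 5 remain solvable, in line with the classical result for the analogous jealous-couples puzzle with a three-person boat.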
When these experimental artifacts are addressed by requesting a generating function instead of an exhaustive move list, preliminary tests suggest high accuracy on Tower of Hanoi instances previously reported as complete failures.
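To make the generating-function framing concrete, the following sketch shows the kind of constant-size answer a model could return: a standard recursive Hanoi solver that produces the full move sequence on demand. Python is used here for illustration; the critique's exact prompt and target language may differ.

```python
# A compact generator for the optimal Tower of Hanoi move sequence.
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B"):
    """Yield the optimal move sequence for n disks as (disk, from, to) tuples."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)   # clear the n-1 smaller disks
    yield (n, source, target)                        # move the largest disk
    yield from hanoi(n - 1, spare, target, source)   # restack the smaller disks on top

# The function is a fixed-size answer even though the move list it generates
# grows exponentially with n:
assert sum(1 for _ in hanoi(15)) == 2 ** 15 - 1
```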
The critique underscores the importance of careful experimental design when assessing AI reasoning capability.