Apple AI researchers have found that large reasoning models break down on complex problems, challenging assumptions about progress toward artificial general intelligence (AGI).
The authors tested reasoning models such as OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking in controllable puzzle environments where problem complexity could be scaled systematically.
The reasoning models handled moderately complex problems well but struggled as complexity grew, hitting a threshold beyond which their accuracy collapsed.
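To make the setup concrete, here is a minimal sketch of such a complexity sweep, using Tower of Hanoi (one of the puzzles in the study) as the environment. The `query_model` function is a hypothetical stand-in for a real model API, and the prompt and parsing details are assumptions, not the authors' exact protocol; only the idea of scaling puzzle size and exactly verifying the answer mirrors the paper.

```python
"""Sketch: sweep puzzle complexity, ask a model for a full solution,
and check it with an exact verifier, watching for the accuracy collapse
the study reports. `query_model` is a hypothetical placeholder."""

from typing import List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg), pegs numbered 0..2


def verify_hanoi(n_disks: int, moves: List[Move]) -> bool:
    """Replay a move sequence and check it legally solves n-disk Hanoi."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds all disks
    for src, dst in moves:
        if not pegs[src]:
            return False                       # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # larger disk onto smaller
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # all disks on peg 2


def query_model(prompt: str) -> List[Move]:
    """Hypothetical LLM call; swap in a real API client plus a parser
    that extracts (from_peg, to_peg) pairs from the model's answer."""
    raise NotImplementedError


def accuracy_vs_complexity(max_disks: int, trials: int = 10) -> dict:
    """Sweep puzzle size and record solve rate to locate the collapse."""
    results = {}
    for n in range(1, max_disks + 1):
        prompt = (f"Solve Tower of Hanoi with {n} disks. "
                  "List moves as (from_peg, to_peg) pairs, pegs 0-2.")
        solved = sum(verify_hanoi(n, query_model(prompt))
                     for _ in range(trials))
        results[n] = solved / trials
    return results


if __name__ == "__main__":
    # Verifier sanity check with the known 2-disk solution.
    assert verify_hanoi(2, [(0, 1), (0, 2), (1, 2)])
```

Because every answer is checked by an exact verifier rather than judged heuristically, a plot of `accuracy_vs_complexity` output would show the kind of sharp drop-off the study describes.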
The study suggests that current approaches built on large reasoning models may face fundamental barriers to generalizable reasoning, raising questions about how much they advance the field toward AGI.