- LRMs (Large Reasoning Models) are advanced Large Language Models that perform step-by-step reasoning, typically elicited through Chain-of-Thought prompting.
- Models such as DeepSeek-R1, trained with reinforcement learning, marked a shift toward reasoning-focused approaches.
- Existing benchmarks used to assess LRMs are criticized for data contamination and for producing misleading performance measures.
- The authors propose structured puzzle environments with controllable complexity, such as Tower of Hanoi and Checker Jumping, as a cleaner evaluation method (a minimal sketch of such an environment follows this list).
- Three performance regimes emerge across complexity levels, revealing distinct strengths and weaknesses of LRMs.
- At low complexity, LRMs often underperform traditional LLMs; at medium complexity, their reasoning traces give them an advantage.
- At high complexity, both LLMs and LRMs fail, with LRMs actually reducing reasoning effort despite unused token budgets.
- LRMs exhibit "overthinking" on simple tasks, exploring incorrect alternatives after reaching a correct answer, and struggle to judge when to push further.
- On harder tasks, models tend to "give up," cutting reasoning depth even with token budget remaining, which points to architectural rather than resource limitations.
- The study challenges the assumption that scaling model size and data alone yields better generalization, indicating a need for improved architectures.
- Current models are criticized for relying on pattern reuse rather than true reasoning, hindering progress toward AGI.
- The paper questions existing metrics for measuring machine intelligence and suggests emphasizing creativity and genuine understanding in AI development.
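To make the puzzle-environment idea concrete, here is a minimal sketch of a Tower of Hanoi evaluator in Python: the number of disks acts as the complexity knob (the optimal solution needs 2**n - 1 moves, so difficulty grows exponentially while the rules stay fixed), and a model-produced move sequence can be checked exactly, step by step. The `TowerOfHanoiEnv` class and `score_trace` helper are illustrative assumptions, not the paper's actual harness.

```python
# Minimal sketch of a controllable-complexity puzzle environment in the
# spirit of the paper's evaluation setup. Names and API are assumptions.

class TowerOfHanoiEnv:
    """Tower of Hanoi with n disks; n is the complexity knob."""

    def __init__(self, n_disks: int):
        self.n_disks = n_disks
        # Peg 0 holds all disks, largest (n) at the bottom, smallest (1) on top.
        self.pegs = [list(range(n_disks, 0, -1)), [], []]

    def apply(self, src: int, dst: int) -> bool:
        """Apply a move if legal; return False on an illegal move."""
        if not self.pegs[src]:
            return False  # nothing to move from the source peg
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def is_solved(self) -> bool:
        # Solved when every disk has been moved to the third peg.
        return len(self.pegs[2]) == self.n_disks

def score_trace(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Replay a model's move sequence and verify it legally solves the puzzle."""
    env = TowerOfHanoiEnv(n_disks)
    return all(env.apply(s, d) for s, d in moves) and env.is_solved()

if __name__ == "__main__":
    # The classic 3-disk solution: 7 moves from peg 0 to peg 2.
    solution = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
    print(score_trace(3, solution))       # True
    print(score_trace(3, solution[:-1]))  # False: puzzle not finished
```

Because correctness is checked move by move against fixed rules, this style of evaluation sidesteps data contamination and lets difficulty be dialed up smoothly, which is what exposes the three performance regimes described above.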