Although verifying a 12-team round-robin tournament schedule is a simple task in combinatorial mathematics, major AI platforms repeatedly failed to do it accurately despite numerous attempts.
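For context, a complete single round-robin for 12 teams is small and fully determined in size: 11 rounds of 6 matches, or C(12, 2) = 66 pairings, with every pair of teams meeting exactly once and no team ever paired with itself. The case study does not describe how the schedule was constructed, so the standard circle-method sketch below is only an illustration of that structure, not the schedule from the article.

```python
def round_robin(n):
    """Build a single round-robin for n teams (n even) with the classic
    circle method: one team stays fixed, the rest rotate each round."""
    teams = list(range(n))
    rounds = []
    for _ in range(n - 1):
        # Fold the circle: first vs last, second vs second-to-last, ...
        rounds.append([(teams[i], teams[n - 1 - i]) for i in range(n // 2)])
        # Rotate every team except the fixed first one by one position.
        teams = [teams[0], teams[-1]] + teams[1:-1]
    return rounds

schedule = round_robin(12)
print(len(schedule), "rounds,", sum(len(r) for r in schedule), "matches")  # 11 rounds, 66 matches
```

Keeping one team fixed while the others cycle is what guarantees each pairing occurs exactly once across the 11 rounds.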
The AI systems involved, Claude, Grok, ChatGPT, and DeepSeek, collectively representing over $100B in VC funding, exhibited a variety of failures: hallucinated duplicates, invalid same-team flags, and false declarations of success.
Those failures included claiming schedules were error-free while duplicates remained, breakdowns in pattern recognition, and memoryless iteration that repeated earlier mistakes, ultimately requiring human intervention to complete the verification.
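Each of the reported error types is mechanically checkable. The sketch below assumes a schedule represented as a list of rounds of (team, team) pairs, a representation chosen here for illustration rather than the format given to the models, and flags self-matches, duplicate pairings, and missing matchups.

```python
from collections import Counter
from itertools import combinations

def verify_round_robin(schedule, n_teams=12):
    """Collect every violation in a claimed single round-robin schedule:
    self-matches, duplicated pairings, and pairings that never occur."""
    errors = []
    seen = Counter()
    for rnd, matches in enumerate(schedule, start=1):
        for a, b in matches:
            if a == b:
                errors.append(f"round {rnd}: team {a} is scheduled against itself")
            else:
                seen[frozenset((a, b))] += 1
    for pair, count in seen.items():
        if count > 1:
            a, b = sorted(pair)
            errors.append(f"teams {a} and {b} meet {count} times (should be exactly 1)")
    for a, b in combinations(range(n_teams), 2):
        if frozenset((a, b)) not in seen:
            errors.append(f"teams {a} and {b} never meet")
    return errors  # an empty list is the only legitimate "error-free" verdict
```

Run against the circle-method schedule above, the check returns an empty list; corrupting a single pairing produces specific, line-item complaints rather than a blanket assurance of correctness.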
The case study highlights that even today's most advanced AI systems struggle with basic combinatorial verification without human assistance, as demonstrated by Mr. McKenzie's manual verification protocol outperforming the billion-dollar AIs.