Large language models (LLMs) show impressive capabilities in mathematical reasoning.
A new benchmark, the Mathematical Topics Tree (MaTT), is introduced to evaluate LLMs across a comprehensive range of mathematical subjects.
GPT-4, one of the most capable LLMs evaluated, achieved only 54% accuracy in the multiple-choice setting of the MaTT benchmark.
LLMs' performance varied substantially across mathematical topics, and in many instances their explanations were found to be incomplete or inaccurate.