Large language models (LLMs) show impressive capabilities in mathematical reasoning.
A new benchmark, the Mathematical Topics Tree (MaTT), is introduced to evaluate LLMs across a comprehensive range of mathematical subjects.
GPT-4, one of the most capable LLMs evaluated, achieved only 54% accuracy in the multiple-choice setting of the MaTT benchmark.
LLMs' performance varied substantially across mathematical topics, and in many instances their explanations were found to be incomplete or inaccurate.