Epoch AI has released FrontierMath, a new benchmark for evaluating the mathematical capabilities of large language models (LLMs). The benchmark consists of novel, difficult problems developed in collaboration with 60 mathematicians, and current LLMs solve only about 2% of them correctly. This low score highlights the gap between today's models and expert human mathematicians and raises questions about how reliable their mathematical reasoning really is.

The problems have integer answers, so solutions can be verified automatically with Python scripts. Among the models tested, o1-preview performed the strongest across repeated trials. Epoch AI plans to develop more benchmarks of this kind and to add further evaluation methods, while noting that hard benchmarks alone are not enough and should be complemented by easier, more accessible evaluations. FrontierMath stands out as an assessment tool because its problems demand lengthy and precise chains of reasoning.
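As a rough illustration of how integer-answer grading of this kind can be automated, here is a minimal Python sketch. The function and data layout (verify_answer, a list of problems with an expected integer each) are hypothetical and not Epoch AI's actual grading harness; they only show the general idea of parsing a model's submitted answer and comparing it to the expected integer.

```python
# Illustrative sketch of automated integer-answer verification.
# The names and data structures below are hypothetical, not Epoch AI's code.

def verify_answer(submitted: str, expected: int) -> bool:
    """Return True if the submitted answer parses to the expected integer."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        # Non-integer output counts as incorrect.
        return False

# Hypothetical problem set and model submissions.
problems = [
    {"id": "p1", "expected": 42},
    {"id": "p2", "expected": 1729},
]
submissions = {"p1": "42", "p2": "1730"}

solved = sum(
    verify_answer(submissions.get(p["id"], ""), p["expected"]) for p in problems
)
print(f"Solved {solved}/{len(problems)} problems "
      f"({100 * solved / len(problems):.1f}%)")
```

Because every answer is a single integer, grading reduces to an exact comparison like this, which is what makes large-scale, fully automatic evaluation of the benchmark practical.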