Epoch AI has released FrontierMath, a new benchmark for evaluating the mathematical capabilities of large language models (LLMs). The benchmark consists of novel, difficult problems developed in collaboration with 60 mathematicians, and current LLMs solve only about 2% of them correctly. This low score highlights the gap between today's models and expert human mathematicians and raises questions about how reliable their mathematical reasoning really is.

The problems have integer answers, so solutions can be verified automatically with Python scripts. Among the models tested, o1-preview performed the strongest across repeated trials. Epoch AI plans to develop more benchmarks of this kind and to add further evaluation methods, while noting that hard benchmarks alone are not enough and should be complemented by easier, more accessible evaluations. FrontierMath stands out as an assessment tool because its problems demand lengthy and precise chains of reasoning.
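As a rough illustration of how integer-answer grading of this kind can be automated, here is a minimal Python sketch. The function and data layout (verify_answer, a list of problems with an expected integer each) are hypothetical and not Epoch AI's actual grading harness; they only show the general idea of parsing a model's submitted answer and comparing it to the expected integer.

```python
# Illustrative sketch of automated integer-answer verification.
# The names and data structures below are hypothetical, not Epoch AI's code.

def verify_answer(submitted: str, expected: int) -> bool:
    """Return True if the submitted answer parses to the expected integer."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        # Non-integer output counts as incorrect.
        return False

# Hypothetical problem set and model submissions.
problems = [
    {"id": "p1", "expected": 42},
    {"id": "p2", "expected": 1729},
]
submissions = {"p1": "42", "p2": "1730"}

solved = sum(
    verify_answer(submissions.get(p["id"], ""), p["expected"]) for p in problems
)
print(f"Solved {solved}/{len(problems)} problems "
      f"({100 * solved / len(problems):.1f}%)")
```

Because every answer is a single integer, grading reduces to an exact comparison like this, which is what makes large-scale, fully automatic evaluation of the benchmark practical.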