OpenAI o1 Can’t Do Maths, But Excels at Making Excuses

  • Epoch AI released a new benchmark named FrontierMath to evaluate the mathematical capabilities of LLMs, and it revealed a startling low for these systems: on genuinely hard problems, LLMs are years behind human intelligence. Models that score well on existing benchmarks such as Omni-MATH, MathVista, and GSM8K fail significantly on FrontierMath’s harder problems, and the researchers suggest that testing LLMs on harder benchmarks may be a better way to assess their overall capabilities.
  • Even OpenAI has mentioned that it does not want to benchmark o1 on MATH and GSM8K, since those evaluation methods are quite outdated and most LLMs can easily achieve high scores on them.
  • The test problems have integer answers, and the solutions are verified automatically using Python scripts (a minimal sketch of such a checker follows this list). In addition, Epoch AI claims that the problems are “guess proof”, meaning every problem must be fully solved to arrive at the answer.
  • Moreover, the problems in the benchmark are all new and unpublished, alleviating any concerns of ‘contamination’ from existing benchmarks. They were developed in collaboration with 60 mathematicians.
  • Several mathematicians praised the benchmark and indicated that it contained one of the most complex sets of problems.
  • AI skeptics, who argue that LLMs are copy-paste engines incapable of original thought, may find this benchmark helpful.
  • o1 did win an important challenge: while OpenAI claims that o1 is the best LLM to date, it did not perform well on the mathematical benchmark overall. However, “when re-evaluating these problems that were solved at least once, o1-preview demonstrated the strongest performance across repeated trials,” Epoch AI said in the research paper.
  • Epoch AI’s future plans include developing more such tests and implementing other methods for better assessment.
  • Assessing these models on such tough benchmarks isn’t everything. Andrej Karpathy says that it’s an interesting challenge to create evals for all the ‘easy’ stuff that is secretly hard.
  • To perform a fair evaluation, the researchers tested the LLMs repeatedly on four of the problems that they had all solved correctly (see the repeated-trials sketch after this list).
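
As a rough illustration of the automatic verification described above, here is a minimal Python sketch of checking integer answers against a reference value. The problem IDs, answer values, and the check_answer helper are hypothetical placeholders, not Epoch AI’s actual grading scripts.

```python
# Hypothetical sketch of automated grading for integer-answer problems.
# Problem data and helper names are placeholders, not FrontierMath's real harness.

def check_answer(model_output: str, reference_answer: int) -> bool:
    """Return True only if the model's final output parses to the exact reference integer."""
    try:
        # Exact-match check: the answer must equal the integer itself,
        # so there is no partial credit and lucky near-misses fail.
        return int(model_output.strip()) == reference_answer
    except ValueError:
        # Anything that is not a clean integer counts as incorrect.
        return False

# Toy problem set with integer reference answers (made-up values).
problems = [
    {"id": "p1", "reference_answer": 42},
    {"id": "p2", "reference_answer": 1729},
]

# Stand-ins for final answers extracted from a model's responses.
model_outputs = {"p1": "42", "p2": "1730"}

for problem in problems:
    solved = check_answer(model_outputs[problem["id"]], problem["reference_answer"])
    print(problem["id"], "solved" if solved else "failed")
```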
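
In the same spirit, the sketch below shows one way the repeated-trials re-evaluation could work: re-running each model several times on the handful of problems that were solved at least once and counting successes per model. The model names, trial count, and run_model stub are assumptions for illustration only.

```python
import random
from collections import defaultdict

# Hypothetical repeated-trials re-evaluation: re-run each model several times on the
# four re-tested problems and count how often each one produces a correct answer.
TRIALS = 8
models = ["o1-preview", "model-b"]   # placeholder model names
problems = ["p1", "p2", "p3", "p4"]  # placeholder IDs for the four re-tested problems

def run_model(model: str, problem: str) -> bool:
    """Stand-in for querying a model and verifying its answer; real runs would call an API."""
    return random.random() < 0.5  # random stub, not actual model behaviour

success_counts = defaultdict(int)
for model in models:
    for problem in problems:
        for _ in range(TRIALS):
            if run_model(model, problem):
                success_counts[model] += 1

for model in models:
    rate = success_counts[model] / (len(problems) * TRIALS)
    print(f"{model}: correct on {rate:.0%} of repeated trials")
```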
