OpenAI's o3 model scored 75.7% on the super-difficult ARC-AGI benchmark, a test of novel tasks and fluid intelligence, while a high-compute version reached 87.5%, setting a new high-water mark for AI reasoning.
The ARC-AGI benchmark is widely considered one of the key tests of an AI system's ability to perform the kind of abstract reasoning required on the road to artificial general intelligence (AGI).
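To make the task format concrete, the snippet below sketches a toy, made-up puzzle in the ARC style (not an actual ARC-AGI task): a solver is shown a few demonstration input/output grids, must infer the hidden transformation, and then apply it to a fresh test input. The grids, colour codes, and the "mirror" rule are illustrative assumptions, not data from the benchmark.

```python
# A toy, made-up task in the ARC style (not an actual ARC-AGI puzzle).
# Each grid is a list of lists of colour codes; the hidden rule here is
# "mirror the grid left-to-right".
demo_pairs = [
    ([[1, 0, 0],
      [2, 2, 0]],
     [[0, 0, 1],
      [0, 2, 2]]),
]
test_input = [[3, 0, 0],
              [0, 3, 3]]

def mirror(grid):
    """The transformation a solver must infer from the demonstrations."""
    return [row[::-1] for row in grid]

# Check that the rule explains the demonstration, then apply it to the test input.
assert all(mirror(inp) == out for inp, out in demo_pairs)
print(mirror(test_input))   # [[0, 0, 3], [3, 3, 0]]
```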
o3 can adapt to tasks it has never encountered before, demonstrating a degree of novel-task adaptation not seen in any previous model in the GPT family and marking a qualitative shift in AI capabilities.
The key to solving novel problems is 'program synthesis', in which a reasoning system writes small programs that solve specific sub-problems and combines them to tackle more complex ones; o3 possibly uses chain-of-thought (CoT) reasoning coupled with a search mechanism and a reward model.
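The sketch below is a minimal, hypothetical illustration of program synthesis as search, not a description of o3's actual internals: it enumerates compositions of hand-written primitive functions and scores each candidate against input/output examples, with the scoring function standing in for a reward model and the brute-force enumeration standing in for the search mechanism. The primitive names and the simplified 1-D "grid" representation are assumptions made purely for brevity.

```python
# Toy program synthesis: search over compositions of small primitives and
# keep the candidate that best explains the input/output examples.
from itertools import product

# Hypothetical primitives operating on lists of ints (ARC-style grids are
# reduced to 1-D lists here purely for illustration).
PRIMITIVES = {
    "reverse": lambda xs: xs[::-1],
    "increment": lambda xs: [x + 1 for x in xs],
    "double": lambda xs: [x * 2 for x in xs],
    "sort": lambda xs: sorted(xs),
}

def run_program(program, xs):
    """Apply a sequence of primitive names to the input."""
    for name in program:
        xs = PRIMITIVES[name](xs)
    return xs

def score(program, examples):
    """Reward-model stand-in: fraction of examples the program reproduces."""
    hits = sum(run_program(program, inp) == out for inp, out in examples)
    return hits / len(examples)

def synthesize(examples, max_depth=3):
    """Brute-force search over compositions of primitives (the 'search mechanism')."""
    best, best_score = None, -1.0
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            s = score(program, examples)
            if s > best_score:
                best, best_score = program, s
            if best_score == 1.0:
                return best, best_score
    return best, best_score

if __name__ == "__main__":
    # Task shown only via examples: sort the list, then double every element.
    examples = [([3, 1, 2], [2, 4, 6]), ([5, 4], [8, 10])]
    program, s = synthesize(examples)
    print(program, s)  # e.g. ('double', 'sort') 1.0
```

A real system would replace the brute-force enumeration with learned, guided search over a far richer space of operations; the point here is only the structure: generate candidate programs, score them, keep the best.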
However, very little is known about how o3 works under the hood; scientists' opinions diverge, and there is an ongoing debate about whether scaling LLMs through ever more training data and compute has hit a wall.
The name ARC-AGI is misleading, because passing ARC-AGI does not equate to achieving AGI; o3 still fails on some very easy tasks, pointing to fundamental differences from human intelligence.
OpenAI's o3 was fine-tuned on the ARC training set, whereas an ideal solver should not require much task-specific training; further verification is therefore needed to measure how much of o3's score reflects genuine abstraction and reasoning.
Francois Chollet and his team are working on a new benchmark designed to challenge o3's novel capabilities, which could push o3's score below 30% even at a high compute budget.
Despite o3's impressive score on the ARC-AGI benchmark, humans can still solve 95% of the puzzles without any training.
It is worth noting that o3 consumes a significant amount of compute and tokens, costing $17-20 and roughly 33 million tokens to solve each puzzle; these costs are expected to fall as inference becomes cheaper.