OpenAI’s o3 model shows promising results through test-time scaling, but those gains come at a steep cost. The model scored high on a difficult math benchmark on which no other model had scored more than 2%. However, the logarithmic x-axis on the results chart shows that o3 used more than $1,000 worth of compute on every task analyzed. To achieve this, OpenAI is either using more chips to answer a user’s question, running more powerful inference chips, or running those chips for a longer period before the model produces an answer.
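One common way to spend extra compute at inference time is to sample many candidate answers and aggregate them, for example by majority vote. The article does not describe o3's internals, so this is only a minimal illustrative sketch of the general idea; `sample_answer` is a hypothetical stand-in for a real model call:

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Hypothetical stand-in for one stochastic model call; a real system
    # would query an LLM here. The answer pool is biased toward the
    # correct "42" but occasionally wrong, mimicking a noisy model.
    return random.choice(["42"] * 9 + ["41"])

def majority_vote(question: str, n_samples: int) -> str:
    # Test-time scaling: spend n_samples times the compute of a single
    # call, then keep the most common answer. Accuracy tends to rise
    # with n_samples, and so does the compute bill.
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 * 7?", n_samples=25))
```

The cost grows linearly with `n_samples`, which is one reason a single hard task can consume thousands of dollars of compute.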
The creators of o3 expect this trajectory of improved performance to continue. That performance comes at a cost, however, and raises new questions around usage and pricing. Test-time scaling has become the industry's leading approach to further scaling AI, but there is concern about its elevated cost of use.
The high-scoring version of o3 used more than $10,000 in compute to complete a single difficult benchmark, which makes it too expensive for full-scale commercial use. AI models that scale test-time compute may therefore suit only large-scale projects, as organizations and institutions with deep pockets may be the only ones able to afford o3.
The o3 model is capable of adapting to tasks it has never encountered and arguably approaches human-level performance, as evidenced by its results on the ARC-AGI benchmark. However, it still fails at some tasks that humans find quite easy, and it is not AGI.
Although the high cost of o3's test-time scaling can be a disincentive, there is still enthusiasm about the technology's potential. Given that many companies view AI as a competitive advantage, test-time compute looks like the next step in scaling AI models.
OpenAI's use of test-time compute has also made the cost of running AI systems less predictable. Until now, companies could estimate the cost of serving a generative model fairly precisely, but that has become more difficult given the variable computational needs of test-time compute.
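The unpredictability can be made concrete: with a roughly fixed answer length, per-query cost is known in advance, whereas a reasoning model may spend anywhere from a few hundred to hundreds of thousands of tokens on a hard task. A toy sketch, using a hypothetical per-token price (not an actual OpenAI rate):

```python
# Hypothetical price; real per-token rates vary by provider and model.
PRICE_PER_1K_OUTPUT_TOKENS = 0.06  # assumed for illustration only

def query_cost(output_tokens: int) -> float:
    """Dollar cost of one query at the assumed output-token price."""
    return output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS

# Classic serving: roughly fixed answer length, so cost is predictable.
fixed_cost = query_cost(500)

# Test-time compute: token spend is chosen at inference time and can
# vary by orders of magnitude from one task to the next.
worst_case = query_cost(200_000)

print(f"fixed: ${fixed_cost:.2f}, worst case: ${worst_case:.2f}")
```

The gap between the typical and worst-case figures is what makes capacity planning and pricing harder for companies serving such models.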
Investors expect progress in AI to be faster next year than last year. They predict that the industry will combine test-time scaling with traditional pre-training scaling to create better AI models and improve their performance.
The o3 model adds credibility to the claim that test-time compute is the tech industry's next best way to scale AI models. Although it is expensive, the model is capable of achieving unique adaptations and performance milestones.
There may be potential for further gains in test-time scaling through the design of better AI inference chips. A number of start-ups working in this space could play a larger role in test-time scaling going forward.
While the o3 model is a notable improvement in the performance of AI models, it raises questions around usage and costs. Organizations with deep pockets may be the only ones that can currently afford o3, but given industry momentum around test-time compute, the use of such expensive models looks set to increase in the coming years.