Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems.
This research investigates the benefits and limitations of inference-time scaling methods across nine state-of-the-art models and eight challenging tasks.
The advantages of inference-time scaling vary across tasks and diminish as problem complexity increases.
Results show that, for some tasks, conventional models can achieve performance close to that of advanced reasoning models, while for other tasks a performance gap remains.