Test-time scaling aims to enhance the reasoning of Large Language Models (LLMs) by using more compute at inference time, enabling extrapolation to improved performance on challenging problems.
Existing reasoning models generally do not extrapolate well; one way to enable extrapolation is to train the LLM to engage in in-context exploration. In-context exploration means training the LLM to spend its test-time compute productively, chaining operations and testing multiple hypotheses before committing to an answer.
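As a rough illustration, the loop below makes this behavior explicit. In a trained model the exploration emerges inside a single chain of thought rather than through external orchestration; `generate` and `verify` here are hypothetical callables standing in for the model's own skills, not a real API.

```python
from typing import Callable, Optional, Tuple

def explore(
    problem: str,
    generate: Callable[[str, list], Tuple[str, int]],  # returns (hypothesis, tokens spent)
    verify: Callable[[str, str], Tuple[bool, int]],    # returns (passes?, tokens spent)
    token_budget: int,
) -> Optional[str]:
    """Sketch of in-context exploration: chain generation and verification,
    testing hypotheses until one survives or the token budget runs out."""
    trace: list = []   # record of attempts, playing the role of the growing context
    spent = 0
    answer = None
    while spent < token_budget:
        hypothesis, g_cost = generate(problem, trace)  # propose a candidate
        ok, v_cost = verify(problem, hypothesis)       # test it before answering
        spent += g_cost + v_cost
        trace.append((hypothesis, ok))
        answer = hypothesis
        if ok:  # stop once a hypothesis survives verification
            break
    return answer
```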
The proposed recipe, e3, combines three ingredients: chaining skills, leveraging negative gradients during RL training, and coupling task difficulty with the training token budget. Together these enable in-context exploration, yielding improved performance and extrapolation of test-time compute for LLMs.
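To make two of these ingredients concrete, below is a minimal sketch, not the paper's implementation. The REINFORCE-style loss and the linear budget schedule are illustrative assumptions; `logprobs`, `rewards`, and `solve_rate` are hypothetical inputs standing in for per-trace log-probabilities, 0/1 correctness rewards, and the model's measured success rate on a task.

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss in which incorrect traces receive negative
    advantage, so their probability is pushed down rather than ignored."""
    advantages = rewards - rewards.mean()  # failures end up below the mean
    return -(advantages * logprobs).mean()

def token_budget(solve_rate: float, base: int = 4096, cap: int = 16384) -> int:
    """Couple task difficulty with the training token budget: harder tasks
    (lower solve rate) get a longer budget, leaving room to chain skills
    and explore before answering. The linear schedule is an assumption."""
    difficulty = 1.0 - solve_rate
    return int(base + difficulty * (cap - base))
```

In this sketch, failed traces actively reduce their own likelihood instead of being filtered out, and the budget grows with difficulty so harder problems get more room for in-context exploration.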