A new paper proposes a method for training value models on long-context reasoning traces for efficient chain-of-thought reasoning.
The method requires no detailed notion of a reasoning 'step': a 1.5B token-level value model is trained on a dataset of 2.5 million reasoning traces, and performance improves as test-time compute scales.
Using block-wise value-guided search followed by a final weighted majority vote, the approach scales better at test time than standard methods such as majority voting or best-of-n sampling.
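The search-and-vote procedure can be sketched roughly as follows. This is a minimal toy illustration, not the paper's implementation: the generator, value function, beam width, block count, and the string-based "traces" are all hypothetical stand-ins for an LLM policy and the 1.5B token-level value model.

```python
from collections import defaultdict

def value_guided_search(generate_block, value_fn, n_beams=2, expand=2, n_blocks=3):
    """Block-wise value-guided search (sketch): extend each partial trace by one
    block at a time, score partial traces with the value model, keep the top beams."""
    beams = [""]
    for _ in range(n_blocks):
        # Branch each beam into `expand` candidate continuations.
        extended = [b + generate_block(b, i) for b in beams for i in range(expand)]
        # Rank partial traces by the value model's score and prune to the beam width.
        extended.sort(key=value_fn, reverse=True)
        beams = extended[:n_beams]
    return beams

def weighted_majority_vote(answers, value_fn):
    """Final aggregation: sum value scores per distinct answer, return the argmax."""
    weights = defaultdict(float)
    for a in answers:
        weights[a] += value_fn(a)
    return max(weights, key=weights.get)

# Toy stand-ins (the paper uses an LLM policy and a learned value model instead):
gen = lambda prefix, i: "ab"[i]       # each beam branches into an 'a' or 'b' block
val = lambda trace: trace.count("a")  # toy "value" = number of 'a' blocks

beams = value_guided_search(gen, val)
best = weighted_majority_vote(beams, val)
```

In this toy run the search keeps the two highest-value traces and the weighted vote then picks the trace family with the largest total value mass, which is how the vote can overrule a plain count-based majority.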
With an inference budget of 64 generations, the proposed method reaches an average accuracy of 45.7% across several math benchmarks, matching the performance of majority voting while requiring fewer inference FLOPs.