A new paper proposes a method for training value models on long-context reasoning traces for efficient chain-of-thought reasoning.
The method requires no detailed notion of a reasoning 'step': a 1.5B token-level value model is trained on a dataset of 2.5 million reasoning traces, and performance improves as test-time compute scales.
Using block-wise value-guided search followed by a final weighted majority vote, the approach scales better at test time than standard methods such as majority voting or best-of-n sampling.
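The search-and-vote procedure can be sketched roughly as follows. This is a minimal toy illustration, not the paper's implementation: the generator, value function, beam width, block count, and the string-based "traces" are all hypothetical stand-ins for an LLM policy and the 1.5B token-level value model.

```python
from collections import defaultdict

def value_guided_search(generate_block, value_fn, n_beams=2, expand=2, n_blocks=3):
    """Block-wise value-guided search (sketch): extend each partial trace by one
    block at a time, score partial traces with the value model, keep the top beams."""
    beams = [""]
    for _ in range(n_blocks):
        # Branch each beam into `expand` candidate continuations.
        extended = [b + generate_block(b, i) for b in beams for i in range(expand)]
        # Rank partial traces by the value model's score and prune to the beam width.
        extended.sort(key=value_fn, reverse=True)
        beams = extended[:n_beams]
    return beams

def weighted_majority_vote(answers, value_fn):
    """Final aggregation: sum value scores per distinct answer, return the argmax."""
    weights = defaultdict(float)
    for a in answers:
        weights[a] += value_fn(a)
    return max(weights, key=weights.get)

# Toy stand-ins (the paper uses an LLM policy and a learned value model instead):
gen = lambda prefix, i: "ab"[i]       # each beam branches into an 'a' or 'b' block
val = lambda trace: trace.count("a")  # toy "value" = number of 'a' blocks

beams = value_guided_search(gen, val)
best = weighted_majority_vote(beams, val)
```

In this toy run the search keeps the two highest-value traces and the weighted vote then picks the trace family with the largest total value mass, which is how the vote can overrule a plain count-based majority.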
With an inference budget of 64 generations, the proposed method reaches an average accuracy of 45.7% across several math benchmarks, matching the performance of majority voting while requiring fewer inference FLOPs.