<ul><li>Offline reinforcement learning (RL) is proposed as a way to improve the multi-step reasoning ability of large language models (LLMs).</li><li>The method, OREO (Offline Reasoning Optimization), jointly learns a policy model and a value function by optimizing the soft Bellman equation.</li><li>OREO reduces the need to collect pairwise data and enables better credit assignment in multi-step reasoning tasks.</li><li>Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks.</li></ul>

Offline Reinforcement Learning for LLM Multi-Step Reasoning
