Offline reinforcement learning (RL) is proposed as a way to improve the multi-step reasoning ability of large language models (LLMs). The method, called OREO (Offline Reasoning Optimization), jointly learns a policy model and a value function by optimizing the soft Bellman equation. OREO reduces the need to collect pairwise data and enables finer-grained credit assignment in multi-step reasoning tasks. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks.
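To make the joint objective concrete, the following is a minimal toy sketch of a soft Bellman consistency loss on a single offline trajectory. All quantities here (the temperature `beta`, the log-probabilities, values, and rewards) are illustrative placeholders, not values or an implementation from the source; in entropy-regularized RL, the optimal policy satisfies `beta * log pi(a_t|s_t) = r_t + V(s_{t+1}) - V(s_t)`, so squaring this residual gives a loss that can train the policy and value function jointly.

```python
# Toy sketch (assumed setup): one 3-step trajectory from an offline dataset.
beta = 0.1                         # entropy-regularization temperature (assumed)
log_pi = [-1.2, -0.8, -0.5]        # log pi(a_t | s_t) from the policy model
values = [0.3, 0.5, 0.9, 1.0]      # V(s_0) .. V(s_3) from the value function
rewards = [0.0, 0.0, 1.0]          # sparse reward arriving at the final step

# Soft Bellman residual at each step:
# beta * log pi(a_t|s_t) should match r_t + V(s_{t+1}) - V(s_t).
residuals = [
    beta * log_pi[t] - (rewards[t] + values[t + 1] - values[t])
    for t in range(len(rewards))
]

# Mean squared residual over the trajectory; in practice this would be
# backpropagated into both the policy and the value network.
loss = sum(r ** 2 for r in residuals) / len(residuals)
print(round(loss, 4))  # → 0.5518
```

Because the residual is computed per step, each intermediate action receives its own learning signal, which is how this style of objective supports finer-grained credit assignment than trajectory-level pairwise comparisons.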