Source: arXiv

Offline Reinforcement Learning for LLM Multi-Step Reasoning

  • The paper proposes an offline reinforcement learning (RL) method to improve the multi-step reasoning ability of large language models (LLMs).
  • The method, OREO (Offline Reasoning Optimization), jointly learns a policy model and a value function by optimizing the soft Bellman equation.
  • OREO reduces the need to collect pairwise preference data and enables better credit assignment across the steps of a multi-step reasoning task.
  • Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks.
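
To make the soft Bellman idea in the list above more concrete, here is a minimal Python sketch. It is not the OREO implementation. It assumes the generic KL-regularized consistency condition V(s_t) - V(s_{t+1}) = r_t - beta * log(pi(a_t|s_t) / pi_ref(a_t|s_t)), and the names `soft_bellman_residuals`, `mean_squared_residual`, `beta`, and all the toy numbers are hypothetical choices made only for illustration.

```python
def soft_bellman_residuals(values, rewards, logp_policy, logp_ref, beta=1.0):
    """Per-step residuals of an assumed KL-regularized soft Bellman consistency.

    values[t] is V(s_t); values has one more entry than rewards, and its
    last entry is the value of the terminal state (usually 0.0).
    A residual of 0.0 means step t satisfies the consistency condition.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    if not (len(rewards) == len(logp_policy) == len(logp_ref)):
        raise ValueError("rewards and log-probabilities must have equal length")

    residuals = []
    for t in range(len(rewards)):
        # Log-ratio between the trained policy and the reference policy.
        log_ratio = logp_policy[t] - logp_ref[t]
        # One-step target: reward plus next value, minus the KL penalty term.
        target = rewards[t] + values[t + 1] - beta * log_ratio
        residuals.append(values[t] - target)
    return residuals


def mean_squared_residual(residuals):
    """Squared-error objective that both policy and value would reduce."""
    return sum(r * r for r in residuals) / len(residuals)


# Toy three-step reasoning trajectory with a sparse reward on the last step.
rewards = [0.0, 0.0, 1.0]
logp_policy = [-0.5, -0.7, -0.2]
logp_ref = [-0.5, -0.7, -0.2]        # policy equals the reference here
values = [1.0, 1.0, 1.0, 0.0]        # V(s_0..s_2) plus terminal value

residuals = soft_bellman_residuals(values, rewards, logp_policy, logp_ref)
loss = mean_squared_residual(residuals)  # all residuals are 0.0 here
```

Because the residual is defined at every step, a sparse final reward is propagated back through the value estimates, which is one way to read the credit-assignment claim. The computation also needs only single trajectories with rewards, not paired preferred and rejected responses.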
