Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

  • Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored.
  • Trajectory Bellman Residual Minimization (TBRM) is introduced as a value-based method for LLM reasoning, adapting the classical paradigm of Bellman Residual Minimization.
  • TBRM optimizes a single trajectory-level Bellman objective, using the model's own logits as Q-values; it requires no critic, importance-sampling ratios, or clipping, and needs only one rollout per prompt (a minimal sketch follows this list).
  • Experiments on mathematical-reasoning benchmarks show that TBRM outperforms policy-based baselines like PPO and GRPO, indicating that value-based RL could be an efficient alternative for enhancing reasoning capabilities in LLMs.
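The sketch below is a minimal, illustrative take on what a trajectory-level Bellman residual loss can look like for an autoregressive LM. It is not the paper's exact objective; it assumes the token logits are used directly as Q-values, the trajectory receives only a single terminal reward, the soft state value is the logsumexp over the vocabulary, and the per-step residuals are summed into one trajectory-level residual before squaring. The function name and signature are hypothetical.

# Minimal sketch of a trajectory-level Bellman residual loss for an autoregressive
# LM. Assumptions (not taken from the paper): logits serve directly as Q-values,
# only a terminal scalar reward R is given, V(s) = logsumexp over the vocabulary,
# and per-step residuals are summed over the trajectory before squaring.

import torch


def trajectory_bellman_residual_loss(
    logits: torch.Tensor,      # (T, vocab) logits for each generated position, used as Q-values
    actions: torch.Tensor,     # (T,) token ids actually sampled
    terminal_reward: float,    # scalar outcome reward R for the full completion
    gamma: float = 1.0,
) -> torch.Tensor:
    T = logits.shape[0]

    # Q(s_t, a_t): logit of the token that was actually generated at step t.
    q_taken = logits.gather(1, actions.unsqueeze(1)).squeeze(1)      # (T,)

    # Soft next-state value V(s_{t+1}) = logsumexp over the next step's logits;
    # the state after the final token is terminal, so its value is 0.
    v = torch.logsumexp(logits, dim=-1)                              # (T,) values V(s_t)
    v_next = torch.cat([v[1:], v.new_zeros(1)])                      # shift by one step

    # Reward is zero at intermediate steps and R at the last token.
    rewards = q_taken.new_zeros(T)
    rewards[-1] = terminal_reward

    # Sum the per-step Bellman residuals into a single trajectory-level residual,
    # then square it: one objective per rollout, with no separate critic network.
    residual = (q_taken - (rewards + gamma * v_next)).sum()
    return residual.pow(2)

Because the objective is computed from the policy's own logits on its own sampled completion, a single rollout per prompt suffices and no value network is trained, which is what the summary above means by needing no critics, importance-sampling ratios, or clipping.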
