Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored.
Trajectory Bellman Residual Minimization (TBRM) is introduced as a value-based method for LLM reasoning, adapting the classical paradigm of Bellman Residual Minimization.
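For reference, one standard form of the classical Bellman Residual Minimization paradigm is to minimize the squared residual of the Bellman optimality equation over sampled transitions $(s, a, r, s')$, with Q-function $Q$ and discount $\gamma$ (notation here is generic, not taken from the paper):

\[
\min_Q \; \mathbb{E}_{(s,a,r,s')}\Big[\big(Q(s,a) - r - \gamma \max_{a'} Q(s',a')\big)^2\Big].
\]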
TBRM optimizes a single trajectory-level Bellman objective using the model's own logits as Q-values, eliminating the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt.
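To make this concrete, below is a minimal PyTorch-style sketch of a trajectory-level Bellman residual loss in which the logits of the sampled tokens stand in for Q-values and a logsumexp over next-step logits stands in for a soft state value. The function name `tbrm_loss`, the terminal-only reward, and the choice to sum per-step residuals over the trajectory before squaring are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def tbrm_loss(logits: torch.Tensor, actions: torch.Tensor,
              reward: float, gamma: float = 1.0) -> torch.Tensor:
    """Illustrative trajectory-level Bellman residual (a sketch, not the paper's loss).

    logits:  (T, vocab) token logits from the current model, treated as Q-values
    actions: (T,) sampled token ids of a single rollout for one prompt
    reward:  scalar terminal reward for the trajectory (e.g. answer correctness)
    """
    # Q(s_t, a_t): logit of the token actually generated at step t
    q_taken = logits.gather(1, actions.unsqueeze(1)).squeeze(1)          # (T,)
    # Assumed soft value of the next state: logsumexp over next-step logits
    v_next = torch.logsumexp(logits, dim=-1)[1:]                          # (T-1,)
    # Sparse reward: zero at every step except the final token
    rewards = torch.zeros_like(q_taken)
    rewards[-1] = reward
    # One-step Bellman targets; the terminal state is given value 0.
    # No stop-gradient: classical BRM differentiates through both Q and target.
    targets = rewards + gamma * torch.cat([v_next, v_next.new_zeros(1)])
    # Sum residuals over the whole trajectory before squaring (assumed here)
    residual = (q_taken - targets).sum()
    return residual.pow(2)

# Usage with dummy tensors for a single rollout
T, V = 8, 32000
logits = torch.randn(T, V, requires_grad=True)
actions = torch.randint(V, (T,))
loss = tbrm_loss(logits, actions, reward=1.0)
loss.backward()
```

Because the residual is built entirely from the current model's own logits, no separate critic network, importance-sampling ratio, or clipping term appears in the loss, and a single rollout per prompt suffices, matching the properties described above.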
Experiments on mathematical-reasoning benchmarks show that TBRM outperforms policy-based baselines such as PPO and GRPO, suggesting that value-based RL could be an efficient alternative for enhancing reasoning capabilities in LLMs.