Reinforcement Learning from Human Feedback (RLHF) is a widely used method for controlling language model outputs, but it suffers from high computational costs and training instability.
Value-guided decoding offers a cost-effective alternative that steers outputs without retraining the model; however, its effectiveness hinges on accurate estimation of the optimal value function.
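To make the idea of value-guided decoding concrete, the sketch below shows one common formulation (an illustrative assumption, not this paper's exact implementation): the frozen base model's next-token logits are re-weighted by a learned value function, so the guided policy is proportional to pi_base(y_t | context) * exp(V(context, y_t) / beta). The function names, tensor shapes, and the parameter beta are placeholders.

```python
import torch
import torch.nn.functional as F

def value_guided_step(base_logits: torch.Tensor,
                      value_scores: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """Combine frozen base-model logits with per-token value estimates.

    base_logits:  [vocab_size] next-token logits from the base language model.
    value_scores: [vocab_size] estimated values V(context, y_t) for each candidate token.
    beta:         temperature controlling how strongly the value function steers decoding.
    """
    # softmax(base_logits + V/beta) is proportional to pi_base * exp(V/beta)
    guided_logits = base_logits + value_scores / beta
    return F.softmax(guided_logits, dim=-1)

# Toy usage: sample the next token from the guided distribution.
vocab_size = 8
base_logits = torch.randn(vocab_size)
value_scores = torch.randn(vocab_size)
probs = value_guided_step(base_logits, value_scores, beta=0.5)
next_token = torch.multinomial(probs, num_samples=1)
```

Because only the decoding distribution is modified, the base model's weights stay untouched; the quality of control then rests entirely on how well `value_scores` approximates the optimal value function.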
The proposed Iterative Value Function Optimization framework addresses these limitations with two components, Monte Carlo Value Estimation and Iterative On-Policy Optimization, enabling efficient and effective control of language model outputs.
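The following sketch illustrates how these two components could fit together, under stated assumptions rather than as the paper's actual training code: `reward_fn`, `fit_value_model`, `make_guided_policy`, and the `policy.sample` interface are hypothetical stand-ins for whatever reward model, value network, and decoding policy are used. Values are estimated by averaging rewards over Monte Carlo rollouts, and the value model is refit on trajectories drawn from the current value-guided policy at each iteration.

```python
from typing import Callable, List, Tuple
import statistics

def monte_carlo_value(prefix: str,
                      sample_continuations: Callable[[str, int], List[str]],
                      reward_fn: Callable[[str], float],
                      num_rollouts: int = 8) -> float:
    """Estimate V(prefix) as the average reward of sampled completions."""
    completions = sample_continuations(prefix, num_rollouts)
    return statistics.mean(reward_fn(prefix + c) for c in completions)

def iterative_on_policy_optimization(prompts: List[str],
                                     base_policy,
                                     reward_fn: Callable[[str], float],
                                     fit_value_model: Callable[[List[Tuple[str, float]]], object],
                                     make_guided_policy: Callable[[object], object],
                                     num_iterations: int = 3):
    """Alternate between estimating values on trajectories from the current
    guided policy and refitting the value model, so later iterations train on
    data closer to the policy actually used at decoding time."""
    policy = base_policy
    value_model = None
    for _ in range(num_iterations):
        dataset = []
        for prompt in prompts:
            sampler = lambda prefix, n: [policy.sample(prefix) for _ in range(n)]
            v = monte_carlo_value(prompt, sampler, reward_fn)
            dataset.append((prompt, v))
        value_model = fit_value_model(dataset)     # regress V onto Monte Carlo returns
        policy = make_guided_policy(value_model)   # decode with the updated value model
    return value_model
```

The intuition behind the iterative loop is that a value function fit only to base-policy trajectories is evaluated off-policy once decoding is guided; re-collecting rollouts from the guided policy narrows that mismatch over successive iterations.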