<ul><li>Reinforcement Learning from Human Feedback (RLHF) is a popular method for controlling language model outputs, but it is computationally expensive and prone to training instability.</li><li>Value-guided decoding is a cheaper alternative that controls outputs without retraining the model.</li><li>Its effectiveness, however, depends on accurately estimating the optimal value function.</li><li>The proposed Iterative Value Function Optimization framework targets this estimation problem with two components, Monte Carlo Value Estimation and Iterative On-Policy Optimization, giving efficient and effective control of language models.</li></ul>
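
To make the two ideas in the summary concrete, here is a minimal toy sketch in Python. It is not the framework's implementation. It assumes a common formulation of value-guided decoding in which the base model's next-token log-probabilities are shifted by a scaled value estimate, and it treats Monte Carlo value estimation as the average reward over sampled rollouts. The function names, the `beta` parameter, and the `rollout_reward` callback are all hypothetical and chosen only for illustration.

```python
import math
import random


def mc_value_estimate(rollout_reward, state, num_rollouts=8, rng=None):
    """Estimate the value of `state` as the mean reward of sampled rollouts.

    `rollout_reward(state, rng)` is a hypothetical callback that completes
    one rollout from `state` and returns its scalar reward.
    """
    rng = rng or random.Random(0)
    total = sum(rollout_reward(state, rng) for _ in range(num_rollouts))
    return total / num_rollouts


def value_guided_distribution(base_logprobs, values, beta=1.0):
    """Reweight next-token probabilities: p(tok) ∝ p_base(tok) * exp(beta * V(tok)).

    Assumed formulation: `base_logprobs` maps token -> log-probability under
    the frozen base model, `values` maps token -> estimated value of choosing
    that token, and `beta` controls how strongly the value steers decoding.
    """
    scores = {tok: lp + beta * values[tok] for tok, lp in base_logprobs.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    norm = sum(math.exp(s - max_score) for s in scores.values())
    return {tok: math.exp(s - max_score) / norm for tok, s in scores.items()}


if __name__ == "__main__":
    # Toy example: two candidate tokens that the base model finds equally likely.
    base = {"a": math.log(0.5), "b": math.log(0.5)}
    # Pretend reward: continuing with "a" tends to score higher than "b".
    reward = lambda state, rng: (1.0 if state == "a" else 0.0) + rng.gauss(0, 0.1)
    values = {tok: mc_value_estimate(reward, tok, num_rollouts=32) for tok in base}
    print(value_guided_distribution(base, values, beta=2.0))
```

With `beta=0` the guided distribution reduces to the base model's distribution, and larger `beta` shifts more probability mass toward tokens with higher estimated value. The base model's weights are never updated, which is why this style of control avoids retraining.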

Iterative Value Function Optimization for Guided Decoding
