Source: arXiv

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

  • Large Language Models (LLMs) such as GPT-3 have achieved great success in single-turn tasks like summarization.
  • However, they struggle with multi-turn tasks like dialogue that require long-term planning.
  • To address this, the researchers introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach for multi-turn reinforcement learning from human feedback (RLHF) in LLMs; a sketch of the core regression idea appears after this list.
  • REFUEL outperforms state-of-the-art methods like DPO and REBEL, and can match the performance of any policy covered by the training set.

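The summary above does not spell out the training objective, but the title points at the core move: treat multi-turn policy optimization as regressing the *relative* future reward of paired rollouts onto the policy's log-probability ratios, in the style of a REBEL-like square-loss update. The sketch below is a minimal illustration of that idea under those assumptions; the function name, tensor layout, and `eta` step-size parameter are illustrative stand-ins, not the authors' code.

```python
import torch

def relative_future_regression_loss(
    logp_new: torch.Tensor,           # log pi_theta(a | s) for rollout A
    logp_old: torch.Tensor,           # log pi_t(a | s) for rollout A (behavior policy)
    logp_new_alt: torch.Tensor,       # log pi_theta(a' | s) for rollout B
    logp_old_alt: torch.Tensor,       # log pi_t(a' | s) for rollout B
    future_reward: torch.Tensor,      # observed reward-to-go of rollout A
    future_reward_alt: torch.Tensor,  # observed reward-to-go of rollout B
    eta: float = 1.0,                 # assumed step-size hyperparameter
) -> torch.Tensor:
    # Difference of log-probability ratios between two rollouts that
    # branch from the same dialogue state.
    ratio_diff = (logp_new - logp_old) - (logp_new_alt - logp_old_alt)
    # The regression target is the *relative* future: how much better
    # one continuation turned out than the other.
    target = future_reward - future_reward_alt
    # Least-squares regression of the scaled ratio difference onto the
    # relative future reward; minimizing this implies the policy update.
    return ((ratio_diff / eta - target) ** 2).mean()

# Toy usage with random stand-in values for a batch of 8 paired rollouts.
if __name__ == "__main__":
    b = 8
    loss = relative_future_regression_loss(
        torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b),
        torch.randn(b), torch.randn(b),
    )
    print(loss)
```

Regressing a difference of paired outcomes, rather than an absolute value estimate, is the usual appeal of this family of updates: the relative target can be computed directly from sampled rollouts without a separate critic.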