Large Language Models (LLMs) such as GPT-3 have achieved great success in single-turn tasks like summarization.
However, they struggle with multi-turn tasks like dialogue that require long-term planning.
To address this, researchers have introduced REgressing the RELative FUture (REFUEL), an efficient policy optimization approach for multi-turn reinforcement learning from human feedback (RLHF) in LLMs.
Empirically, REFUEL outperforms state-of-the-art methods such as DPO and REBEL, and it comes with a theoretical guarantee that it can match the performance of any policy covered by the training set.
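At a high level, REFUEL's name points to regressing relative future rewards: comparing two rollouts that share a conversation prefix and fitting the policy's log-probability ratios to the difference in their observed returns. The snippet below is a minimal sketch of that kind of squared-loss regression, not the authors' implementation; the function name, the `eta` scaling parameter, and the tensor layout are illustrative assumptions.

```python
import torch

def relative_future_regression_loss(logp_new_a, logp_old_a,
                                     logp_new_b, logp_old_b,
                                     return_a, return_b, eta=1.0):
    """Sketch of a relative-future regression loss (hypothetical signature).

    Each argument is a 1-D tensor over a batch of rollout pairs (a, b) that
    share the same conversation prefix: log-probabilities of the chosen
    responses under the new and the sampling policy, and the observed
    reward-to-go (return) of each rollout.
    """
    # Log-ratio of the new policy to the sampling policy for each rollout.
    ratio_a = logp_new_a - logp_old_a
    ratio_b = logp_new_b - logp_old_b
    # Regress the scaled difference of log-ratios onto the difference
    # of the two rollouts' returns (the "relative future").
    prediction = (ratio_a - ratio_b) / eta
    target = return_a - return_b
    return ((prediction - target) ** 2).mean()

# Toy usage with random per-rollout log-probabilities and returns.
n = 8
loss = relative_future_regression_loss(
    torch.randn(n), torch.randn(n),
    torch.randn(n), torch.randn(n),
    torch.randn(n), torch.randn(n),
    eta=0.5,
)
```

Because the loss only involves differences between two sampled rollouts, this style of objective avoids fitting an explicit value function, which is part of what makes the approach efficient for multi-turn settings.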