A new algorithm named Retrospective Replay-based Reinforcement Learning (RRL) has been proposed to improve reinforcement learning (RL) exploration for large language models (LLMs).
During the early stages of training, LLMs exhibit strong exploratory behavior but are limited in their ability to solve complex problems; as training progresses, their problem-solving ability improves while exploration tends to narrow.
RRL therefore introduces a dynamic replay mechanism that runs throughout training, allowing the later, more capable model to revisit and re-explore promising states identified during those early stages.
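The summary above does not specify how the replay mechanism is implemented, so the following Python sketch only illustrates the general shape of such a scheme under stated assumptions: `ReplayEntry`, `RetrospectiveBuffer`, `choose_rollout_start`, the `score` field, and the `replay_ratio` parameter are hypothetical names invented for illustration, not the paper's actual API.

```python
import random
from dataclasses import dataclass, field

@dataclass
class ReplayEntry:
    """A promising intermediate state recorded earlier in training."""
    prompt: str            # the problem being solved
    partial_solution: str  # the partial reasoning trace to resume from
    score: float           # hypothetical "promise" score, e.g. a partial reward

@dataclass
class RetrospectiveBuffer:
    """Fixed-capacity store of promising states for later re-exploration."""
    capacity: int = 1000
    entries: list[ReplayEntry] = field(default_factory=list)

    def add(self, entry: ReplayEntry) -> None:
        self.entries.append(entry)
        # Evict the lowest-scoring states once capacity is exceeded.
        self.entries.sort(key=lambda e: e.score, reverse=True)
        del self.entries[self.capacity:]

    def sample(self) -> ReplayEntry:
        return random.choice(self.entries)

def choose_rollout_start(buffer: RetrospectiveBuffer,
                         fresh_prompts: list[str],
                         replay_ratio: float = 0.25) -> str:
    """Pick the starting text for the next rollout: with probability
    `replay_ratio` (an assumed hyperparameter), resume from a stored
    promising state; otherwise start a fresh problem from scratch."""
    if buffer.entries and random.random() < replay_ratio:
        entry = buffer.sample()
        return entry.prompt + entry.partial_solution
    return random.choice(fresh_prompts)

# Usage: record promising states found early on, then replay them later.
buffer = RetrospectiveBuffer(capacity=2)
buffer.add(ReplayEntry("Prove n^2 >= n for n >= 1. ", "Case n = 1: ", 0.8))
buffer.add(ReplayEntry("Sum the first 100 integers. ", "Pair the terms: ", 0.6))
start = choose_rollout_start(buffer, ["Solve x + 2 = 5. "])
# `start` would be fed to the policy LLM to continue generation; reward
# computation and the policy-gradient update are omitted from this sketch.
print(start)
```

In this sketch, the key design point is that a replayed rollout resumes from a stored partial solution rather than restarting the problem, so later-stage training can exploit trajectories that early-stage exploration surfaced but could not finish.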
Experimental results show that RRL substantially improves the effectiveness of RL in optimizing LLMs for complex reasoning tasks.