Reinforcement learning (RL) has been used to enhance the reasoning capabilities of large language models (LLMs), where an LLM generator is guided by a verifier.
Current RL post-training methods for LLMs typically rely on fixed or discriminatively trained verifiers, which are prone to reward hacking and generalize poorly.
To address these issues, the Tango framework uses RL to train an LLM generator and a process-level LLM verifier concurrently, in an interleaved manner.
Tango's generative, RL-trained verifier shows improved robustness and generalization, and the framework achieves state-of-the-art performance on math benchmarks and reasoning tasks.
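To make the interleaved scheme concrete, the sketch below shows one plausible shape of such a loop. It is not the authors' implementation: `StubPolicy`, `outcome_correct`, `rl_update`, and the particular reward mixing are hypothetical placeholders standing in for the LLM policies, a rule-based answer checker, and the policy-gradient (e.g., PPO/GRPO-style) updates a real system would use. The key structure is that the generator's reward draws on the verifier's process-level scores, while the verifier is itself updated with RL rather than kept fixed.

```python
import random


class StubPolicy:
    """Stand-in for an LLM policy; a real system would update model weights."""

    def __init__(self, name: str):
        self.name = name

    def generate(self, prompt: str) -> list[str]:
        # Placeholder: return a fake chain of reasoning split into steps.
        return [f"{self.name} step {i}" for i in range(random.randint(2, 4))]

    def score_steps(self, prompt: str, steps: list[str]) -> list[float]:
        # Placeholder: the verifier assigns a score to each reasoning step.
        return [random.random() for _ in steps]

    def rl_update(self, trajectories, rewards) -> None:
        # Placeholder for a policy-gradient update (e.g., a PPO/GRPO-style step).
        print(f"update {self.name}: mean reward = {sum(rewards) / len(rewards):.3f}")


def outcome_correct(prompt: str, steps: list[str]) -> bool:
    # Placeholder for a rule-based checker that compares the final answer
    # to the ground truth; here it is just random noise.
    return random.random() > 0.5


def interleaved_training(prompts, num_rounds: int = 3, rollouts: int = 4) -> None:
    generator = StubPolicy("generator")
    verifier = StubPolicy("verifier")

    for _ in range(num_rounds):
        # Generator phase: rewards blend the verifier's process-level scores
        # with verifiable final-answer correctness (mixing weights are an
        # assumption of this sketch).
        trajectories, gen_rewards = [], []
        for prompt in prompts:
            for _ in range(rollouts):
                steps = generator.generate(prompt)
                correct = outcome_correct(prompt, steps)
                step_scores = verifier.score_steps(prompt, steps)
                reward = 0.5 * (sum(step_scores) / len(step_scores)) + 0.5 * float(correct)
                trajectories.append((prompt, steps, correct))
                gen_rewards.append(reward)
        generator.rl_update(trajectories, gen_rewards)

        # Verifier phase: the verifier is also trained with RL; here it is
        # rewarded for agreeing with the outcome signal (one plausible choice),
        # rather than being a fixed discriminative scorer.
        ver_rewards = []
        for prompt, steps, correct in trajectories:
            step_scores = verifier.score_steps(prompt, steps)
            predicted_ok = sum(step_scores) / len(step_scores) > 0.5
            ver_rewards.append(float(predicted_ok == correct))
        verifier.rl_update(trajectories, ver_rewards)


if __name__ == "__main__":
    interleaved_training(["solve x^2 = 4", "prove 1 + 1 = 2"])
```

The design point the sketch captures is the alternation itself: because the verifier keeps being updated alongside the generator, it is harder for the generator to exploit a static reward model, which is the reward-hacking failure mode noted above.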