Reinforcement learning from human feedback (RLHF) is crucial for aligning large language models (LLMs) with human values during post-training.
Group relative policy optimization (GRPO) is an effective approach for RLHF, but its training efficiency remains a challenge.
Research shows that increasing the reward variance of the initial policy model accelerates RLHF training.
Building on this insight, a reward adjustment model that increases reward variance is integrated into the GRPO algorithm, yielding GRPO with reward variance increase (GRPOVI), which significantly improves RLHF training efficiency.
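To make the idea concrete, the following is a minimal sketch, not the paper's implementation, of how a reward adjustment that increases within-group reward variance could be applied before GRPO's group-relative advantage computation. The sigmoid-sharpening adjustment, the function names, and the temperature `tau` are illustrative assumptions; the actual GRPOVI reward adjustment model is defined in the paper itself.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: z-score the rewards of a group of responses
    sampled for the same prompt, so each response is scored relative to
    its group rather than against a learned value baseline."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def variance_increasing_adjustment(rewards: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Hypothetical reward adjustment (a stand-in, not the paper's model):
    pass rewards through a steep sigmoid centred at the group mean, which
    pushes scores in [0, 1] toward the extremes and typically increases
    the within-group reward variance."""
    centred = (rewards - rewards.mean()) / tau
    return 1.0 / (1.0 + np.exp(-centred))

# One group of G = 4 responses sampled for a single prompt,
# with raw reward-model scores in [0, 1].
raw = np.array([0.10, 0.40, 0.45, 0.90])
adjusted = variance_increasing_adjustment(raw)

print("variance before:", raw.var())       # ~0.082
print("variance after: ", adjusted.var())  # ~0.120 (increased)
# Adjusted advantages feed the GRPO policy-gradient update as usual.
print("advantages:", group_relative_advantages(adjusted))
```

A nonlinear adjustment is used in this sketch because a purely affine rescaling of the group's rewards would be cancelled by GRPO's per-group z-score normalization and leave the advantages unchanged.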