Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. This paper explores data-driven bottlenecks that limit RLHF performance scaling, focusing on reward hacking and decreasing response diversity. We introduce a hybrid reward system that combines reasoning task verifiers (RTV) with a generative reward model (GenRM) to mitigate reward hacking, and propose Pre-PPO, a novel prompt-selection method that maintains response diversity and enhances learning effectiveness.
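The hybrid reward idea can be illustrated with a minimal sketch: verifiable reasoning prompts are scored by a rule-based verifier (RTV), while all other prompts fall back to a reward-model score (GenRM). The function names, the toy scoring rules, and the routing flag below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a hybrid reward system (assumed names and logic):
# reasoning prompts are scored by a rule-based verifier (RTV stand-in),
# all other prompts by a generative reward model (GenRM stand-in).

def rtv_score(prompt: str, response: str) -> float:
    """Toy verifier: exact-match check on an arithmetic prompt like '2+3=?'."""
    expected = str(eval(prompt.rstrip("=?")))  # compute ground-truth answer
    return 1.0 if response.strip() == expected else 0.0

def genrm_score(prompt: str, response: str) -> float:
    """Toy placeholder preference score in [0, 1] (not a real GenRM)."""
    return min(1.0, len(response) / 100.0)

def hybrid_reward(prompt: str, response: str, is_reasoning: bool) -> float:
    """Route reasoning tasks to the verifier, everything else to the RM."""
    if is_reasoning:
        return rtv_score(prompt, response)
    return genrm_score(prompt, response)

# A verifiable reward is hard to hack: only the correct answer scores 1.0.
print(hybrid_reward("2+3=?", "5", is_reasoning=True))  # 1.0
print(hybrid_reward("2+3=?", "6", is_reasoning=True))  # 0.0
```

Because the verifier's reward depends only on correctness, the policy cannot inflate it with superficial response features, which is the reward-hacking failure mode the hybrid design targets.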