<ul data-eligibleForWebStory="false">
<li>Researchers have identified limitations in the reward models currently used in reinforcement learning from human feedback (RLHF).</li>
<li>To address these limitations, they introduced SynPref-40M, a large-scale preference dataset.</li>
<li>The data was curated with a two-stage pipeline that combines human annotation with AI-driven scaling.</li>
<li>Their Skywork-Reward-V2 suite of eight reward models shows state-of-the-art performance across various benchmarks.</li>