<ul data-eligibleForWebStory="false"><li>Researchers have identified limitations in the reward models currently used for reinforcement learning from human feedback (RLHF).</li><li>To address these limitations, they introduced SynPref-40M, a large-scale preference dataset.</li><li>The data was curated with a two-stage pipeline that combines human annotation with AI to scale the curation process.</li><li>Their Skywork-Reward-V2 suite of eight reward models achieves state-of-the-art performance across various benchmarks.</li></ul>

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
