GFRIEND (Generative Few-shot Reward Inference through EfficieNt DPO) is a framework for efficiently training reward models in Reinforcement Learning from Human Feedback (RLHF).
The framework introduces data augmentation and expansion techniques that allow generative reward models to be trained effectively on small preference datasets.
Preference refinement, Chain-of-Thought (CoT) sampling, perplexity-based scoring, and Multi-level Direct Preference Optimization (M-DPO) are key components of this framework.
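For concreteness, the sketch below illustrates one way the perplexity-based scoring and multi-level pair construction could fit together: candidate CoT judgments are sampled for a prompt, scored by the perplexity the model assigns to them, and ranked into preference pairs with a quality margin that a multi-level DPO loss could weight. The model name, prompt handling, and the exact pairing and margin scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the GFRIEND reference code): score sampled
# CoT judgments by perplexity and bucket them into graded preference pairs.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-base-llm"  # placeholder; the paper does not fix a specific model here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def judgment_perplexity(prompt: str, judgment: str) -> float:
    """Perplexity the model assigns to a sampled CoT judgment, given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + judgment, return_tensors="pt").input_ids
    labels = full_ids.clone()
    # Mask the prompt tokens so only the judgment contributes to the loss
    # (boundary tokenization is approximate; fine for a sketch).
    labels[:, :prompt_len] = -100
    with torch.no_grad():
        loss = model(input_ids=full_ids, labels=labels).loss  # mean NLL over judgment tokens
    return math.exp(loss.item())


def build_multilevel_pairs(prompt: str, judgments: list[str]):
    """Rank judgments (lower perplexity = preferred) and emit (chosen, rejected, margin)
    triples; the margin is one plausible way to grade pairs for a multi-level DPO loss."""
    scored = sorted((judgment_perplexity(prompt, j), j) for j in judgments)
    pairs = []
    for i in range(len(scored)):
        for k in range(i + 1, len(scored)):
            (ppl_better, better), (ppl_worse, worse) = scored[i], scored[k]
            pairs.append((better, worse, ppl_worse - ppl_better))
    return pairs
```

The margin-weighted pairing is only one reading of "multi-level"; the key point it illustrates is that perplexity turns a handful of sampled judgments into many graded training pairs, which is where the data expansion comes from.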
Experimental results show that the proposed method enhances data efficiency and model performance, enabling few-shot trained reward models to perform comparably to those trained on large-scale datasets.