GFRIEND (Generative Few-shot Reward Inference through EfficieNt DPO) is a framework for efficiently training reward models in Reinforcement Learning from Human Feedback (RLHF).
The framework introduces data augmentation and expansion techniques that allow generative reward models to be trained effectively on small preference datasets.
Preference refinement, Chain-of-Thought (CoT) sampling, perplexity-based scoring, and Multi-level Direct Preference Optimization (M-DPO) are key components of this framework.
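For concreteness, the sketch below illustrates one way the perplexity-based scoring and multi-level pair construction could fit together: candidate CoT judgments are sampled for a prompt, scored by the perplexity the model assigns to them, and ranked into preference pairs with a quality margin that a multi-level DPO loss could weight. The model name, prompt handling, and the exact pairing and margin scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the GFRIEND reference code): score sampled
# CoT judgments by perplexity and bucket them into graded preference pairs.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-base-llm"  # placeholder; the paper does not fix a specific model here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def judgment_perplexity(prompt: str, judgment: str) -> float:
    """Perplexity the model assigns to a sampled CoT judgment, given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + judgment, return_tensors="pt").input_ids
    labels = full_ids.clone()
    # Mask the prompt tokens so only the judgment contributes to the loss
    # (boundary tokenization is approximate; fine for a sketch).
    labels[:, :prompt_len] = -100
    with torch.no_grad():
        loss = model(input_ids=full_ids, labels=labels).loss  # mean NLL over judgment tokens
    return math.exp(loss.item())


def build_multilevel_pairs(prompt: str, judgments: list[str]):
    """Rank judgments (lower perplexity = preferred) and emit (chosen, rejected, margin)
    triples; the margin is one plausible way to grade pairs for a multi-level DPO loss."""
    scored = sorted((judgment_perplexity(prompt, j), j) for j in judgments)
    pairs = []
    for i in range(len(scored)):
        for k in range(i + 1, len(scored)):
            (ppl_better, better), (ppl_worse, worse) = scored[i], scored[k]
            pairs.append((better, worse, ppl_worse - ppl_better))
    return pairs
```

The margin-weighted pairing is only one reading of "multi-level"; the key point it illustrates is that perplexity turns a handful of sampled judgments into many graded training pairs, which is where the data expansion comes from.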
Experimental results show that the proposed method enhances data efficiency and model performance, enabling few-shot trained reward models to perform comparably to those trained on large-scale datasets.