Cal-DPO is a new algorithm proposed for aligning large language models (LLMs) with human preference data.
It addresses a key limitation of contrastive preference optimization: because such objectives optimize only the relative gap between the implicit rewards of chosen and rejected responses, the learned rewards are not anchored to any absolute scale. Cal-DPO calibrates the implicit rewards so that they are comparable with the ground-truth rewards.
Cal-DPO has appealing theoretical properties and significantly improves upon off-the-shelf methods in aligning LLMs with the given preferences.
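
To make the calibration idea concrete, below is a minimal PyTorch sketch of what such a calibrated DPO-style loss could look like. The function name, the ±1/(2β) calibration targets, and the unweighted sum of the contrastive and calibration terms are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def calibrated_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of a DPO-style loss with an added reward-calibration term."""
    # Implicit rewards: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard contrastive (DPO) term: only the reward margin matters here.
    contrastive_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Calibration term (assumed targets of +-1/(2*beta)): pins the implicit
    # rewards to an absolute scale instead of only widening their gap.
    target = 1.0 / (2.0 * beta)
    calibration_loss = (chosen_rewards - target) ** 2 + (rejected_rewards + target) ** 2

    return (contrastive_loss + calibration_loss).mean()
```

The inputs are per-example summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; without the calibration term, the objective could be satisfied even while the chosen response's implicit reward drifts downward.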