ConfPO is a preference-learning method for Large Language Models (LLMs) that identifies preference-critical tokens using the training policy's confidence and concentrates optimization on them.
By focusing the learning signal on these most impactful tokens, ConfPO improves alignment quality and mitigates overoptimization compared to prior Direct Alignment Algorithms (DAAs), which apply the preference signal uniformly across all tokens.
ConfPO requires no auxiliary models and no additional compute, making it a simple and lightweight approach.
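To make the selection idea concrete, below is a minimal PyTorch sketch of a DPO-style objective restricted to tokens where the training policy's confidence is low. The helper names (`token_logps`, `select_by_confidence`, `confpo_style_loss`), the below-sequence-mean confidence threshold, and the `beta` default are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: confidence-guided token selection inside a DPO-style loss.
# The selection rule (token log-prob below the sequence mean) is an assumption
# for illustration; ConfPO's actual criterion may differ.
import torch
import torch.nn.functional as F


def token_logps(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token log-probabilities of the labels; feed these into the loss below."""
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (batch, seq)


def select_by_confidence(policy_logps: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Keep tokens whose policy confidence falls below the sequence mean (assumed threshold)."""
    mean_conf = (policy_logps * mask).sum(-1, keepdim=True) / mask.sum(-1, keepdim=True)
    return ((policy_logps < mean_conf) & mask.bool()).float()


def confpo_style_loss(policy_chosen_logps, ref_chosen_logps,
                      policy_rejected_logps, ref_rejected_logps,
                      chosen_mask, rejected_mask, beta=0.1):
    """DPO-style loss computed only over the selected (preference-critical) tokens."""
    sel_c = select_by_confidence(policy_chosen_logps, chosen_mask)
    sel_r = select_by_confidence(policy_rejected_logps, rejected_mask)
    chosen_ratio = ((policy_chosen_logps - ref_chosen_logps) * sel_c).sum(-1)
    rejected_ratio = ((policy_rejected_logps - ref_rejected_logps) * sel_r).sum(-1)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Because the selection mask comes from log-probabilities the policy already computes for the loss, this kind of token filtering adds no forward passes beyond standard DAA training.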
Experimental results show that ConfPO consistently outperforms uniform DAAs across various LLMs, achieving better alignment at no extra computational cost.