ConfPO is a preference-learning method for Large Language Models (LLMs) that identifies preference-critical tokens using the training policy's confidence and concentrates optimization on them.
By focusing the learning signal on these most impactful tokens, ConfPO improves alignment quality and mitigates overoptimization compared to prior Direct Alignment Algorithms (DAAs), which apply the preference signal uniformly across all tokens.
ConfPO requires no auxiliary models and no additional compute, making it a simple and lightweight approach.
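To make the selection idea concrete, below is a minimal PyTorch sketch of a DPO-style objective restricted to tokens where the training policy's confidence is low. The helper names (`token_logps`, `select_by_confidence`, `confpo_style_loss`), the below-sequence-mean confidence threshold, and the `beta` default are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: confidence-guided token selection inside a DPO-style loss.
# The selection rule (token log-prob below the sequence mean) is an assumption
# for illustration; ConfPO's actual criterion may differ.
import torch
import torch.nn.functional as F


def token_logps(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token log-probabilities of the labels; feed these into the loss below."""
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (batch, seq)


def select_by_confidence(policy_logps: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Keep tokens whose policy confidence falls below the sequence mean (assumed threshold)."""
    mean_conf = (policy_logps * mask).sum(-1, keepdim=True) / mask.sum(-1, keepdim=True)
    return ((policy_logps < mean_conf) & mask.bool()).float()


def confpo_style_loss(policy_chosen_logps, ref_chosen_logps,
                      policy_rejected_logps, ref_rejected_logps,
                      chosen_mask, rejected_mask, beta=0.1):
    """DPO-style loss computed only over the selected (preference-critical) tokens."""
    sel_c = select_by_confidence(policy_chosen_logps, chosen_mask)
    sel_r = select_by_confidence(policy_rejected_logps, rejected_mask)
    chosen_ratio = ((policy_chosen_logps - ref_chosen_logps) * sel_c).sum(-1)
    rejected_ratio = ((policy_rejected_logps - ref_rejected_logps) * sel_r).sum(-1)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Because the selection mask comes from log-probabilities the policy already computes for the loss, this kind of token filtering adds no forward passes beyond standard DAA training.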
Experimental results show that ConfPO consistently outperforms uniform DAAs across various LLMs, achieving better alignment at no extra computational cost.