Source: Arxiv
ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Large Language Model Preference Optimization

  • ConfPO is a preference-learning method for Large Language Models (LLMs) that selects preference-critical tokens using the training policy's own confidence (see the sketch after this list).
  • By optimizing only the most impactful tokens rather than weighting all tokens uniformly, ConfPO improves alignment quality and mitigates overoptimization compared to prior Direct Alignment Algorithms (DAAs).
  • ConfPO requires no auxiliary models or additional compute, making it a simple, lightweight, and model-free approach.
  • Experimental results show that ConfPO consistently outperforms uniform DAAs across various LLMs, achieving better alignment with no extra computational cost.
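
Below is a minimal, hypothetical sketch (in PyTorch) of how confidence-based critical-token selection could plug into a DPO-style DAA objective. The selection rule (thresholding at the per-sequence mean token confidence), the helper names, and the loss form are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: confidence-based token selection inside a DPO-style loss.
# The exact selection criterion and weighting used by ConfPO may differ.
import torch
import torch.nn.functional as F


def token_logprobs(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token log-probabilities of the observed labels.
    logits: (batch, seq, vocab), labels: (batch, seq) -> (batch, seq)."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)


def confidence_mask(policy_logps: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
    """Mark 'preference-critical' tokens: here, tokens whose policy confidence
    (probability of the taken token) falls below the per-sequence mean.
    The mean-confidence threshold is an assumption for illustration."""
    conf = policy_logps.exp() * pad_mask
    mean_conf = conf.sum(-1, keepdim=True) / pad_mask.sum(-1, keepdim=True).clamp(min=1)
    return (conf < mean_conf) & pad_mask.bool()


def confpo_style_loss(
    policy_chosen_logps, policy_rejected_logps,   # (batch, seq)
    ref_chosen_logps, ref_rejected_logps,         # (batch, seq)
    chosen_mask, rejected_mask,                   # (batch, seq) padding masks
    beta: float = 0.1,
) -> torch.Tensor:
    # Restrict the policy/reference log-ratio sums to the selected tokens only;
    # all other tokens are excluded from the preference objective.
    sel_c = confidence_mask(policy_chosen_logps, chosen_mask).float()
    sel_r = confidence_mask(policy_rejected_logps, rejected_mask).float()

    chosen_logratio = ((policy_chosen_logps - ref_chosen_logps) * sel_c).sum(-1)
    rejected_logratio = ((policy_rejected_logps - ref_rejected_logps) * sel_r).sum(-1)

    # Standard DPO-style logistic loss on the selected-token log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Because the selection depends only on the training policy's per-token probabilities, this kind of masking adds no auxiliary model and essentially no extra compute beyond the forward passes a DAA already performs.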
