Post-training is essential for adapting pre-trained language models (PLMs) to human preferences and downstream tasks.
Post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications.
To address the scarcity of labeled data for individual downstream tasks, Disagreement-Aware Confidence Alignment (DACA) is proposed as a novel unsupervised method for optimizing the parameters of post-hoc confidence calibration.
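For reference, the sketch below shows temperature scaling, a standard post-hoc calibration method whose single parameter is the kind of quantity such a calibrator must fit; the supervised NLL objective shown here is the conventional, label-dependent approach rather than DACA's unsupervised procedure, and the function names and synthetic data are illustrative assumptions.

```python
# Illustrative temperature scaling: a post-hoc calibrator with one parameter T
# that rescales logits before the softmax. The conventional fit (shown here)
# needs labeled calibration data, which is the requirement DACA aims to remove.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of the labels under temperature T."""
    probs = softmax(logits, T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Grid-search the temperature that minimizes NLL on a labeled calibration set."""
    return min(grid, key=lambda T: nll(logits, labels, T))

# Tiny synthetic demo: inflated logits mimic an over-confident model,
# so the fitted temperature comes out above 1 and softens the confidences.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=200)
logits = rng.normal(size=(200, 4)) * 4.0
logits[np.arange(200), labels] += 2.0
print(f"fitted temperature: {fit_temperature(logits, labels):.2f}")
```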
DACA selectively uses only agreement examples, i.e., inputs on which the PLM and the PoLM predictions agree, to fit the calibrator, so that disagreement examples do not distort the calibration parameters. Extensive experiments demonstrate its effectiveness, improving the average ECE of LLMs such as GPT-4o by up to 15.08% on common benchmarks.
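The following is a minimal sketch of the selection step and the evaluation metric, under the assumption that "agreement examples" are instances where the two models predict the same label; the ECE computation follows the standard equal-width binning definition, and all array names and signatures are hypothetical rather than DACA's actual implementation.

```python
# Agreement-example selection and calibration evaluation (illustrative).
import numpy as np

def agreement_mask(plm_logits, polm_logits):
    """Boolean mask of examples where the PLM and PoLM argmax predictions agree."""
    return plm_logits.argmax(axis=1) == polm_logits.argmax(axis=1)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE with equal-width bins: bin-weighted |accuracy - mean confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```

In such a pipeline, only the examples selected by agreement_mask would feed the calibrator's fitting step (e.g., a temperature fit as sketched above), while ECE is reported over the full evaluation set.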