Large language models (LLMs) are vulnerable to safety risks during fine-tuning: even small amounts of fine-tuning data can compromise their built-in safeguards.
During fine-tuning, weight perturbations along the alignment direction (the weight difference between the aligned and unaligned model) preserve model safety, while perturbations along directions orthogonal to it can rapidly degrade safety.
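To make this observation concrete, the sketch below constructs two equally sized weight perturbations, one along the alignment direction and one orthogonal to it. The tensors `w_base`, `w_aligned`, and the 64x64 layer size are placeholder stand-ins for the unaligned and safety-aligned checkpoints of a real model, not values from the paper.

```python
import torch

torch.manual_seed(0)

def unit(v: torch.Tensor) -> torch.Tensor:
    """Scale a vector to unit norm."""
    return v / v.norm()

# Placeholder weights for a single layer; in practice w_base and w_aligned
# come from the unaligned and safety-aligned checkpoints of the same model.
w_base = torch.randn(64, 64)
w_aligned = w_base + 0.1 * torch.randn(64, 64)

# Alignment direction: the normalized, flattened weight difference.
d_align = unit((w_aligned - w_base).flatten())

eps = 0.05
p_par = eps * d_align                             # perturbation along the alignment direction
r = torch.randn_like(d_align)
p_orth = eps * unit(r - (r @ d_align) * d_align)  # equal-sized orthogonal perturbation

# Both perturbations have the same magnitude, but only the second one leaves
# the alignment direction (cosine ~1 vs ~0 with d_align).
print(p_par.norm().item(), p_orth.norm().item())
print((unit(p_par) @ d_align).item(), (unit(p_orth) @ d_align).item())
```

The reported finding is that adding the first kind of perturbation leaves safety behavior largely intact, while perturbations of the second kind quickly push the model out of the safety basin.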
A method called AsFT (Anchoring Safety in Fine-Tuning) is proposed to constrain fine-tuning within this narrow safety basin by suppressing parameter updates along the harmful (orthogonal) directions.
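A minimal sketch of the kind of constraint this describes, assuming the alignment direction is the per-layer weight difference between the aligned and unaligned checkpoints and that suppression is implemented as a penalty on the component of the fine-tuning update orthogonal to that direction; `orthogonal_penalty`, `lora_A`, `lora_B`, `d_align`, and `lambda_reg` are hypothetical names, not AsFT's actual implementation.

```python
import torch

def orthogonal_penalty(delta_w: torch.Tensor, d_align: torch.Tensor) -> torch.Tensor:
    """Squared norm of the part of a weight update that falls outside
    the alignment direction, i.e. in the harmful (orthogonal) subspace."""
    v = d_align.flatten()
    v = v / v.norm()
    d = delta_w.flatten()
    return (d - (d @ v) * v).pow(2).sum()

# Hypothetical single-layer setup: d_align stands in for the weight difference
# between the aligned and unaligned checkpoints, lora_A / lora_B for the
# trainable low-rank factors, and lambda_reg for the penalty weight.
d_align = torch.randn(64, 64)
lora_A = torch.randn(8, 64, requires_grad=True)
lora_B = torch.randn(64, 8, requires_grad=True)
lambda_reg = 0.1

delta_w = lora_B @ lora_A                      # effective fine-tuning update
task_loss = torch.tensor(0.0)                  # placeholder for the usual task loss
loss = task_loss + lambda_reg * orthogonal_penalty(delta_w, d_align)
loss.backward()                                # gradients now push the update back
                                               # toward the alignment direction
```

Because the penalty targets only the orthogonal component, the task gradient remains free to move the weights along the alignment direction, which is what keeps optimization anchored inside the safety basin.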
Experiments show that AsFT outperforms Safe LoRA, reducing harmful behavior, improving model performance, and maintaining robustness across various settings.