Large language models (LLMs) are vulnerable to safety risks during fine-tuning: even small amounts of fine-tuning data can compromise their built-in safeguards.
During fine-tuning, weight perturbations along the alignment direction (the weight difference between the aligned and unaligned model) preserve model safety, while perturbations along directions orthogonal to it can rapidly degrade safety.
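To make this observation concrete, the sketch below constructs two equally sized weight perturbations, one along the alignment direction and one orthogonal to it. The tensors `w_base`, `w_aligned`, and the 64x64 layer size are placeholder stand-ins for the unaligned and safety-aligned checkpoints of a real model, not values from the paper.

```python
import torch

torch.manual_seed(0)

def unit(v: torch.Tensor) -> torch.Tensor:
    """Scale a vector to unit norm."""
    return v / v.norm()

# Placeholder weights for a single layer; in practice w_base and w_aligned
# come from the unaligned and safety-aligned checkpoints of the same model.
w_base = torch.randn(64, 64)
w_aligned = w_base + 0.1 * torch.randn(64, 64)

# Alignment direction: the normalized, flattened weight difference.
d_align = unit((w_aligned - w_base).flatten())

eps = 0.05
p_par = eps * d_align                             # perturbation along the alignment direction
r = torch.randn_like(d_align)
p_orth = eps * unit(r - (r @ d_align) * d_align)  # equal-sized orthogonal perturbation

# Both perturbations have the same magnitude, but only the second one leaves
# the alignment direction (cosine ~1 vs ~0 with d_align).
print(p_par.norm().item(), p_orth.norm().item())
print((unit(p_par) @ d_align).item(), (unit(p_orth) @ d_align).item())
```

The reported finding is that adding the first kind of perturbation leaves safety behavior largely intact, while perturbations of the second kind quickly push the model out of the safety basin.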
A method called AsFT (Anchoring Safety in Fine-Tuning) is proposed to constrain fine-tuning within this narrow safety basin by suppressing parameter updates along the harmful (orthogonal) directions.
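A minimal sketch of the kind of constraint this describes, assuming the alignment direction is the per-layer weight difference between the aligned and unaligned checkpoints and that suppression is implemented as a penalty on the component of the fine-tuning update orthogonal to that direction; `orthogonal_penalty`, `lora_A`, `lora_B`, `d_align`, and `lambda_reg` are hypothetical names, not AsFT's actual implementation.

```python
import torch

def orthogonal_penalty(delta_w: torch.Tensor, d_align: torch.Tensor) -> torch.Tensor:
    """Squared norm of the part of a weight update that falls outside
    the alignment direction, i.e. in the harmful (orthogonal) subspace."""
    v = d_align.flatten()
    v = v / v.norm()
    d = delta_w.flatten()
    return (d - (d @ v) * v).pow(2).sum()

# Hypothetical single-layer setup: d_align stands in for the weight difference
# between the aligned and unaligned checkpoints, lora_A / lora_B for the
# trainable low-rank factors, and lambda_reg for the penalty weight.
d_align = torch.randn(64, 64)
lora_A = torch.randn(8, 64, requires_grad=True)
lora_B = torch.randn(64, 8, requires_grad=True)
lambda_reg = 0.1

delta_w = lora_B @ lora_A                      # effective fine-tuning update
task_loss = torch.tensor(0.0)                  # placeholder for the usual task loss
loss = task_loss + lambda_reg * orthogonal_penalty(delta_w, d_align)
loss.backward()                                # gradients now push the update back
                                               # toward the alignment direction
```

Because the penalty targets only the orthogonal component, the task gradient remains free to move the weights along the alignment direction, which is what keeps optimization anchored inside the safety basin.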
Experiments show that AsFT outperforms Safe LoRA, reducing harmful behavior, improving model performance, and maintaining robustness across various settings.