Source: arXiv

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

  • Large language models (LLMs) are vulnerable to safety risks during fine-tuning: even small amounts of fine-tuning data can compromise their safeguards.
  • Perturbations along the alignment direction in fine-tuning preserve model safety, while perturbations along orthogonal directions can rapidly degrade safety.
  • A methodology called AsFT (Anchoring Safety in Fine-Tuning) is proposed to constrain fine-tuning within a narrow safety basin by suppressing updates along harmful directions (a sketch of this idea follows the list).
  • Experiments show that AsFT outperforms Safe LoRA, reducing harmful behavior, improving model performance, and maintaining robustness across various settings.
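
The summary does not give the paper's exact update rule, but the core idea, keeping the part of each update that lies along the alignment direction while damping the rest, can be illustrated briefly. The sketch below assumes the alignment direction is available as a flattened weight-difference vector between an aligned and an unaligned checkpoint; the function name `suppress_orthogonal_update` and the damping factor `beta` are hypothetical, not the paper's API.

```python
import torch

def suppress_orthogonal_update(grad: torch.Tensor,
                               alignment_dir: torch.Tensor,
                               beta: float = 0.1) -> torch.Tensor:
    """Damp the part of a gradient that leaves the safety basin.

    Sketch only: `alignment_dir` is assumed to be the flattened weight
    difference between an aligned and an unaligned model; `beta` is a
    hypothetical damping factor, not a value from the paper.
    """
    d = alignment_dir / alignment_dir.norm()  # unit alignment direction
    g = grad.flatten()
    parallel = (g @ d) * d                    # component along the alignment direction (kept)
    orthogonal = g - parallel                 # potentially harmful component (damped)
    return (parallel + beta * orthogonal).view_as(grad)

# Hypothetical usage: filter each gradient before the optimizer step.
# for p, d in zip(model.parameters(), alignment_dirs):
#     if p.grad is not None:
#         p.grad = suppress_orthogonal_update(p.grad, d)
# optimizer.step()
```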
