Source: Arxiv

Improving LLM Safety Alignment with Dual-Objective Optimization

  • Existing training-time safety alignment techniques for large language models (LLMs) are vulnerable to jailbreak attacks.
  • The direct preference optimization (DPO) method proves suboptimal for refusal learning in LLMs.
  • A new safety alignment approach is proposed that disentangles the DPO objective into robust refusal training and targeted unlearning of harmful knowledge (a minimal illustrative sketch of such a dual objective appears after this list).
  • This approach enhances LLM robustness against a range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks.
  • A reward-based token-level weighting mechanism is introduced to emphasize critical refusal tokens, improving robustness against adversarial exploits (see the second sketch below).
  • Robustness to jailbreak attacks is found to be related to token distribution shifts during training and internal representations of refusal and harmful tokens.
  • The research offers valuable insights for future studies in LLM safety alignment.
  • The code for the proposed approach is available at https://github.com/wicai24/DOOR-Alignment
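
The summary does not give the paper's exact loss functions, so the following is only a minimal sketch of what a dual-objective loss of this kind could look like: one term rewards the refusal response with per-token weights, and a second, saturating term lowers the likelihood of the harmful continuation. The function name, the specific loss forms, and the use of PyTorch are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only -- not the paper's exact objective.
import torch
import torch.nn.functional as F

def dual_objective_loss(refusal_logps: torch.Tensor,
                        harmful_logps: torch.Tensor,
                        token_weights: torch.Tensor,
                        alpha: float = 1.0,
                        beta: float = 1.0) -> torch.Tensor:
    """Combine robust refusal training with targeted unlearning.

    refusal_logps: per-token log-probs of the refusal response to a harmful
                   prompt, shape (batch, seq_len).
    harmful_logps: per-token log-probs of the harmful continuation to be
                   unlearned, shape (batch, seq_len).
    token_weights: reward-derived weights that up-weight critical refusal
                   tokens, same shape as refusal_logps.
    """
    # (1) Robust refusal training: weighted negative log-likelihood of the
    #     refusal response, so critical refusal tokens dominate the gradient.
    refusal_loss = -(token_weights * refusal_logps).sum(dim=-1).mean()

    # (2) Targeted unlearning: a saturating penalty that pushes down the
    #     model's likelihood of the harmful continuation without diverging.
    unlearn_loss = -F.logsigmoid(-harmful_logps.mean(dim=-1)).mean()

    return alpha * refusal_loss + beta * unlearn_loss
```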
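For the reward-based token-level weighting, one plausible construction is to turn per-token reward scores into normalized weights that concentrate on critical refusal tokens. The `reward_scores` input, the softmax form, and the scaling choice below are hypothetical; the paper's actual mechanism may differ.

```python
# Sketch: map per-token reward scores to weights emphasising refusal tokens.
import torch

def refusal_token_weights(reward_scores: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """reward_scores: shape (batch, seq_len), e.g. from a reward model.

    Returns positive weights whose average per example is 1, so a weighted
    loss keeps roughly the same scale as an unweighted one."""
    seq_len = reward_scores.shape[-1]
    # Softmax concentrates mass on high-reward (critical refusal) tokens;
    # rescaling by seq_len keeps the mean weight at 1.
    return torch.softmax(reward_scores / temperature, dim=-1) * seq_len
```

Weights produced this way could be passed as `token_weights` to the dual-objective sketch above.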
