Existing training-time safety alignment techniques for large language models (LLMs) are vulnerable to jailbreak attacks.
The standard direct preference optimization (DPO) objective proves suboptimal for refusal learning in LLMs.
A new safety alignment approach is proposed that disentangles the DPO objective into robust refusal training and targeted unlearning of harmful knowledge.
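A minimal sketch of what such a disentangled objective could look like is given below, assuming a supervised refusal term plus a gradient-ascent-style unlearning term on harmful continuations; the function name, argument names, and the `beta` trade-off are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a disentangled refusal-training + unlearning loss.
import torch
import torch.nn.functional as F

def disentangled_alignment_loss(refusal_logits, refusal_labels,
                                harmful_logits, harmful_labels,
                                beta=0.1, ignore_index=-100):
    """Combine supervised refusal training with an unlearning term that
    pushes probability mass away from harmful tokens (illustrative only)."""
    # Refusal training: standard next-token NLL on refusal responses.
    refusal_loss = F.cross_entropy(
        refusal_logits.view(-1, refusal_logits.size(-1)),
        refusal_labels.view(-1),
        ignore_index=ignore_index,
    )
    # Targeted unlearning: maximize the NLL of harmful continuations
    # (i.e., subtract their cross-entropy from the total loss).
    harmful_nll = F.cross_entropy(
        harmful_logits.view(-1, harmful_logits.size(-1)),
        harmful_labels.view(-1),
        ignore_index=ignore_index,
    )
    return refusal_loss - beta * harmful_nll
```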
This approach enhances LLM robustness against a range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks.
A reward-based token-level weighting mechanism is introduced to emphasize critical refusal tokens for improved robustness against adversarial exploits.
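One plausible instantiation of such a weighting scheme, sketched below under the assumption that a per-token reward tensor is available, reweights the per-token negative log-likelihood so that critical refusal tokens dominate the gradient; the function and tensor names are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of reward-based token-level weighting for refusal training.
import torch
import torch.nn.functional as F

def weighted_refusal_loss(logits, labels, token_rewards, ignore_index=-100):
    """Per-token NLL weighted by a reward signal so that critical refusal
    tokens contribute more to the loss (illustrative only)."""
    # Per-token negative log-likelihood, kept unreduced.
    nll = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).view(labels.size())
    # Turn rewards into normalized per-sequence weights; padded positions
    # are masked out before the softmax.
    mask = (labels != ignore_index).float()
    weights = torch.softmax(token_rewards.masked_fill(mask == 0, -1e9), dim=-1)
    # Weighted sum over tokens, averaged over the batch.
    return (weights * nll * mask).sum(dim=-1).mean()
```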
Robustness to jailbreak attacks is found to be linked to token distribution shifts during training and to the internal representations of refusal and harmful tokens.
These findings offer valuable insights for future research on LLM safety alignment.
The code for the proposed approach is available at https://github.com/wicai24/DOOR-Alignment