Existing training-time safety alignment techniques for large language models (LLMs) are vulnerable to jailbreak attacks.
The standard direct preference optimization (DPO) objective proves suboptimal for refusal learning in LLMs.
A new safety alignment approach is proposed that disentangles the DPO objective into robust refusal training and targeted unlearning of harmful knowledge.
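A minimal sketch of what such a disentangled objective could look like is given below, assuming a supervised refusal term plus a gradient-ascent-style unlearning term on harmful continuations; the function name, argument names, and the `beta` trade-off are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a disentangled refusal-training + unlearning loss.
import torch
import torch.nn.functional as F

def disentangled_alignment_loss(refusal_logits, refusal_labels,
                                harmful_logits, harmful_labels,
                                beta=0.1, ignore_index=-100):
    """Combine supervised refusal training with an unlearning term that
    pushes probability mass away from harmful tokens (illustrative only)."""
    # Refusal training: standard next-token NLL on refusal responses.
    refusal_loss = F.cross_entropy(
        refusal_logits.view(-1, refusal_logits.size(-1)),
        refusal_labels.view(-1),
        ignore_index=ignore_index,
    )
    # Targeted unlearning: maximize the NLL of harmful continuations
    # (i.e., subtract their cross-entropy from the total loss).
    harmful_nll = F.cross_entropy(
        harmful_logits.view(-1, harmful_logits.size(-1)),
        harmful_labels.view(-1),
        ignore_index=ignore_index,
    )
    return refusal_loss - beta * harmful_nll
```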
This approach enhances LLM robustness against a range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks.
A reward-based token-level weighting mechanism is introduced to emphasize critical refusal tokens for improved robustness against adversarial exploits.
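One plausible instantiation of such a weighting scheme, sketched below under the assumption that a per-token reward tensor is available, reweights the per-token negative log-likelihood so that critical refusal tokens dominate the gradient; the function and tensor names are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of reward-based token-level weighting for refusal training.
import torch
import torch.nn.functional as F

def weighted_refusal_loss(logits, labels, token_rewards, ignore_index=-100):
    """Per-token NLL weighted by a reward signal so that critical refusal
    tokens contribute more to the loss (illustrative only)."""
    # Per-token negative log-likelihood, kept unreduced.
    nll = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).view(labels.size())
    # Turn rewards into normalized per-sequence weights; padded positions
    # are masked out before the softmax.
    mask = (labels != ignore_index).float()
    weights = torch.softmax(token_rewards.masked_fill(mask == 0, -1e9), dim=-1)
    # Weighted sum over tokens, averaged over the batch.
    return (weights * nll * mask).sum(dim=-1).mean()
```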
Robustness to jailbreak attacks is found to be linked to token distribution shifts during training and to the internal representations of refusal and harmful tokens.
These findings offer valuable insights for future research on LLM safety alignment.
The code for the proposed approach is available at https://github.com/wicai24/DOOR-Alignment