Finetuning large language models (LLMs) introduces critical safety risks, as even a few harmful examples can compromise safety alignment.
Static safety shaping, which updates the model uniformly on harmful and harmless parts of a response, is suboptimal because the safety context can shift within a single example.
The proposed dynamic safety shaping (DSS) framework uses fine-grained safety signals to reinforce learning from the safe segments of a response while suppressing unsafe content.
The Safety Trajectory Assessment of Response (STAR) token-level signal enables shaping to operate dynamically over the training sequence, leading to substantial safety improvements without compromising task capability.
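To make the mechanism concrete, the following is a minimal PyTorch-style sketch of how a token-level safety signal (such as a STAR-like score) could dynamically shape the finetuning loss. The function name `dss_weighted_loss`, the `safety_scores` input, and the `threshold` cutoff are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dss_weighted_loss(logits, labels, safety_scores, threshold=0.5):
    """Hypothetical sketch of a dynamically shaped finetuning loss.

    logits:        (batch, seq_len, vocab) model outputs
    labels:        (batch, seq_len) target token ids
    safety_scores: (batch, seq_len) per-token safety signal in [0, 1],
                   e.g. produced by a STAR-like assessment of the response
    """
    # Standard per-token cross-entropy, kept unreduced so each token
    # can be reweighted by its safety score.
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)

    # Dynamic shaping: reinforce tokens judged safe (high score) and
    # suppress unsafe ones by masking out tokens below the threshold.
    weights = safety_scores * (safety_scores >= threshold).float()
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)
```

Under this reading, the loss reduces to ordinary finetuning when every token is scored safe, and harmful segments contribute little or nothing to the gradient; how the per-token scores are actually computed and scheduled over training is specific to the STAR signal described in the paper.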