Finetuning large language models (LLMs) introduces critical safety risks, as even a few harmful examples can compromise safety alignment.
Static safety shaping, which updates the model uniformly on harmful and harmless parts of a response, is suboptimal because the safety context can shift within a single example.
The proposed dynamic safety shaping (DSS) framework uses fine-grained safety signals to reinforce learning from the safe segments of a response while suppressing unsafe content.
The Safety Trajectory Assessment of Response (STAR) token-level signal enables shaping to operate dynamically over the training sequence, leading to substantial safety improvements without compromising task capability.
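To make the mechanism concrete, the following is a minimal PyTorch-style sketch of how a token-level safety signal (such as a STAR-like score) could dynamically shape the finetuning loss. The function name `dss_weighted_loss`, the `safety_scores` input, and the `threshold` cutoff are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dss_weighted_loss(logits, labels, safety_scores, threshold=0.5):
    """Hypothetical sketch of a dynamically shaped finetuning loss.

    logits:        (batch, seq_len, vocab) model outputs
    labels:        (batch, seq_len) target token ids
    safety_scores: (batch, seq_len) per-token safety signal in [0, 1],
                   e.g. produced by a STAR-like assessment of the response
    """
    # Standard per-token cross-entropy, kept unreduced so each token
    # can be reweighted by its safety score.
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)

    # Dynamic shaping: reinforce tokens judged safe (high score) and
    # suppress unsafe ones by masking out tokens below the threshold.
    weights = safety_scores * (safety_scores >= threshold).float()
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)
```

Under this reading, the loss reduces to ordinary finetuning when every token is scored safe, and harmful segments contribute little or nothing to the gradient; how the per-token scores are actually computed and scheduled over training is specific to the STAR signal described in the paper.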