<ul><li>Large language models (LLMs) are increasingly deployed in high-stakes settings, but their tendency to generate harmful or toxic content remains a major concern.</li><li>The paper proposes a data-centric pretraining framework that addresses this challenge by building safety into the model from the start of pretraining.</li><li>Key contributions include a safety classifier, a large synthetic safety dataset, and Harmfulness-Tag annotations that flag unsafe content in the pretraining data.</li><li>The safety-pretrained models reduce attack success rates while maintaining performance on standard LLM safety benchmarks.</li></ul>

Safety Pretraining: Toward the Next Generation of Safe AI