<ul><li>Model distillation is crucial for creating smaller language models while maintaining performance, but there are concerns about vulnerability to adversarial bias injection.</li><li>Adversaries can introduce biases into teacher models through data poisoning, which then amplify in student models, leading to biased responses.</li><li>Two propagation modes are identified: Untargeted Propagation affecting multiple tasks and Targeted Propagation focusing on specific tasks.</li><li>The study highlights security vulnerabilities in distilled models and suggests the need for specialized safeguards and mitigation strategies.</li></ul>

Cascading Adversarial Bias from Injection to Distillation in Language Models

Discover more