Model distillation is crucial for creating smaller language models while maintaining performance, but there are concerns about vulnerability to adversarial bias injection.
Adversaries can introduce biases into teacher models through data poisoning, which then amplify in student models, leading to biased responses.
Two propagation modes are identified: Untargeted Propagation affecting multiple tasks and Targeted Propagation focusing on specific tasks.
The study highlights security vulnerabilities in distilled models and suggests the need for specialized safeguards and mitigation strategies.