The paper examines machine learning safety challenges introduced by multimodal large language models, focusing on Audio-Language Models (ALMs).
It investigates audio jailbreaks targeting ALMs, demonstrating the first universal jailbreaks in the audio modality, which bypass alignment mechanisms and remain effective under simulated real-world conditions.
The research reveals that these adversarial perturbations encode imperceptible toxic speech, indicating that embedding linguistic features within audio signals can elicit toxic outputs.
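To make this class of attack concrete, the sketch below shows how a single "universal" adversarial perturbation might be optimized against an audio-language model with a projected-gradient-style loop. This is a minimal illustration, not the paper's method: the toy model, the target token, and all hyperparameters are hypothetical placeholders.

```python
# Minimal sketch of a universal adversarial audio perturbation (PGD-style).
# ToyALM, target_ids, and the hyperparameters are illustrative placeholders.
import torch

class ToyALM(torch.nn.Module):
    """Stand-in for an audio-language model: maps raw audio to token logits."""
    def __init__(self, vocab_size=32, hidden=16):
        super().__init__()
        self.encoder = torch.nn.Conv1d(1, hidden, kernel_size=400, stride=160)
        self.head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, audio):                      # audio: (batch, samples)
        feats = self.encoder(audio.unsqueeze(1))   # (batch, hidden, frames)
        return self.head(feats.mean(dim=2))        # (batch, vocab_size)

def universal_perturbation(model, prompts, target_ids, eps=0.01, steps=200, lr=1e-3):
    """Optimize one shared perturbation that pushes every prompt toward target_ids."""
    delta = torch.zeros(prompts.shape[1], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model(prompts + delta)            # same delta applied to all prompts
        loss = torch.nn.functional.cross_entropy(logits, target_ids)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                      # keep the perturbation imperceptibly small
            delta.clamp_(-eps, eps)
    return delta.detach()

if __name__ == "__main__":
    model = ToyALM()
    prompts = torch.randn(8, 16000)                # 8 one-second clips at 16 kHz
    target_ids = torch.full((8,), 3)               # hypothetical "unsafe response" token
    delta = universal_perturbation(model, prompts, target_ids)
    print("perturbation L_inf norm:", delta.abs().max().item())
```

In practice, robustness to real-world conditions would additionally require optimizing through audio transformations (e.g., simulated room acoustics or noise), which this sketch omits.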
The study highlights the importance of understanding interactions between modalities in multimodal models and provides insights to improve defenses against adversarial audio attacks.