Researchers have tested a method for rewriting blocked prompts to text-to-video systems so that they slip past safety filters without altering the prompts' meaning, exposing the frailty of current safeguards.
Closed-source video models generally try to prevent users from generating unwanted content, but determined individuals keep finding ways to coerce these systems into producing restricted material.
A new collaboration between researchers in Singapore and China introduces an optimization-based jailbreak method for text-to-video models that tricks systems such as Kling by subtly rewriting prompts.
The method rewrites prompts so that they circumvent safety filters while keeping the original meaning intact, using an iterative optimization process that balances three objectives; a sketch of the loop appears below.
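The paper's exact scoring functions are not reproduced here, so the following is only a minimal mock-up of what such a loop could look like. Every helper name (`rewrite_prompt`, `passes_filter`, `semantic_similarity`, `output_retains_intent`) is a hypothetical stand-in, and the assumption that the three objectives are filter evasion, semantic preservation, and retention of the original intent in the output is my reading, not a quote from the paper.

```python
import random

# Mock of the iterative rewrite-and-score loop. In the real attack these
# helpers would call an LLM rewriter, the target platform's safety filter,
# a sentence embedder, and a check on the generated video.

def rewrite_prompt(prompt: str) -> str:
    """Ask an LLM to paraphrase the prompt; mocked as a tagged copy."""
    return f"{prompt} [variant-{random.randint(0, 9999)}]"

def passes_filter(prompt: str) -> float:
    """1.0 if the platform's safety filter accepts the prompt, else 0.0."""
    return float("blocked-term" not in prompt)

def semantic_similarity(a: str, b: str) -> float:
    """Crude proxy for embedding similarity between two prompts."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def output_retains_intent(prompt: str) -> float:
    """Placeholder score for whether the generated video keeps the intent."""
    return random.random()

def optimize(original: str, iters: int = 20) -> str:
    """Greedy hill-climb over rewrites, scored on the three objectives."""
    best, best_score = original, float("-inf")
    for _ in range(iters):
        candidate = rewrite_prompt(best)
        score = (passes_filter(candidate)
                 + semantic_similarity(original, candidate)
                 + output_retains_intent(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best
```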
The approach, tested against Pika, Luma, Kling, and the open-source Open-Sora, broke safeguards more reliably than previous methods, highlighting the limits of the safety filters in current text-to-video models.
The study, conducted by eight researchers across several universities, used ChatGPT-4o as the rewriting engine, and the model proved effective at producing prompts that evade detection.
A prompt mutation strategy was added to make the bypass more consistent: the system selects rewritten prompts that remain effective across multiple uses rather than ones that only succeed once, as in the sketch below.
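Continuing the mock-up above (and reusing its hypothetical `rewrite_prompt` and `passes_filter` helpers), one plausible reading of this selection step is:

```python
def consistent_bypass_rate(prompt: str, trials: int = 5) -> float:
    """Fraction of repeated submissions that slip past the filter.
    Real platform filters may respond stochastically, which is why
    a single successful pass is not enough evidence on its own."""
    return sum(passes_filter(prompt) for _ in range(trials)) / trials

def mutate_and_select(prompt: str, n_variants: int = 8) -> str:
    """Generate mutated variants and keep the most reliably effective one."""
    variants = [rewrite_prompt(prompt) for _ in range(n_variants)]
    return max(variants, key=consistent_bypass_rate)
```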
The methodology was designed to preserve the meaning of the original input while bypassing safety filters, and the evaluation tracked both goals: attack success rate on one hand, and semantic alignment between the rewritten and original prompts on the other.
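Neither metric is spelled out in detail here, so treat the following as an illustrative sketch only: `attack_success_rate` is the obvious ratio, while `semantic_alignment` uses a crude surface-level proxy from the standard library where the paper would presumably compare embeddings.

```python
from difflib import SequenceMatcher

def attack_success_rate(outcomes: list[bool]) -> float:
    """Fraction of adversarial prompts that yielded restricted content."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def semantic_alignment(original: str, rewritten: str) -> float:
    """Surface-level similarity proxy; a real evaluation would likely
    compare sentence embeddings of the two prompts instead."""
    return SequenceMatcher(None, original, rewritten).ratio()
```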
Notably, Open-Sora proved especially vulnerable to adversarial prompts, underscoring the need for stronger safety mechanisms in open-source models of this kind.
On both measures, attack success rate and semantic alignment, the method outperformed baseline approaches across the tested text-to-video models.
The study stresses the need for more advanced safety measures in text-to-video models and argues that a method balancing attack success with semantic integrity is a useful stress test, one that can inform the design of safer generation systems.