Multimodal Large Language Models (MLLMs) remain vulnerable to jailbreak attacks that induce them to produce harmful content, posing safety risks despite safety-alignment efforts.
A new method, FC-Attack, uses auto-generated flowcharts containing partially harmful information to trick MLLMs into supplying the remaining harmful details.
FC-Attack first fine-tunes a pre-trained model on benign datasets to obtain a step-description generator, then uses it to transform harmful queries into step descriptions that are rendered as flowcharts.
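As a concrete illustration of this stage, the sketch below uses Hugging Face transformers to decompose a query into step descriptions; the checkpoint path and prompt template are hypothetical stand-ins, not the paper's released generator.

```python
# Hypothetical sketch of the step-description stage: a causal LM
# fine-tuned on benign (query, step-list) pairs turns a query into
# short numbered steps. MODEL_ID and the prompt template are
# illustrative assumptions, not the paper's artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/step-description-generator"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_steps(query: str, max_new_tokens: int = 256) -> list[str]:
    """Decompose a query into one step description per line."""
    prompt = (
        f"Decompose the following task into short numbered steps.\n"
        f"Task: {query}\nSteps:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return [line.strip() for line in text.splitlines() if line.strip()]
```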
The flowcharts come in three forms (vertical, horizontal, and S-shaped) and are combined with a benign text prompt to execute the attack on MLLMs, achieving high success rates.
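A minimal rendering sketch with Pillow might look as follows; the box geometry, the two-column snake used for the S-shape, the font path, and the benign prompt wording are all illustrative assumptions rather than the paper's exact renderer.

```python
# Minimal Pillow sketch of the flowchart-rendering stage. Layouts,
# dimensions, and prompt wording are illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont

def render_flowchart(steps: list[str], layout: str = "vertical",
                     font_path: str = "DejaVuSans.ttf") -> Image.Image:
    font = ImageFont.truetype(font_path, 16)
    w, h, gap = 300, 70, 40  # box width/height and spacing
    if layout == "vertical":
        pos = [(gap, gap + i * (h + gap)) for i in range(len(steps))]
    elif layout == "horizontal":
        pos = [(gap + i * (w + gap), gap) for i in range(len(steps))]
    else:  # "s-shape": a two-column snake that reverses direction each row
        pos = []
        for i in range(len(steps)):
            row, col = divmod(i, 2)
            if row % 2:
                col = 1 - col  # odd rows run right-to-left
            pos.append((gap + col * (w + gap), gap + row * (h + gap)))
    canvas = Image.new("RGB",
                       (max(x for x, _ in pos) + w + gap,
                        max(y for _, y in pos) + h + gap), "white")
    draw = ImageDraw.Draw(canvas)
    for (x0, y0), (x1, y1) in zip(pos, pos[1:]):  # connectors between boxes
        draw.line([x0 + w // 2, y0 + h // 2, x1 + w // 2, y1 + h // 2],
                  fill="black", width=2)
    for (x, y), step in zip(pos, steps):  # boxes drawn on top of connectors
        draw.rectangle([x, y, x + w, y + h], fill="white",
                       outline="black", width=2)
        draw.text((x + 8, y + 8), step, fill="black", font=font)
    return canvas

# The image is then paired with a benign-sounding text prompt, e.g.:
BENIGN_PROMPT = "Please explain how to carry out each step in this flowchart in detail."
```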
Evaluations on AdvBench show that FC-Attack achieves success rates of up to 96% via images and up to 78% via videos across various MLLMs.
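For reference, the success rate here is the usual attack success rate (ASR): the fraction of queries whose responses a judge deems harmful. A schematic version, with the judge left as a placeholder:

```python
# Schematic ASR computation; is_harmful() stands in for whatever
# harmfulness judge (human or model-based) the evaluation uses.
from typing import Callable

def attack_success_rate(responses: list[str],
                        is_harmful: Callable[[str], bool]) -> float:
    """Fraction of responses the judge flags as harmful."""
    return sum(is_harmful(r) for r in responses) / len(responses)
```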
Factors affecting attack performance, such as the number of steps and the font style used in the flowcharts, are also investigated; changing the font style further improves success rates.
For example, altering the font style raises FC-Attack's success rate on Claude-3.5 from 4% to 28%.
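Under the same assumptions as the rendering sketch above, such a font-style ablation could simply re-render the flowcharts with different font files (the ones listed are illustrative) and re-run the attack:

```python
# Hypothetical font-style ablation reusing render_flowchart() from the
# sketch above; the font files and placeholder steps are illustrative.
steps = ["Step 1: ...", "Step 2: ...", "Step 3: ..."]
for font_path in ["DejaVuSans.ttf", "DejaVuSans-Bold.ttf",
                  "DejaVuSerif-Italic.ttf"]:
    img = render_flowchart(steps, layout="vertical", font_path=font_path)
    img.save(f"flowchart_{font_path.removesuffix('.ttf')}.png")
```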
Several defense mechanisms, including AdaShield, help mitigate the attack; however, they may come at the cost of reduced utility.
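As a rough illustration of a prompt-based defense in this spirit (a simplification, not AdaShield's actual adaptive procedure), a defensive instruction can be prepended to every multimodal query; query_mllm below is a placeholder for the model call:

```python
# Sketch of a prompt-based defense in the spirit of AdaShield: prepend
# a defensive instruction so the model inspects the image before
# answering. The wording and the query_mllm callable are placeholders.
DEFENSE_PREFIX = (
    "Before responding, examine the attached image carefully. If it "
    "describes or depicts a harmful, illegal, or dangerous activity, "
    "refuse and reply: 'I am sorry, but I cannot assist with that.'\n\n"
)

def defended_query(query_mllm, image, user_prompt: str) -> str:
    """Wrap an MLLM call with a static defensive prefix."""
    return query_mllm(image=image, prompt=DEFENSE_PREFIX + user_prompt)
```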