Large language models (LLMs) can be prompted in specific styles, including in jailbreak queries, but the safety impact of these style patterns remains unclear.
A study evaluating 32 LLMs across seven jailbreak benchmarks found that malicious queries carrying style patterns raised the attack success rate (ASR) for nearly all models.
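As a rough illustration of the metric, ASR is the fraction of malicious queries for which a model produces a harmful (non-refusing) response. The sketch below assumes a hypothetical `is_harmful` judge function and is not the study's evaluation code.

```python
# Minimal sketch of an attack-success-rate (ASR) computation.
# `is_harmful` is a hypothetical judge function (e.g., a classifier or an
# LLM-as-judge); this is an illustration, not the study's evaluation pipeline.
from typing import Callable, Sequence


def attack_success_rate(
    responses: Sequence[str],
    is_harmful: Callable[[str], bool],
) -> float:
    """Return the fraction of responses judged harmful, i.e. successful attacks."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if is_harmful(r)) / len(responses)
```

Comparing this quantity on styled versus unstyled variants of the same malicious queries gives the ASR inflation discussed next.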
This ASR inflation correlated with both the length of the style patterns and the attention the LLMs placed on them.
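One simple way to quantify such a relationship (a sketch over assumed data, not the study's analysis) is a Pearson correlation between each style pattern's length and the ASR lift it induces:

```python
# Pearson correlation between style-pattern length and ASR lift.
# All numbers below are placeholders for illustration, not results from the study.
from statistics import correlation  # available in Python 3.10+

pattern_lengths = [12, 35, 58, 80, 120]      # hypothetical pattern lengths in tokens
asr_lift = [0.03, 0.07, 0.10, 0.14, 0.21]    # hypothetical ASR(styled) - ASR(plain)

r = correlation(pattern_lengths, asr_lift)
print(f"Pearson r between pattern length and ASR lift: {r:.2f}")
```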
The study also showed that fine-tuning LLMs on data in specific styles made them more vulnerable to jailbreaks in those same styles. To mitigate these risks, a defense strategy called SafeStyle was proposed, which consistently outperformed baselines in maintaining LLM safety.