
Image Credit: Arxiv

When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment

  • Large language models (LLMs) can be prompted in specific writing styles, including in jailbreak queries, but how these style patterns affect safety has been unclear.
  • A study evaluating 32 LLMs across seven jailbreak benchmarks found that malicious queries carrying style patterns increased the attack success rate (ASR) for almost all models.
  • ASR inflation correlated with the length of the style patterns and with the attention the LLMs placed on them (see the sketch after this list).
  • The study also showed that fine-tuning LLMs on data written in a particular style made them more vulnerable to jailbreaks in that same style. The authors propose SafeStyle, a defense strategy that mitigates these risks and consistently outperformed baselines in maintaining LLM safety.
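
The second and third points hinge on attack success rate (ASR) and its relationship to style-pattern length. The sketch below is not the paper's evaluation code; it only illustrates, with hypothetical field names and toy data, how per-group ASR and a length-ASR correlation might be computed.

```python
# Minimal sketch (assumptions, not the paper's pipeline): compute attack
# success rate (ASR) per style-pattern length and check whether ASR
# inflation tracks pattern length.
from statistics import correlation  # Python 3.10+

# Hypothetical evaluation records: style-pattern length in tokens and
# whether the jailbreak attempt succeeded. Field names are illustrative.
records = [
    {"style_len": 0,  "success": False},
    {"style_len": 0,  "success": False},
    {"style_len": 12, "success": True},
    {"style_len": 12, "success": False},
    {"style_len": 40, "success": True},
    {"style_len": 40, "success": True},
]

def asr(rows):
    """Attack success rate: fraction of malicious queries the model complied with."""
    return sum(r["success"] for r in rows) / len(rows)

# Group queries by style-pattern length and compute ASR for each group.
by_len = {}
for r in records:
    by_len.setdefault(r["style_len"], []).append(r)

lengths = sorted(by_len)
asrs = [asr(by_len[n]) for n in lengths]
print({n: round(a, 2) for n, a in zip(lengths, asrs)})

# Pearson correlation between style-pattern length and group-level ASR.
print("length-ASR correlation:", round(correlation(lengths, asrs), 2))
```

In a real evaluation, `success` would come from a safety judge's verdict on each model response rather than hand-labeled toy data.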
