techminis (a naukri.com initiative)

Image Credit: Arxiv

Adversarial Attack on Large Language Models using Exponentiated Gradient Descent

  • Large language models (LLMs) are widely deployed but remain vulnerable to jailbreaking attacks, even after alignment with techniques such as reinforcement learning from human feedback (RLHF).
  • Existing adversarial attacks on LLMs either search over discrete tokens or optimize in the continuous space defined by the model's vocabulary.
  • The paper introduces an intrinsic optimization technique based on exponentiated gradient descent with Bregman projection, which guarantees that the optimized relaxation of the one-hot token encoding always stays within the probability simplex.
  • The technique jailbreaks LLMs more effectively than other state-of-the-art methods, as demonstrated on five LLMs and four datasets.
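The core mechanism the summary describes can be sketched generically. In exponentiated gradient descent, a multiplicative update followed by renormalization is exactly the Bregman (KL) projection onto the probability simplex, so the iterate remains a valid distribution at every step. The toy loss, step size, and iteration count below are illustrative assumptions, not the paper's actual attack objective:

```python
import numpy as np

def egd_step(x, grad, eta=0.1):
    """One exponentiated gradient descent step.

    The multiplicative update plus renormalization is the Bregman
    projection (under KL divergence) onto the probability simplex,
    so x stays non-negative and sums to 1 without any clipping.
    """
    y = x * np.exp(-eta * grad)
    return y / y.sum()

# Toy linear loss <c, x> over a tiny 3-"token" vocabulary (hypothetical):
# minimizing it drives probability mass toward the lowest-cost index.
c = np.array([3.0, 1.0, 2.0])
x = np.ones(3) / 3  # start at the uniform distribution
for _ in range(200):
    x = egd_step(x, c)  # gradient of <c, x> is just c
```

After these updates, `x` concentrates on index 1 (the lowest-cost entry) while remaining on the simplex throughout, which is the property the paper exploits when relaxing one-hot token choices.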

