Large Language Models (LLMs) are widely used but still vulnerable to jailbreaking attacks despite being aligned using techniques like reinforcement learning from human feedback (RLHF).
Existing adversarial attack methods on LLMs either search directly over discrete tokens or optimize in a continuous relaxation of the space spanned by the model's vocabulary.
A new intrinsic optimization technique, using exponentiated gradient descent with Bregman projection, has been developed to ensure that the optimized relaxed one-hot encodings remain within the probability simplex.
This technique has proven more effective at jailbreaking LLMs than other state-of-the-art methods, as demonstrated on five LLMs and four datasets.
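To make the update concrete, the following is a minimal sketch of exponentiated gradient descent over relaxed one-hot token encodings; the multiplicative update followed by row-wise normalization is the Bregman (KL) projection back onto the probability simplex. The function `loss_fn`, the shapes `seq_len` and `vocab_size`, and the hyperparameters `steps` and `lr` are illustrative placeholders, not the paper's actual objective or settings.

```python
# Illustrative sketch only: `loss_fn` is a hypothetical stand-in for the
# attack objective (e.g., loss of a target response given the relaxed
# adversarial tokens); it is not taken from the original paper's code.
import torch

def exp_grad_attack(loss_fn, seq_len, vocab_size, steps=500, lr=0.1):
    # Start each adversarial position at the uniform distribution over
    # the vocabulary, which already lies on the probability simplex.
    x = torch.full((seq_len, vocab_size), 1.0 / vocab_size, requires_grad=True)

    for _ in range(steps):
        loss = loss_fn(x)                      # objective on the relaxed encoding
        grad, = torch.autograd.grad(loss, x)   # gradient w.r.t. the distributions

        with torch.no_grad():
            # Exponentiated (multiplicative) gradient step; the row-wise
            # normalization is the Bregman projection under the KL
            # divergence, keeping every row on the probability simplex.
            x_new = x * torch.exp(-lr * grad)
            x_new = x_new / x_new.sum(dim=-1, keepdim=True)
            x.copy_(x_new)

    # Discretize by taking the most probable token at each position.
    return x.argmax(dim=-1)
```

Because the exponentiated update is multiplicative and the normalization is exact, no separate clipping or Euclidean projection step is needed to keep the iterates valid probability distributions.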