Large Language Models (LLMs) are widely used but still vulnerable to jailbreaking attacks despite being aligned using techniques like reinforcement learning from human feedback (RLHF).
Existing adversarial attack methods on LLMs either search directly over discrete tokens or optimize in a continuous relaxation of the space spanned by the model's vocabulary.
A new intrinsic optimization technique, using exponentiated gradient descent with Bregman projection, has been developed to ensure that the optimized relaxed one-hot encodings remain within the probability simplex.
This technique has proven more effective at jailbreaking LLMs than other state-of-the-art methods, as demonstrated on five LLMs and four datasets.
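To make the update concrete, the following is a minimal sketch of exponentiated gradient descent over relaxed one-hot token encodings; the multiplicative update followed by row-wise normalization is the Bregman (KL) projection back onto the probability simplex. The function `loss_fn`, the shapes `seq_len` and `vocab_size`, and the hyperparameters `steps` and `lr` are illustrative placeholders, not the paper's actual objective or settings.

```python
# Illustrative sketch only: `loss_fn` is a hypothetical stand-in for the
# attack objective (e.g., loss of a target response given the relaxed
# adversarial tokens); it is not taken from the original paper's code.
import torch

def exp_grad_attack(loss_fn, seq_len, vocab_size, steps=500, lr=0.1):
    # Start each adversarial position at the uniform distribution over
    # the vocabulary, which already lies on the probability simplex.
    x = torch.full((seq_len, vocab_size), 1.0 / vocab_size, requires_grad=True)

    for _ in range(steps):
        loss = loss_fn(x)                      # objective on the relaxed encoding
        grad, = torch.autograd.grad(loss, x)   # gradient w.r.t. the distributions

        with torch.no_grad():
            # Exponentiated (multiplicative) gradient step; the row-wise
            # normalization is the Bregman projection under the KL
            # divergence, keeping every row on the probability simplex.
            x_new = x * torch.exp(-lr * grad)
            x_new = x_new / x_new.sum(dim=-1, keepdim=True)
            x.copy_(x_new)

    # Discretize by taking the most probable token at each position.
    return x.argmax(dim=-1)
```

Because the exponentiated update is multiplicative and the normalization is exact, no separate clipping or Euclidean projection step is needed to keep the iterates valid probability distributions.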