Adversarial threats against LLMs are evolving faster than current defenses can adapt, exposing a critical geometric blind spot in alignment.
ALKALI, a benchmark of 9,000 prompts spanning multiple attack families, is introduced to assess the vulnerability of 21 leading LLMs; the evaluation reveals high Attack Success Rates (ASRs).
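As a rough illustration of the metric, an ASR is commonly computed as the fraction of attack prompts that elicit the targeted behavior. The sketch below assumes each evaluation record is a pair of attack family and success flag; the function name, the record layout, and the family labels are hypothetical and are not taken from ALKALI.

```python
from collections import defaultdict


def attack_success_rates(results):
    """Return the per-family ASR for (attack_family, succeeded) records.

    ASR is taken here as successes / attempts, a common convention
    that is assumed rather than confirmed by the benchmark.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for family, succeeded in results:
        attempts[family] += 1
        successes[family] += int(succeeded)
    return {family: successes[family] / attempts[family] for family in attempts}


# Toy records with made-up family labels, for illustration only.
toy_results = [
    ("jailbreak", True),
    ("jailbreak", False),
    ("injection", True),
    ("injection", True),
]
rates = attack_success_rates(toy_results)
```

Reporting the rate per family rather than as one pooled number keeps a weak spot in a single attack family from being averaged away.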
To address the latent camouflage vulnerability, GRACE (Geometric Representation Aware Contrastive Enhancement) is introduced; it combines preference learning with latent space regularization and reduces ASR by up to 39%.
AVQI, a geometry-aware metric, is introduced to quantify latent alignment failure by measuring cluster separation and compactness, providing insights into how models encode safety internally.
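To make the idea of cluster separation and compactness concrete, the sketch below scores two sets of latent vectors with a simple ratio of centroid distance to average within-cluster spread. This is a generic stand-in and not the AVQI formula, which is not given here; the function names, the synthetic data, and the 16-dimensional latent space are all assumptions.

```python
import numpy as np


def compactness(points):
    """Mean Euclidean distance from each point to its cluster centroid."""
    centroid = points.mean(axis=0)
    return float(np.linalg.norm(points - centroid, axis=1).mean())


def separation(cluster_a, cluster_b):
    """Euclidean distance between the two cluster centroids."""
    return float(np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0)))


def separation_to_compactness(safe, unsafe, eps=1e-12):
    """Illustrative score (NOT AVQI): higher means the clusters are
    farther apart relative to how tightly each one is packed."""
    spread = 0.5 * (compactness(safe) + compactness(unsafe))
    return separation(safe, unsafe) / (spread + eps)


# Synthetic stand-ins for hidden-state embeddings of safe and unsafe prompts.
rng = np.random.default_rng(0)
safe = rng.normal(0.0, 1.0, size=(200, 16))
unsafe_far = rng.normal(5.0, 1.0, size=(200, 16))   # clearly separated
unsafe_near = rng.normal(0.2, 1.0, size=(200, 16))  # nearly overlapping

score_far = separation_to_compactness(safe, unsafe_far)
score_near = separation_to_compactness(safe, unsafe_near)
```

Under these assumptions, a low score corresponds to the failure mode described above: unsafe inputs whose representations sit close to, or inside, the safe cluster.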