Arxiv


AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI)

  • Adversarial threats against LLMs are evolving faster than current defenses can adapt, exposing a critical geometric blind spot in alignment.
  • ALKALI is a benchmark of 9,000 prompts spanning various attack families. It is used to assess the vulnerability of 21 leading LLMs, and the evaluation shows high Attack Success Rates (ASRs).
  • To address the vulnerability of latent camouflage, the paper introduces GRACE (Geometric Representation-Aware Contrastive Enhancement), which reduces ASR by up to 39% through preference learning and latent space regularization.
  • AVQI, a geometry-aware metric, quantifies latent alignment failure by measuring cluster separation and compactness, offering insight into how models encode safety internally.
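
The two quantities named in the points above can be illustrated with a small sketch. This summary does not give the paper's AVQI formula, so the code below is only a stand-in under stated assumptions: ASR is taken to be the fraction of adversarial prompts that succeed, and the geometry score is a simple hypothetical ratio of centroid separation to within-cluster spread over toy 2-D points standing in for latent representations. The function names and data are illustrative and do not come from the paper.

```python
import math


def attack_success_rate(outcomes):
    """Fraction of adversarial prompts that elicited an unsafe response.

    `outcomes` is a sequence of booleans (True = attack succeeded).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(1 for o in outcomes if o) / len(outcomes)


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def mean_distance_to(points, center):
    """Average Euclidean distance from each point to `center` (compactness)."""
    return sum(math.dist(p, center) for p in points) / len(points)


def separation_compactness_ratio(safe, unsafe):
    """Hypothetical geometry score, NOT the paper's AVQI.

    Distance between the two cluster centroids divided by the average
    within-cluster spread. Higher values mean the safe and unsafe
    clusters are farther apart relative to how tight they are.
    """
    c_safe, c_unsafe = centroid(safe), centroid(unsafe)
    separation = math.dist(c_safe, c_unsafe)
    spread = (mean_distance_to(safe, c_safe)
              + mean_distance_to(unsafe, c_unsafe)) / 2
    return separation / spread if spread > 0 else math.inf


# Toy data only: two of four attacks succeed, and two small clusters.
asr = attack_success_rate([True, False, True, False])
score = separation_compactness_ratio(
    safe=[(0.0, 0.0), (0.0, 2.0)],
    unsafe=[(4.0, 0.0), (4.0, 2.0)],
)
print(f"ASR = {asr:.2f}, separation/compactness = {score:.2f}")
```

A score of this kind falls as the unsafe representations drift toward the safe cluster, which is one plausible way to read "latent alignment failure". The actual metric may weight or normalise these terms differently.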
