Source: Arxiv

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

  • A recent study finds that some machine unlearning methods are susceptible to simple prompt attacks.
  • Eight unlearning techniques were evaluated across different model families, revealing vulnerabilities in ELM.
  • Methods such as RMU and TAR showed robust unlearning.
  • Specific prompt attacks, such as prepending Hindi filler text, recovered up to 57.3% accuracy from ELM models (a minimal sketch follows this list).
  • Logit analysis confirmed that unlearned models generally do not hide knowledge by altering output formatting.
  • The findings challenge existing assumptions about the effectiveness of unlearning methods.
  • Better evaluation frameworks are needed to distinguish true knowledge removal from superficial output suppression.
  • The authors have made their evaluation framework publicly available for assessing how prompting techniques recover supposedly unlearned knowledge.
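
To make the filler-text attack and the logit check concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: "gpt2" stands in for an unlearned checkpoint such as ELM, the Hindi filler string and the toy question are placeholders rather than the paper's benchmark, and the scoring is a generic next-token-logit comparison, not the authors' released framework.

```python
# Minimal sketch of a filler-text prompt attack plus the logit check
# described above. All specifics are illustrative assumptions: "gpt2"
# stands in for an unlearned model such as ELM, the Hindi filler is
# arbitrary, and the question is a toy item, not a benchmark question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder for an unlearned checkpoint
FILLER = "यह केवल भराव पाठ है। " * 10  # hypothetical Hindi filler prefix ("This is only filler text.")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def answer_logits(question, choices, prefix=""):
    """Score each multiple-choice letter (A, B, ...) by its next-token logit."""
    prompt = prefix + question + "\n"
    prompt += "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    prompt += "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    letter_ids = [tokenizer.encode(f" {chr(65 + i)}")[0] for i in range(len(choices))]
    return [next_token_logits[t].item() for t in letter_ids]

question = "Placeholder question about supposedly unlearned material?"
choices = ["choice one", "choice two", "choice three", "choice four"]

plain = answer_logits(question, choices)
attacked = answer_logits(question, choices, prefix=FILLER)

# If the attack works, the top-scoring answer flips (or the logit gap
# shrinks) once the filler prefix is added; if the logits are unchanged,
# the apparent forgetting is not just a formatting artifact, which is
# what the study's logit analysis checks.
print("plain   :", plain)
print("attacked:", attacked)
```

A real evaluation would average accuracy over a full question set per model and attack, which is how a single recovered-accuracy figure such as the 57.3% reported for ELM would be obtained.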
