Recent studies suggest that unlearning in large language models is often shallow, as removed knowledge can be easily recovered.
Standard unlearning evaluation practices have notable limitations: probes often inject new information into the model during testing, and measured outcomes vary substantially across downstream tasks.
Many evaluations also rely on spurious correlations, which undermines the trustworthiness and interpretability of their results.
To improve unlearning evaluations, two principles are proposed: minimal information injection and downstream task awareness, both validated through targeted experiments.
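As a concrete illustration of the two principles, the sketch below shows one way an evaluation harness might enforce them; the function names, probe format, and scoring logic are hypothetical assumptions for illustration and are not taken from the source.

```python
# Hypothetical sketch: prompt format, helper names, and scoring are illustrative,
# not the paper's actual evaluation protocol.

def leaks_target_information(prompt: str, forgotten_answer: str) -> bool:
    """Flag probes whose wording already contains pieces of the answer the
    model is supposed to have unlearned (violates minimal information injection)."""
    answer_tokens = set(forgotten_answer.lower().split())
    prompt_tokens = set(prompt.lower().split())
    return len(answer_tokens & prompt_tokens) > 0


def evaluate_unlearning(model_answer_fn, probes, downstream_tasks):
    """Score an unlearned model on (1) leak-free forget probes and
    (2) a set of downstream tasks, reported separately per task."""
    # Principle 1: minimal information injection -- discard probes that leak
    # part of the forgotten answer into the model at test time.
    clean_probes = [
        (q, a) for q, a in probes if not leaks_target_information(q, a)
    ]
    forget_recall = sum(
        a.lower() in model_answer_fn(q).lower() for q, a in clean_probes
    ) / max(len(clean_probes), 1)

    # Principle 2: downstream task awareness -- report per-task scores
    # instead of collapsing everything into a single aggregate number.
    task_scores = {
        name: task_fn(model_answer_fn) for name, task_fn in downstream_tasks.items()
    }
    return {"forget_recall": forget_recall, "downstream": task_scores}
```

Reporting per-task scores rather than one aggregate number reflects downstream task awareness, while the leak check is one simple way to operationalize minimal information injection.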