Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations

  • Sparse autoencoders (SAEs) are often used to interpret large language models by mapping their internal activations to human-interpretable concept representations (a minimal sketch of this setup follows the list).
  • Existing evaluations of SAEs have focused on metrics such as the reconstruction-sparsity trade-off, interpretability, and feature disentanglement, but have overlooked robustness to input perturbations.
  • The researchers argue that the robustness of concept representations is critical to the fidelity of concept labeling.
  • Empirical studies show that tiny adversarial input perturbations can manipulate SAE-based interpretations without significantly affecting the outputs of the base language models, suggesting that SAE concept representations are fragile.
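
To make the setup concrete, here is a minimal, illustrative sketch (not taken from the paper) of a sparse autoencoder that maps a model's hidden activations to sparse concept features. The dimensions, the ReLU encoder, and the L1 sparsity penalty are common SAE choices assumed for illustration, not details from the article.

# Minimal SAE sketch: hidden activations -> sparse "concept" features -> reconstruction.
# All hyperparameters below are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> concept features
        self.decoder = nn.Linear(d_features, d_model)   # concept features -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative concept codes
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder(d_model=768, d_features=4096)
acts = torch.randn(1, 768)                              # stand-in for a hidden-layer activation
features, recon = sae(acts)
# Training trades off reconstruction fidelity against sparsity (the trade-off the summary
# mentions); the paper's concern is that a tiny adversarial change to the input can shift
# `features` (the interpretation) while the base model's output stays essentially the same.
loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()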

Read the full paper on arXiv.
