Sparse autoencoders (SAEs) are often used to interpret large language models by mapping their internal activations to human-interpretable concept representations.
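To make the setup concrete, the following is a minimal sketch of such an SAE: a linear encoder with a ReLU nonlinearity whose coordinates are read as concepts, a linear decoder, and an L1 sparsity penalty. The dimensions, module names, and loss coefficient are illustrative assumptions, not any specific published implementation.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE over LLM activations (d_model -> d_dict -> d_model)."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations: torch.Tensor):
        # Each coordinate of `features` is interpreted as one concept.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction


def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction fidelity plus an L1 penalty that encourages sparse,
    # concept-like feature activations (the reconstruction-sparsity trade-off).
    recon = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```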
Existing evaluations of SAEs have focused on metrics such as the reconstruction-sparsity trade-off, interpretability, and feature disentanglement, but have overlooked robustness to input perturbations.
Researchers argue that the robustness of concept representations is critical to the fidelity of concept labeling.
Empirical studies show that tiny adversarial input perturbations can manipulate SAE-based interpretations without significantly affecting the outputs of the base language models, suggesting fragility in SAE concept representations.
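One way to read this finding is as an attack objective: perturb the input so that a chosen SAE feature changes while a KL term pins the base model's output distribution to its clean value. The sketch below is a hedged illustration under assumed interfaces, not the authors' actual method: `model` is a Hugging Face-style causal LM accepting `inputs_embeds`, `sae` returns `(features, reconstruction)` as in the sketch above, the SAE is assumed to read the last hidden state, and `target_feature`, step counts, and coefficients are hypothetical.

```python
import torch
import torch.nn.functional as F


def attack_sae_feature(model, sae, embeds, target_feature,
                       steps=50, lr=1e-2, eps=0.05, kl_coeff=10.0):
    """Find a small embedding perturbation that boosts one SAE feature
    while keeping the base model's next-token distribution nearly fixed."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    with torch.no_grad():
        clean_logits = model(inputs_embeds=embeds).logits

    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        out = model(inputs_embeds=embeds + delta, output_hidden_states=True)
        acts = out.hidden_states[-1]  # assumed site the SAE was trained on
        features, _ = sae(acts)

        # Push the chosen concept feature up; penalize drift in the
        # model's output distribution relative to the clean input.
        feature_gain = features[..., target_feature].mean()
        kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                      F.softmax(clean_logits, dim=-1),
                      reduction="batchmean")
        loss = -feature_gain + kl_coeff * kl

        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the perturbation tiny
    return (embeds + delta).detach()
```

A small `eps` bound and a large `kl_coeff` correspond to the reported regime: the interpretation (which features fire) moves, while the language model's outputs barely change.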