Sparse autoencoders (SAEs) are often used to interpret large language models by mapping their internal activations to human-interpretable concept representations.
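To make the setup concrete, the following is a minimal sketch of such an SAE: a linear encoder with a ReLU nonlinearity whose coordinates are read as concepts, a linear decoder, and an L1 sparsity penalty. The dimensions, module names, and loss coefficient are illustrative assumptions, not any specific published implementation.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE over LLM activations (d_model -> d_dict -> d_model)."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations: torch.Tensor):
        # Each coordinate of `features` is interpreted as one concept.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction


def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction fidelity plus an L1 penalty that encourages sparse,
    # concept-like feature activations (the reconstruction-sparsity trade-off).
    recon = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```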
Existing evaluations of SAEs have focused on metrics such as the reconstruction-sparsity trade-off, interpretability, and feature disentanglement, but have overlooked robustness to input perturbations.
Researchers argue that the robustness of concept representations is critical to the fidelity of concept labeling.
Empirical studies show that tiny adversarial input perturbations can manipulate SAE-based interpretations without significantly affecting the outputs of the base language models, suggesting fragility in SAE concept representations.
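One way to read this finding is as an attack objective: perturb the input so that a chosen SAE feature changes while a KL term pins the base model's output distribution to its clean value. The sketch below is a hedged illustration under assumed interfaces, not the authors' actual method: `model` is a Hugging Face-style causal LM accepting `inputs_embeds`, `sae` returns `(features, reconstruction)` as in the sketch above, the SAE is assumed to read the last hidden state, and `target_feature`, step counts, and coefficients are hypothetical.

```python
import torch
import torch.nn.functional as F


def attack_sae_feature(model, sae, embeds, target_feature,
                       steps=50, lr=1e-2, eps=0.05, kl_coeff=10.0):
    """Find a small embedding perturbation that boosts one SAE feature
    while keeping the base model's next-token distribution nearly fixed."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    with torch.no_grad():
        clean_logits = model(inputs_embeds=embeds).logits

    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        out = model(inputs_embeds=embeds + delta, output_hidden_states=True)
        acts = out.hidden_states[-1]  # assumed site the SAE was trained on
        features, _ = sae(acts)

        # Push the chosen concept feature up; penalize drift in the
        # model's output distribution relative to the clean input.
        feature_gain = features[..., target_feature].mean()
        kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                      F.softmax(clean_logits, dim=-1),
                      reduction="batchmean")
        loss = -feature_gain + kl_coeff * kl

        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the perturbation tiny
    return (embeds + delta).detach()
```

A small `eps` bound and a large `kl_coeff` correspond to the reported regime: the interpretation (which features fire) moves, while the language model's outputs barely change.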