Sparse autoencoders (SAEs) and transcoders are important tools for machine learning interpretability.
Measuring the interpretability of these sparse coders remains challenging, however, because the field lacks consensus benchmarks.
Current evaluation procedures typically generate a single-sentence natural language explanation for each latent and then score that explanation, which complicates assessment: producing and scoring explanations is expensive at the scale of millions of latents, and the result is confounded by the quality of the explainer model.
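As a concrete illustration, an explanation-based pipeline of this kind might look like the sketch below. The helper names, data layout, and stubbed LLM calls are assumptions made for illustration, not any specific library's API:

```python
# Illustrative sketch of an explanation-based evaluation pipeline
# (hypothetical helpers; real pipelines replace the stubs with LLM calls).

from dataclasses import dataclass

@dataclass
class LatentRecord:
    latent_id: int
    top_examples: list[str]             # contexts where the latent fires strongly
    held_out: list[tuple[str, float]]   # (context, activation) pairs for scoring

def draft_explanation(examples: list[str]) -> str:
    """Stand-in for an LLM call that summarizes the examples in one sentence."""
    return "fires on: " + "; ".join(examples[:3])  # stub

def score_explanation(explanation: str, held_out: list[tuple[str, float]]) -> float:
    """Stand-in for a scorer that checks how well the explanation predicts
    the latent's activations on held-out contexts, returning a value in [0, 1]."""
    return 0.0  # stub; real scorers often simulate activations with another LLM

def evaluate_latent(record: LatentRecord) -> float:
    # The measured "interpretability" depends on both stubs above, which is
    # exactly the confound the explanation-free approach aims to remove.
    explanation = draft_explanation(record.top_examples)
    return score_explanation(explanation, record.held_out)
```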
To address this, a new method has been proposed that assesses the interpretability of sparse coders directly, without requiring natural language explanations at any stage.
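The summary above does not spell out the method's mechanics, so the following is purely an illustrative sketch of what one explanation-free evaluation can look like, not necessarily the proposed method itself: an intruder-detection test. If a latent encodes a coherent concept, a judge shown several activating examples plus one non-activating "intruder" should identify the intruder well above chance, and no explanation ever needs to be written down. All names below are hypothetical:

```python
# Illustrative intruder-detection test (a sketch, not the paper's method).
# `judge` is any callable that takes a list of texts and returns the index
# of the one it believes does not belong.

import random

def run_trial(activating: list[str], intruder: str, judge) -> bool:
    """Shuffle activating examples with one intruder and ask the judge to
    point it out; returns True if the judge is correct."""
    options = activating + [intruder]
    random.shuffle(options)
    guess = judge(options)
    return options[guess] == intruder

def intruder_accuracy(trials: list[tuple[list[str], str]], judge) -> float:
    """Fraction of trials on which the intruder was found. Chance level is
    1 / (n_activating + 1), so accuracy above that suggests the latent's
    activating examples share a concept the judge can recognize."""
    hits = sum(run_trial(acts, intr, judge) for acts, intr in trials)
    return hits / len(trials)
```

A design note on why a scheme like this counts as "more direct": the score depends only on whether the latent's activating contexts are distinguishable as a group, so it sidesteps both the cost of generating explanations and the confound of explainer quality.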