Sparse Autoencoders (SAEs) are explored as a lightweight, interpretable alternative for bug detection in Java functions, targeting software vulnerabilities such as buffer overflows and SQL injection. SAEs are proposed to address the challenges that the complexity and opacity of Large Language Models (LLMs) pose for vulnerability detection and secure code generation.
Evaluation shows that SAE-derived features enable bug detection with an F1 score of up to 89%, outperforming fine-tuned transformer encoder baselines.
This study provides empirical evidence that SAEs can detect software bugs directly from the internal representations of pretrained LLMs without requiring fine-tuning or task-specific supervision.
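To make the recipe concrete, the sketch below illustrates the general pipeline such a study implies: read hidden activations from a frozen pretrained model, encode them with an SAE, and fit a lightweight probe on the resulting sparse features. This is a minimal, hedged sketch, not the paper's actual setup: the model name, layer index, SAE dictionary size, the randomly initialized encoder weights, and the toy Java snippets are all placeholder assumptions; in practice the SAE encoder would be pretrained on the LLM's activations and the probe fit on a real labeled bug corpus.

```python
# Sketch only: frozen LLM activations -> SAE features -> lightweight probe.
# MODEL_NAME, LAYER, D_SAE, the random SAE weights, and the toy examples
# below are illustrative assumptions, not the paper's configuration.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "microsoft/codebert-base"  # placeholder code model with hidden states
LAYER = 8                               # placeholder: which layer's activations to read
D_SAE = 4096                            # placeholder SAE dictionary size

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

d_model = model.config.hidden_size
# In a real pipeline W_enc/b_enc come from an SAE trained on the model's
# activations; random weights here only keep the sketch self-contained.
W_enc = torch.randn(d_model, D_SAE) / d_model**0.5
b_enc = torch.zeros(D_SAE)

@torch.no_grad()
def sae_features(java_source: str) -> torch.Tensor:
    """Mean-pooled sparse features for one Java function (no fine-tuning)."""
    inputs = tokenizer(java_source, return_tensors="pt", truncation=True)
    hidden = model(**inputs).hidden_states[LAYER][0]  # (tokens, d_model)
    feats = torch.relu(hidden @ W_enc + b_enc)        # sparse SAE activations
    return feats.mean(dim=0)                          # pool over tokens

# Toy labeled corpus: 1 = buggy, 0 = correct (illustrative strings only).
functions = [
    ("int get(int[] a, int i) { return a[i + 1]; }", 1),  # off-by-one read
    ("int get(int[] a, int i) { return a[i]; }", 0),
]
X = torch.stack([src_feats for src_feats in
                 (sae_features(src) for src, _ in functions)]).numpy()
y = [label for _, label in functions]

# The LLM stays frozen throughout; only this small probe is trained.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict(X))
```

The key design point the sketch mirrors is that all gradient-based training is confined to the small probe (and, in the real pipeline, the SAE), so the pretrained LLM itself is never fine-tuned for the task.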