Effective and reliable control over large language model (LLM) behavior is a significant challenge.
Existing activation steering methods lack the precision and interpretability needed to reliably influence model outputs.
Feature Guided Activation Additions (FGAA) is a novel activation steering method that achieves stronger steering effects while preserving the coherence of model outputs.
FGAA outperforms existing steering methods, namely Contrastive Activation Addition (CAA), SAE decoder steering, and SAE-TS, on steering tasks across a range of models.
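As general context (this is not the FGAA-specific construction, which the text above does not detail), activation steering methods work by adding a scaled steering vector to a model's hidden activations during the forward pass. A minimal NumPy sketch of this generic activation-addition mechanism, using a hypothetical one-layer toy model in place of an LLM layer:

```python
import numpy as np

def toy_layer(x, W):
    """Toy stand-in for a transformer layer: a linear map plus nonlinearity."""
    return np.tanh(x @ W)

def forward(x, W, steering_vector=None, alpha=0.0):
    """Run the toy layer, optionally adding a scaled steering vector
    to its activations (the core of activation-addition steering)."""
    h = toy_layer(x, W)
    if steering_vector is not None:
        h = h + alpha * steering_vector  # shift the hidden state along the steering direction
    return h

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))
x = rng.normal(size=(1, d))
v = rng.normal(size=(d,))
v = v / np.linalg.norm(v)  # unit-norm steering direction

plain = forward(x, W)
steered = forward(x, W, steering_vector=v, alpha=2.0)
print(np.allclose(plain, steered))  # False: steering shifts the activations
```

In real LLM steering the vector is injected into a chosen layer's residual stream (e.g. via a forward hook), and methods differ mainly in how the steering vector is constructed; per the text above, FGAA's contribution is a feature-guided construction that steers more effectively without degrading output coherence.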