Activation steering offers a way to control Large Language Model (LLM) behavior at inference time without costly fine-tuning.
A lightweight, trainable controller network is introduced to dynamically modulate the intensity of a steering patch across the LLM's layers during generation.
The controller network predicts a global scaling factor and layer-specific weights, so that nuanced, layer-aware interventions are applied primarily to harmful inputs.
Experiments show that this weighted steering controller significantly increases refusal rates compared to the base LLM, offering an efficient method for fine-grained control over LLM behavior.
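The controller described above can be sketched in PyTorch. This is a hypothetical illustration, not the paper's implementation: the module names, the pooled-prompt input, the sigmoid/softmax parameterization of the global scale and layer weights, and all dimensions are assumptions.

```python
# Hypothetical sketch of a weighted steering controller: a small trainable
# network predicts a global scaling factor and per-layer weights that
# modulate a precomputed steering vector across the LLM's layers.
# All names, shapes, and parameterizations are illustrative assumptions.
import torch
import torch.nn as nn


class SteeringController(nn.Module):
    """Predicts a global scale and layer-specific weights for a steering patch."""

    def __init__(self, hidden_dim: int, num_layers: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.ReLU(),
        )
        # One head for the global intensity, one for per-layer allocation.
        self.scale_head = nn.Linear(hidden_dim // 4, 1)
        self.weight_head = nn.Linear(hidden_dim // 4, num_layers)

    def forward(self, pooled_hidden: torch.Tensor):
        h = self.backbone(pooled_hidden)
        scale = torch.sigmoid(self.scale_head(h))        # in (0, 1); near 0 for benign prompts
        weights = torch.softmax(self.weight_head(h), -1)  # distributes intensity across layers
        return scale, weights


def apply_steering(hidden_states, steering_vector, scale, weights):
    """Add the scaled, layer-weighted steering vector to each layer's activations."""
    return [
        h + scale * weights[..., i : i + 1] * steering_vector
        for i, h in enumerate(hidden_states)
    ]


# Toy usage: a 32-layer model with hidden size 512.
controller = SteeringController(hidden_dim=512, num_layers=32)
pooled = torch.randn(1, 512)                 # e.g. a mean-pooled prompt representation
scale, weights = controller(pooled)
hidden = [torch.randn(1, 8, 512) for _ in range(32)]  # per-layer activations
steer = torch.randn(512)                     # precomputed steering (e.g. refusal) direction
steered = apply_steering(hidden, steer, scale, weights)
```

In this sketch the softmax ties the layer weights to a fixed budget, so the controller learns *where* to intervene while the sigmoid-bounded scale learns *how strongly*; the actual method may parameterize these differently.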