Activation steering offers a way to control Large Language Model (LLM) behavior at inference time without costly fine-tuning.
A lightweight, trainable controller network is introduced to dynamically modulate the intensity of a steering patch across the LLM's layers during generation.
The controller network predicts a global scaling factor and layer-specific weights, so that nuanced, layer-aware interventions are applied primarily to harmful inputs.
Experiments show that this weighted steering controller significantly increases refusal rates compared to the base LLM, offering an efficient method for fine-grained control over LLM behavior.
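The controller described above can be sketched in PyTorch. This is a hypothetical illustration, not the paper's implementation: the module names, the pooled-prompt input, the sigmoid/softmax parameterization of the global scale and layer weights, and all dimensions are assumptions.

```python
# Hypothetical sketch of a weighted steering controller: a small trainable
# network predicts a global scaling factor and per-layer weights that
# modulate a precomputed steering vector across the LLM's layers.
# All names, shapes, and parameterizations are illustrative assumptions.
import torch
import torch.nn as nn


class SteeringController(nn.Module):
    """Predicts a global scale and layer-specific weights for a steering patch."""

    def __init__(self, hidden_dim: int, num_layers: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.ReLU(),
        )
        # One head for the global intensity, one for per-layer allocation.
        self.scale_head = nn.Linear(hidden_dim // 4, 1)
        self.weight_head = nn.Linear(hidden_dim // 4, num_layers)

    def forward(self, pooled_hidden: torch.Tensor):
        h = self.backbone(pooled_hidden)
        scale = torch.sigmoid(self.scale_head(h))        # in (0, 1); near 0 for benign prompts
        weights = torch.softmax(self.weight_head(h), -1)  # distributes intensity across layers
        return scale, weights


def apply_steering(hidden_states, steering_vector, scale, weights):
    """Add the scaled, layer-weighted steering vector to each layer's activations."""
    return [
        h + scale * weights[..., i : i + 1] * steering_vector
        for i, h in enumerate(hidden_states)
    ]


# Toy usage: a 32-layer model with hidden size 512.
controller = SteeringController(hidden_dim=512, num_layers=32)
pooled = torch.randn(1, 512)                 # e.g. a mean-pooled prompt representation
scale, weights = controller(pooled)
hidden = [torch.randn(1, 8, 512) for _ in range(32)]  # per-layer activations
steer = torch.randn(512)                     # precomputed steering (e.g. refusal) direction
steered = apply_steering(hidden, steer, scale, weights)
```

In this sketch the softmax ties the layer weights to a fixed budget, so the controller learns *where* to intervene while the sigmoid-bounded scale learns *how strongly*; the actual method may parameterize these differently.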