Inference-time intervention (ITI) is a method for steering the behavior of large language models (LLMs) without updating model parameters.
Multi-Attribute Targeted Steering (MAT-Steer) is introduced to handle conflicts between attributes in multi-attribute settings by intervening selectively at the token level.
MAT-Steer uses alignment objectives to shift model representations in a way that reduces conflicts between attributes such as helpfulness and toxicity.
MAT-Steer outperforms existing ITI and fine-tuning approaches across question answering and generative tasks.
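
To make the idea of selective, token-level intervention concrete, the following is a minimal sketch and not the MAT-Steer implementation. It assumes each attribute has one steering vector and a sigmoid gate that decides, per token, how much of that vector to add to the hidden state. The function name `steer_tokens`, the dictionary keys `vector`, `gate_w`, and `gate_b`, and the toy numbers are all hypothetical; the summary above does not specify how the gating or the steering vectors are defined or trained.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def steer_tokens(hidden_states, attributes):
    """Return steered copies of per-token hidden states.

    hidden_states: list of token vectors (lists of floats).
    attributes: list of dicts with hypothetical keys
        'vector' (steering direction) and 'gate_w', 'gate_b' (gate parameters).
    Model weights are never touched; only activations are shifted.
    """
    steered = []
    for h in hidden_states:
        new_h = list(h)
        for attr in attributes:
            # Assumed gate: computed from the original token state, so each
            # attribute decides independently whether this token needs a shift.
            g = sigmoid(dot(attr["gate_w"], h) + attr["gate_b"])
            # Add the gated steering vector for this attribute.
            new_h = [x + g * v for x, v in zip(new_h, attr["vector"])]
        steered.append(new_h)
    return steered


# Toy data: two tokens with 2-dimensional hidden states and one attribute.
tokens = [[4.0, 0.0], [-4.0, 0.0]]
attributes = [{"vector": [0.0, 1.0], "gate_w": [2.0, 0.0], "gate_b": 0.0}]
steered = steer_tokens(tokens, attributes)
```

In this toy setup the gate opens almost fully for the first token and stays almost closed for the second, so only the first token is shifted noticeably along the steering direction. With several attributes, the per-attribute shifts simply add up, which is where conflicts between attributes could arise if the steering vectors interfere with one another.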