ARTFEED — Contemporary Art Intelligence

New Framework Minimizes Collateral Damage in LLM Activation Steering

ai-technology · 2026-05-06

A new study on arXiv (2605.01167) formalizes, and proposes a way to reduce, collateral damage in activation steering for large language models (LLMs). Activation steering modifies an LLM's behavior by intervening on its internal representations to increase alignment with a target feature direction. Standard methods such as vector addition also cause unintended changes along non-target feature directions, because they implicitly assume the representation space is isotropic. The authors instead cast steering as a constrained optimization: find a new activation that achieves the target alignment while minimizing the expected squared collateral change, weighted by the empirical second-moment matrix of activations. This nonuniform weighting encodes the varying cost of perturbations across feature directions, in contrast with isotropic approaches. The work thus gives a mathematical definition of collateral damage and a principled method to mitigate it.
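The contrast between isotropic vector addition and second-moment-weighted steering can be sketched numerically. The snippet below is illustrative only: the toy data, dimension, target direction, and closed-form solution are our own reconstruction of the described objective (minimize the M-weighted squared change subject to hitting a target alignment), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8  # toy activation dimension
# Synthetic anisotropic "activations" standing in for a model's hidden states.
A = rng.normal(size=(1000, d)) * np.linspace(0.2, 2.0, d)

M = A.T @ A / len(A)            # empirical second-moment matrix of activations
v = np.ones(d) / np.sqrt(d)     # hypothetical target feature direction (unit norm)
a = A[0]                        # one activation to steer
t = 3.0                         # desired alignment: <a', v> = t

# Standard vector addition: step along v itself (implicitly isotropic).
a_iso = a + (t - v @ a) * v

# Anisotropy-aware steering: minimize (a'-a)^T M (a'-a)  s.t.  v^T a' = t.
# Lagrange conditions give a' = a + lam * M^{-1} v, with lam set by the constraint.
Minv_v = np.linalg.solve(M, v)
lam = (t - v @ a) / (v @ Minv_v)
a_opt = a + lam * Minv_v

def collateral(a_new):
    """M-weighted squared change, delta^T M delta (expected squared collateral)."""
    delta = a_new - a
    return delta @ M @ delta

# Both interventions hit the target alignment exactly...
print(np.isclose(v @ a_iso, t), np.isclose(v @ a_opt, t))
# ...but the weighted solution never incurs more collateral cost than vector addition,
# since the isotropic step is itself a feasible point of the optimization.
print(collateral(a_opt) <= collateral(a_iso) + 1e-12)
```

With an isotropic second moment (M proportional to the identity), `Minv_v` is parallel to `v` and the two interventions coincide, which matches the paper's framing of vector addition as the isotropic special case.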

Key facts

  • arXiv paper 2605.01167
  • Activation steering controls LLM behavior by intervening in internal representations
  • Standard vector addition causes collateral damage in non-target feature directions
  • Collateral damage defined as unintended alignment changes
  • Standard methods assume isotropy of non-target features
  • New method models steering as constrained optimization
  • Minimizes expected squared collateral change weighted by second-moment matrix
  • Nonuniform weighting encodes varying perturbation costs
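The constrained problem listed above admits a standard closed form. The notation here (activation a, steering direction v, target alignment t, second-moment matrix M) is our own reconstruction from the summary, not the paper's:

```latex
\min_{a'} \;(a' - a)^\top M\,(a' - a)
\quad \text{s.t.} \quad v^\top a' = t,
\qquad M = \mathbb{E}\!\left[a a^\top\right],
```

whose Lagrange stationarity condition yields

```latex
a^\star = a + \lambda\, M^{-1} v,
\qquad
\lambda = \frac{t - v^\top a}{v^\top M^{-1} v},
```

so that with M = I the update reduces to plain vector addition along v, recovering the isotropic baseline.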

Entities

Institutions

  • arXiv

Sources