New Framework Extends LLM Interventions to Non-Linear Features
A new framework for intervening on large language models (LLMs) goes beyond the linear representation hypothesis, enabling manipulation of features encoded along non-linear manifolds rather than as single directions in activation space. The method, introduced in a paper on arXiv, includes a learning procedure for intervening on implicit features, i.e., features that leave no direct signature in the model's output. Validated on refusal bypass steering, the approach steers models more precisely than linear baselines by targeting a non-linear feature that governs refusal. A minimal sketch of the linear baseline it improves on appears below.
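For contrast, here is a minimal sketch of the kind of linear intervention the paper treats as a baseline: under the Linear Representation Hypothesis, a feature is a direction in activation space, and steering adds a scaled copy of that direction to a hidden state. The toy model, layer choice, vector, and strength below are illustrative stand-ins, not the paper's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Toy stand-in for a slice of a transformer's residual stream.
block = nn.Sequential(
    nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
)

# Hypothetical feature direction; in practice it is often estimated as a
# difference of mean activations over contrasting prompt sets
# (e.g., refusing vs. complying responses).
v = torch.randn(d_model)
v = v / v.norm()
alpha = 4.0  # steering strength

def linear_steering_hook(module, inputs, output):
    # LRH-style intervention: shift the activation along the direction v.
    return output + alpha * v

# Install the hook on the first layer and run a forward pass.
handle = block[0].register_forward_hook(linear_steering_hook)
h = torch.randn(1, d_model)
steered = block(h)
handle.remove()
print(steered.shape)  # torch.Size([1, 64])
```

The key limitation the paper points at is visible here: the intervention is a single fixed translation, which cannot follow a feature whose geometry is curved.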
Key facts
- Intervention is a widely used method for understanding LLM internal representations.
- Existing intervention methods are limited to linear edits grounded in the Linear Representation Hypothesis, which treats features as directions in activation space.
- The new framework extends intervention to non-linearly represented features (a hedged sketch follows this list).
- The framework includes a learning procedure for intervening on implicit features, those that lack direct output signatures.
- Validation was performed on refusal bypass steering.
- The method steers models more precisely than linear baselines.
- The intervention targets a non-linear feature governing refusal.
- The paper is available on arXiv.
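To make the non-linear idea concrete, below is a heavily hedged sketch under assumptions of my own: the feature lives on a learned curved manifold, an encoder/decoder pair gives coordinates on it, intervening shifts one coordinate and decodes back, and, because an implicit feature has no direct output signature, the map is trained through the behavioral effect of intervening. The `ManifoldMap` class, the single-coordinate shift, and the frozen-probe loss are all illustrative assumptions, not the paper's actual parameterization or objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_feat = 64, 4

class ManifoldMap(nn.Module):
    """Hypothetical encoder/decoder giving non-linear feature coordinates."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_model, 32), nn.Tanh(),
                                 nn.Linear(32, d_feat))
        self.dec = nn.Sequential(nn.Linear(d_feat, 32), nn.Tanh(),
                                 nn.Linear(32, d_model))

    def intervene(self, h, delta):
        z = self.enc(h)                 # coordinates on the learned manifold
        shift = torch.zeros_like(z)
        shift[..., 0] = delta           # move along one feature coordinate
        # Keep the reconstruction residual so information in h that the
        # manifold does not capture passes through unchanged.
        residual = h - self.dec(z)
        return self.dec(z + shift) + residual

# Because an implicit feature has no direct output signature, train the
# map by the *effect* of intervening: here a frozen linear probe stands
# in for a differentiable score of the target behavior (an assumption).
m = ManifoldMap()
probe = nn.Linear(d_model, 1)
for p in probe.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(m.parameters(), lr=1e-3)
for step in range(200):
    h = torch.randn(8, d_model)          # stand-in hidden states
    h_steered = m.intervene(h, delta=2.0)
    # Push steered states toward a high probe score while staying close
    # to the original states (placeholder objective).
    loss = -probe(h_steered).mean() + 0.1 * (h_steered - h).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

h = torch.randn(1, d_model)
print((m.intervene(h, delta=2.0) - h).norm())  # state moved along the feature
```

The design point this sketch illustrates is the shift from editing activations directly (add a vector) to editing them through learned coordinates, which is what allows the feature's representation to be non-linear.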