HDMI: Probe-Free Causal Probing for LLMs
A new method called Hidden-state Driven Margin Intervention (HDMI) performs causal probing of large language models without training auxiliary probe classifiers. Rather than relying on a separately trained probe to guide the intervention, HDMI steers hidden states with gradients computed from the model's own output distribution. It uses a margin objective that raises the probability of a target continuation while lowering the probability of the source continuation. A lookahead variant, LA-HDMI, extends this to text editing by backpropagating through softmax embeddings. Because no separate probe is trained, the approach avoids the misalignment issues common in probe-based methods. The paper is available on arXiv under ID 2605.07631.
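The margin objective can be illustrated with a toy sketch. It assumes, purely for illustration, a single linear readout from hidden state to logits (a hypothetical unembedding matrix W), single-token target and source continuations, and a hinge form of the margin, max(0, gamma - (log p_target - log p_source)). The paper's exact objective, layer choice, and optimizer are not given in this summary. Because the log-softmax normalizer is shared by both tokens, the margin reduces to a logit difference, so in this toy its gradient with respect to h is simply W[target] - W[source].

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def logits_from_hidden(W, h):
    """Hypothetical linear readout: logits = W @ h, with W of shape vocab x d."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def margin(W, h, target, source):
    """log p(target) - log p(source) under the toy readout."""
    lp = log_softmax(logits_from_hidden(W, h))
    return lp[target] - lp[source]


def steer(W, h, target, source, gamma=2.0, lr=0.1, steps=200):
    """Gradient descent on the assumed hinge loss max(0, gamma - margin) w.r.t. h."""
    h = list(h)  # copy so the caller's hidden state is left untouched
    # The log-sum-exp term cancels in the margin, so d(margin)/dh = W[target] - W[source].
    direction = [wt - ws for wt, ws in zip(W[target], W[source])]
    for _ in range(steps):
        if margin(W, h, target, source) >= gamma:
            break  # hinge loss is zero: target already beats source by gamma
        # The loss gradient is -direction, so a descent step adds lr * direction.
        h = [x + lr * d for x, d in zip(h, direction)]
    return h


# Usage: push a hidden state that prefers token 0 toward preferring token 1.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h0 = [2.0, 0.0]
h1 = steer(W, h0, target=1, source=0)
```

In a real transformer the intervened hidden state passes through further layers before the readout, so the gradient would come from automatic differentiation rather than this closed form; the toy only shows how the margin pushes probability mass from the source token to the target token.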
Key facts
- HDMI is a probe-free, gradient-based causal probing technique.
- It steers hidden states with a margin objective that favors a target continuation over a source continuation.
- LA-HDMI, a lookahead variant, enables text editing by backpropagating through softmax embeddings.
- The method avoids auxiliary probe classifiers.
- Paper available at arXiv:2605.07631.
- Causal probing tests how internal representations influence model behavior.
- Existing methods rely on trained probe classifiers.
- HDMI computes its steering signal directly from the model's native output.
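The lookahead idea behind LA-HDMI can be sketched in the same toy setting. This is an assumed reading of "backpropagating through softmax embeddings": the next-token distribution is turned into a soft embedding (the softmax-weighted average of hypothetical token embeddings E), that soft embedding is fed forward as the next step's hidden state, and the margin is measured on the token after it. The gradient therefore has to pass back through the softmax to the original hidden state. The one-step rollout, the matrices W and E, and the threshold gamma are illustrative assumptions, not the paper's architecture.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    ex = [math.exp(v - m) for v in z]
    s = sum(ex)
    return [v / s for v in ex]


def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def lookahead_forward(W, E, h):
    """Toy rollout: h -> next-token probs -> soft embedding -> following-token logits."""
    p1 = softmax(matvec(W, h))  # distribution over the first generated token
    d = len(h)
    # Soft embedding: probability-weighted average of the (hypothetical) token embeddings.
    e = [sum(p1[k] * E[k][i] for k in range(len(E))) for i in range(d)]
    z2 = matvec(W, e)  # toy assumption: the soft embedding is used directly as the next hidden state
    return p1, e, z2


def lookahead_margin_grad(W, E, h, target, source):
    """Return the lookahead margin z2[target] - z2[source] and its gradient w.r.t. h."""
    p1, e, z2 = lookahead_forward(W, E, h)
    m = z2[target] - z2[source]
    # d m / d e
    g_e = [wt - ws for wt, ws in zip(W[target], W[source])]
    # d m / d p1[k] = E[k] . g_e
    u = [sum(ek * gi for ek, gi in zip(E[k], g_e)) for k in range(len(E))]
    ubar = sum(pk * uk for pk, uk in zip(p1, u))
    # Softmax Jacobian: d m / d z1[j] = p1[j] * (u[j] - ubar)
    g_z1 = [pj * (uj - ubar) for pj, uj in zip(p1, u)]
    # z1 = W h, so d m / d h = W^T g_z1
    g_h = [sum(g_z1[k] * W[k][i] for k in range(len(W))) for i in range(len(h))]
    return m, g_h


def la_steer(W, E, h, target, source, gamma=0.5, lr=0.5, steps=500):
    """Gradient ascent on the lookahead margin until it reaches gamma (hinge loss hits zero)."""
    h = list(h)
    for _ in range(steps):
        m, g = lookahead_margin_grad(W, E, h, target, source)
        if m >= gamma:
            break
        h = [x + lr * gi for x, gi in zip(h, g)]
    return h


# Usage: steer h so that, one soft step ahead, token 1 beats token 0.
W = [[1.0, 0.0], [0.0, 1.0]]
E = [[1.0, 0.0], [0.0, 1.0]]
h0 = [1.0, 0.0]
h1 = la_steer(W, E, h0, target=1, source=0)
```

The soft embedding keeps the rollout differentiable; sampling or taking the argmax of the first token would cut the gradient path back to the edited hidden state.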
Entities
Institutions
- arXiv