ARTFEED — Contemporary Art Intelligence

Linear Steering Fails to Correct Decodable LLM Failures in Medical QA

ai-technology · 2026-05-09

A recent study posted to arXiv (2605.05715) indicates that although failure signals in large language models (LLMs) can be linearly decoded from their hidden states, fixed linear steering techniques fail to correct those failures. The investigation centers on "Overthinking" (OT) in medical question answering, a regime in which models answer correctly under resampling but fail during extended chain-of-thought reasoning. OT can be linearly decoded at 71.6% balanced accuracy (p < 10^{-16}), yet 29 configurations of fixed linear steering evaluated across 1,273 trials produce no improvement (Delta ~= 0). The null result replicates across architectures (Qwen2.5-7B) and domains (MMLU-STEM), pointing to representational entanglement as a fundamental limitation of current LLM interpretability methods.
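
The two ingredients named above, a linear probe over hidden states and a fixed linear steering vector, are simple to illustrate. The sketch below is not from the paper: it fits a probe on pre-extracted hidden states and reports balanced accuracy, then registers a PyTorch forward hook that adds a scaled direction vector to a layer's activations. All variable names are placeholders.

```python
# Minimal, illustrative sketch only -- not the paper's code. Assumes hidden
# states have already been extracted; `hidden_states`, `labels`, `direction`,
# and `alpha` are placeholder names, not identifiers from the study.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def probe_decodability(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on hidden states (n_examples, d_model) and return
    held-out balanced accuracy for the binary failure label."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, labels, test_size=0.2, stratify=labels, random_state=0
    )
    probe = LogisticRegression(max_iter=1000, class_weight="balanced")
    probe.fit(X_tr, y_tr)
    return balanced_accuracy_score(y_te, probe.predict(X_te))


def add_fixed_steering_hook(layer: torch.nn.Module,
                            direction: torch.Tensor,
                            alpha: float) -> torch.utils.hooks.RemovableHandle:
    """Register a forward hook that adds a fixed vector alpha * v_hat to the
    layer's output activations -- one common form of fixed linear steering."""
    v_hat = direction / direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return tuples; steer the hidden states only.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v_hat.to(device=hidden.device, dtype=hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)
```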

Key facts

  • arXiv paper 2605.05715 investigates the classification-correction gap in LLMs.
  • Overthinking (OT) is a stable behavioral regime in medical QA with Jaccard >= 0.81 and 94% inter-annotator agreement.
  • OT is linearly decodable at 71.6% balanced accuracy (p < 10^{-16}).
  • Five families of fixed linear steering (29 configurations, n=1,273) yield Delta ~= 0.
  • Null results are cross-architecture (Qwen2.5-7B) and cross-domain (MMLU-STEM).
  • The OT direction has 85-88% overlap with task-critical computation (specificity ratio <= 0.152).
  • Non-targeted shared-direction steering reduces accuracy by 12.1 percentage points.
  • LEACE concept erasure reduces accuracy by 3.6pp (p=0.01), while 10 random erasures yield Delta=+0.3pp (the Delta metric and a simplified erasure are sketched after this list).
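
For the figures in the last two bullets, the sketch below shows, under illustrative assumptions rather than the paper's setup, how an intervention effect Delta can be measured as an accuracy change in percentage points, alongside a simplified projection-based stand-in for concept erasure (full LEACE is an affine, least-squares-optimal erasure and is not reproduced here).

```python
# Illustrative sketch only -- not the paper's evaluation code. All names are
# placeholders; the erasure below is a crude orthogonal projection, not LEACE.
import numpy as np


def delta_pp(baseline_correct: np.ndarray, intervened_correct: np.ndarray) -> float:
    """Accuracy change in percentage points (intervened minus baseline),
    given boolean per-question correctness arrays."""
    return 100.0 * (intervened_correct.mean() - baseline_correct.mean())


def erase_direction(hidden_states: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state along a unit direction:
    x <- x - (x . v_hat) v_hat."""
    v_hat = direction / np.linalg.norm(direction)
    return hidden_states - np.outer(hidden_states @ v_hat, v_hat)
```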

Entities

Institutions

  • arXiv

Sources