ARTFEED — Contemporary Art Intelligence

Research Reveals Asymmetric Dynamics in AI Hallucination Through Causal Analysis

ai-technology · 2026-04-20

A recent study offers causal evidence on how hallucinations arise in autoregressive language models, attributing them to early trajectory commitment governed by asymmetric attractor dynamics. The researchers used same-prompt bifurcation, repeatedly sampling identical inputs to track spontaneous divergence and thereby separate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B, 27 of 61 prompts (44.3%) across six categories bifurcated, with factual and hallucinated trajectories diverging at the first generated token.

Activation patching across 28 layers revealed a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupted the output in 87.5% of trials at layer 20, while the reverse intervention recovered only 33.3% of trials at layer 24. Both rates exceed the 10.4% baseline and the 12.5% random-patch control (p = 0.025). Window patching further showed that correction demands sustained multi-step intervention, whereas corruption is far easier to induce. Documented in arXiv preprint 2604.15400v1, the work clarifies how hallucinations emerge and persist in transformer generation: once a model commits to a hallucinated path, reversing it is much harder than inducing an error in the first place.
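The same-prompt bifurcation setup can be illustrated with a toy sketch. The random generator below is a hypothetical stand-in for the paper's sampling pipeline (which draws from Qwen2.5-1.5B); only the divergence check mirrors the method described:

```python
import random

def sample_trajectory(rng, length=5, vocab=("a", "b")):
    # Toy stand-in for autoregressive sampling: each step draws a token at
    # random. The study samples a real language model instead.
    return [rng.choice(vocab) for _ in range(length)]

def first_divergence(traj_a, traj_b):
    # Index of the first generated token where two same-prompt samples
    # differ, or None if the trajectories agree everywhere.
    for i, (x, y) in enumerate(zip(traj_a, traj_b)):
        if x != y:
            return i
    return None

rng = random.Random(0)
a, b = sample_trajectory(rng), sample_trajectory(rng)
split_at = first_divergence(a, b)  # None means this prompt did not bifurcate
```

Because the prompt is held fixed, any divergence index found this way reflects the sampling trajectory itself rather than prompt-level differences; the paper's finding is that when prompts bifurcate, this index is the very first token.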

Key facts

  • Hallucination in autoregressive language models is linked to early trajectory commitment governed by asymmetric attractor dynamics.
  • Same-prompt bifurcation isolates trajectory dynamics from prompt-level confounds by repeatedly sampling identical inputs.
  • On Qwen2.5-1.5B, 27 out of 61 prompts (44.3%) bifurcated with trajectories diverging at the first generated token.
  • Activation patching shows injecting hallucinated activation into a correct trajectory corrupts output in 87.5% of trials at layer 20.
  • Reversing hallucination by injecting correct activation recovers only 33.3% at layer 24.
  • Both corruption and recovery rates exceed the 10.4% baseline and 12.5% random-patch control, with p = 0.025.
  • Window patching indicates correction requires sustained multi-step intervention, while corruption is easier to induce.
  • The study is documented in arXiv preprint 2604.15400v1, using causal methods to analyze transformer generation.
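
Activation patching itself can be sketched in a few lines. The two-layer map below is a minimal hypothetical stand-in for a transformer (the study patches across 28 real layers); it shows the core move of caching an activation from one run and injecting it into another:

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    # Tiny two-layer map. If `patch` is given, it replaces the hidden
    # activation, mimicking activation patching at a chosen layer.
    h = np.tanh(W1 @ x)
    if patch is not None:
        h = patch
    return W2 @ h, h

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x_correct, x_halluc = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Cache the hidden activation from the "hallucinated" run...
_, h_halluc = forward(x_halluc, W1, W2)
# ...and inject it into the "correct" run.
y_patched, _ = forward(x_correct, W1, W2, patch=h_halluc)
y_halluc, _ = forward(x_halluc, W1, W2)
```

In this toy the patch replaces the entire hidden state, so the patched output simply matches the donor run; in a transformer the patch lands at one layer among many and the remaining layers keep computing, which is what makes the observed corruption/recovery asymmetry informative.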

Entities

Institutions

  • arXiv

Sources