ARTFEED — Contemporary Art Intelligence

Sparse Autoencoders Reveal Semantic Features Driving Brain-LLM Alignment

ai-technology · 2026-05-25

Researchers from MIT and Harvard used sparse autoencoders (SAEs) to decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. They found that semantic features alone recover 94% of peak encoding performance (r=0.285) when predicting human brain responses to language. Under a human-validated taxonomy (κ ≥ 0.74), these semantic features substantially exceed variance-matched baselines (p<0.001, d=1.31). The study also tested a novel cortical topography prediction: five semantic subcategories derived from three independent neuroscience programs map onto distinct brain regions, and a formal convergence test confirms the alignment (Spearman ρ=0.72, p<0.001; hypergeometric p=0.007). The work bridges mechanistic interpretability and neural encoding models, and offers a mechanistic explanation for why intermediate LLM layers best predict brain activity.
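For readers unfamiliar with SAEs, the decomposition step can be pictured with a minimal sketch. It assumes the common ReLU encoder with an L1 sparsity penalty; the summary above does not say which SAE variant the study used. The tiny dimensions, random weights, and function names are illustrative only.

```python
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_encode(x, W_enc, b_enc):
    """Map a dense layer activation to non-negative feature activations (ReLU)."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def sae_decode(f, W_dec, b_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    return [r + b for r, b in zip(matvec(W_dec, f), b_dec)]

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    f = sae_encode(x, W_enc, b_enc)
    x_hat = sae_decode(f, W_dec, b_dec)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(f)  # features are non-negative, so the sum equals the L1 norm
    return mse + l1_coeff * l1, f

# Toy sizes (assumption): the study reports 16K-32K features per layer.
rng = random.Random(0)
d_model, n_features = 8, 32
W_enc = [[rng.gauss(0, 0.3) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [-0.5] * n_features  # a negative bias keeps more features at zero
W_dec = [[rng.gauss(0, 0.3) for _ in range(n_features)] for _ in range(d_model)]
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for one layer activation
loss, features = sae_loss(x, W_enc, b_enc, W_dec, b_dec)
active = sum(1 for f in features if f > 0)
print(f"loss={loss:.3f}, active features={active}/{n_features}")
```

In a trained SAE the weights are learned so that only a handful of features fire for any input, which is what makes individual features candidates for human labeling.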

Key facts

  • Sparse autoencoders (SAEs) decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer.
  • Semantic features alone recover 94% of peak encoding performance (r=0.285) in predicting brain responses.
  • Human-validated taxonomy achieves κ ≥ 0.74.
  • Semantic features substantially exceed variance-matched baselines (p<0.001, d=1.31).
  • Five semantic subcategories derived from three independent neuroscience programs map onto distinct brain regions.
  • Convergence test confirms alignment with Spearman ρ=0.72, p<0.001; hypergeometric p=0.007.
  • Study bridges mechanistic interpretability with neural encoding models.
  • Provides mechanistic explanation for why intermediate LLM layers best predict brain activity.
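The statistics quoted in the key facts can be illustrated with a short sketch on toy data. It assumes that κ refers to Cohen's kappa for two raters, that d is Cohen's d with a pooled standard deviation, and that ρ is Spearman's rank correlation computed as Pearson's r on ranks; the summary does not confirm the exact variants. All numbers and labels below are made up for illustration and are not the study's data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation, the usual score for encoding-model predictions."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman_rho(x, y):
    """Spearman correlation as Pearson's r on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

def cohens_kappa(r1, r2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n
    labels = set(r1) | set(r2)
    p_exp = sum((r1.count(l) / n) * (r2.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy inputs (hypothetical): two rank orderings, two score groups, two raters.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
d = cohens_d([2, 4, 6], [1, 3, 5])
kappa = cohens_kappa(["sem", "sem", "syn", "syn"], ["sem", "sem", "syn", "sem"])
print(f"rho={rho:.2f}, d={d:.2f}, kappa={kappa:.2f}")  # rho=0.80, d=0.50, kappa=0.50
```

On these scales the reported values are large: d=1.31 is well above the conventional 0.8 threshold for a large effect, and κ ≥ 0.74 is commonly read as substantial agreement.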

Entities

Institutions

  • MIT
  • Harvard
