Sparse Autoencoders Reveal Semantic Features Driving Brain-LLM Alignment

ai-technology · 2026-05-25

Researchers from MIT and Harvard used sparse autoencoders (SAEs) to decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. They found that semantic features alone recover 94% of peak encoding performance (r=0.285) when predicting human brain responses to language. Under a human-validated feature taxonomy (κ ≥ 0.74), semantic features substantially exceed variance-matched baselines (p<0.001, d=1.31). The study also tests a novel cortical topography prediction: five semantic subcategories, derived from three independent neuroscience programs, map onto distinct brain regions, and a formal convergence test confirms the alignment (Spearman ρ=0.72, p<0.001; hypergeometric p=0.007). The work bridges mechanistic interpretability and neural encoding models, and offers a mechanistic explanation for why intermediate LLM layers best predict brain activity.
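
For readers unfamiliar with SAEs, the decomposition step can be pictured with a minimal sketch. The summary does not describe the architecture the authors used, so this assumes the common ReLU formulation (features = ReLU(W_enc(x - b_dec) + b_enc), reconstruction = W_dec f + b_dec). The function names, the tied random weights, and the toy dimensions are all hypothetical; a real SAE is trained on model activations and has thousands of features per layer.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_encode(x, w_enc, b_enc, b_dec):
    """Map one activation vector to non-negative feature activations.

    Assumed form: ReLU(W_enc @ (x - b_dec) + b_enc).
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_activation = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    return [max(0.0, p) for p in pre_activation]


def sae_decode(features, w_dec, b_dec):
    """Reconstruct the activation vector as W_dec @ features + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, features), b_dec)]


def make_toy_sae(d_model, n_features, seed=0):
    """Build an UNTRAINED toy SAE (hypothetical helper, illustration only)."""
    rng = random.Random(seed)
    w_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
    # Tied weights: the decoder is the transpose of the encoder.
    w_dec = [[w_enc[j][i] for j in range(n_features)] for i in range(d_model)]
    b_enc = [-1.0] * n_features  # negative bias keeps many features at zero
    b_dec = [0.0] * d_model
    return w_enc, b_enc, w_dec, b_dec


if __name__ == "__main__":
    w_enc, b_enc, w_dec, b_dec = make_toy_sae(d_model=4, n_features=16)
    x = [0.5, -1.0, 0.25, 2.0]  # stand-in for one layer activation
    features = sae_encode(x, w_enc, b_enc, b_dec)
    reconstruction = sae_decode(features, w_dec, b_dec)
    active = sum(1 for f in features if f > 0)
    print(f"{active} of {len(features)} features active")
```

In an encoding analysis of the kind summarized here, the feature activations (rather than the raw layer activations) would serve as predictors of brain responses, which is what makes it possible to ask how much of the prediction a single feature category carries.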

Key facts

Sparse autoencoders (SAEs) decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer.
Semantic features alone recover 94% of peak encoding performance (r=0.285) in predicting brain responses.
Human-validated taxonomy achieves κ ≥ 0.74.
Semantic features substantially exceed variance-matched baselines (p<0.001, d=1.31).
Five semantic subcategories derived from three independent neuroscience programs map onto distinct brain regions.
Convergence test confirms alignment with Spearman ρ=0.72, p<0.001; hypergeometric p=0.007.
The study bridges mechanistic interpretability with neural encoding models.
It provides a mechanistic explanation for why intermediate LLM layers best predict brain activity.
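
The two statistics named in the convergence test can be sketched in a few lines of standard-library Python. The summary does not say how the authors set up either test, so this is one plausible reading: Spearman ρ as a Pearson correlation on average ranks, and the hypergeometric p-value as the upper-tail probability of an overlap at least as large as the one observed. All function names and the example inputs are hypothetical and are not the study's data.

```python
from math import comb, sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing without replacement."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / comb(population, draws)


if __name__ == "__main__":
    # Hypothetical predicted vs. observed region rankings.
    predicted = [1, 2, 3, 4, 5, 6]
    observed = [2, 1, 3, 5, 4, 6]
    print("rho =", spearman(predicted, observed))
    # Hypothetical overlap: 4 of 5 selected regions fall in a 6-region
    # target set drawn from 20 candidate regions.
    print("p =", hypergeom_tail(20, 6, 5, 4))
```

Read this way, the rank correlation asks whether predicted and observed orderings agree overall, while the hypergeometric tail asks whether the number of matching regions is larger than chance overlap would produce.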
