Composition Collapse in LLMs: Stable Facts Don't Ensure Reasoning

ai-technology · 2026-05-27

A recent study posted to arXiv (2605.26789) finds that large language models can reliably recall individual facts yet still fail to chain those facts into multi-hop reasoning, a failure mode the authors call 'composition collapse.' The researchers propose a double-gate protocol that measures residual composition failure conditioned on stable atomic access, and they decompose post-training gains into three channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2 to 11, evaluated across four post-training recipes, statistically indistinguishable atomic knowledge produced composition outcomes that differed by more than 40 percentage points. The finding suggests that aggregate benchmark scores, which treat multi-hop reasoning as a single skill, can be misleading.
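The idea of conditioning on stable atomic access can be illustrated with a small Python sketch. Everything here is an assumption made for illustration, not the paper's implementation: the ChainResult record, the is_stable criterion (a fact counts as stable only if every repeated probe is answered correctly), and the toy data are all hypothetical, and the paper's actual gate definitions may differ. The sketch contrasts a plain failure rate over all chains with the failure rate over only those chains whose every hop is stably known.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChainResult:
    """One multi-hop item. Hypothetical record layout, for illustration only."""
    atomic_correct: List[List[bool]]  # per hop: correctness over repeated probes
    chain_correct: bool               # was the composed multi-hop answer correct?


def is_stable(probes: List[bool]) -> bool:
    # Assumed stability criterion: the fact is answered correctly on every probe.
    return len(probes) > 0 and all(probes)


def unconditioned_failure(results: List[ChainResult]) -> float:
    # Share of all chains answered incorrectly, ignoring atomic knowledge.
    return sum(not r.chain_correct for r in results) / len(results)


def residual_composition_failure(results: List[ChainResult]) -> Optional[float]:
    # Keep only chains whose every hop passes the stability check, then
    # measure how often composition still fails on that subset.
    gated = [r for r in results if all(is_stable(p) for p in r.atomic_correct)]
    if not gated:
        return None  # no chain passed the gate, so the rate is undefined
    return sum(not r.chain_correct for r in gated) / len(gated)


# Synthetic toy data: four 2-hop chains, two probes per atomic fact.
results = [
    ChainResult([[True, True], [True, True]], True),
    ChainResult([[True, True], [True, True]], False),
    ChainResult([[True, False], [True, True]], False),
    ChainResult([[False, False], [True, True]], False),
]

print(unconditioned_failure(results))         # 0.75
print(residual_composition_failure(results))  # 0.5
```

In this toy data, two of the three failures involve an unstable or unknown atomic fact, so only the remaining failure counts as a composition failure once the gate is applied.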

Key facts

Composition collapse is the systematic failure to assemble stably-known facts into chains.
Statistically indistinguishable atomic knowledge can produce composition behavior separated by over 40 percentage points.
The double-gate protocol changes the estimand from aggregate compositionality gap to residual composition failure conditioned on stable atomic access.
Post-training gains are decomposed into three independent channels: atomic stability, residual composition, and critical depth.
The benchmark uses temporal factual chains spanning depths 2 to 11.
Four post-training recipes were evaluated.
Aggregate benchmark scores that treat multi-hop reasoning as a single skill can be misleading.
The study is published on arXiv with ID 2605.26789.
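As a rough illustration of how a depth-resolved view differs from a single aggregate score, the Python sketch below groups chain outcomes by depth and reports a "critical depth." The definition used here (the smallest depth at which accuracy falls below a threshold) and the 0.5 threshold are assumptions chosen for illustration; the source does not state how the paper defines critical depth. The function names and the records are hypothetical, and the data is synthetic.

```python
from collections import defaultdict


def accuracy_by_depth(records):
    # records: iterable of (depth, chain_correct) pairs, assumed to contain
    # only chains whose atomic facts are already known to the model.
    totals = defaultdict(int)
    hits = defaultdict(int)
    for depth, correct in records:
        totals[depth] += 1
        hits[depth] += int(correct)
    return {d: hits[d] / totals[d] for d in sorted(totals)}


def critical_depth(acc_by_depth, threshold=0.5):
    # Assumed definition: smallest depth whose accuracy is below the threshold.
    for depth in sorted(acc_by_depth):
        if acc_by_depth[depth] < threshold:
            return depth
    return None  # accuracy never drops below the threshold


# Synthetic toy records, two chains at each of three depths.
records = [
    (2, True), (2, True),
    (3, True), (3, False),
    (4, False), (4, False),
]

acc = accuracy_by_depth(records)
print(acc)                  # {2: 1.0, 3: 0.5, 4: 0.0}
print(critical_depth(acc))  # 4
```

The overall accuracy of these toy records is 0.5, a single number that hides the drop from perfect accuracy at depth 2 to zero at depth 4.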
