Fully Looped Transformer Stabilizes Training Without Extra Parameters

ai-technology · 2026-05-20

A new paper on arXiv proposes the Fully Looped Transformer, a modification of the Looped Transformer architecture that addresses training instability. A Looped Transformer reapplies the same layers over several loop iterations, and as the number of iterations grows, training becomes unstable because of gradient oscillation and residual explosion. The authors introduce two parameter-free modifications: a Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion, and Attention Injection, which reuses the model's existing attention mechanisms. Together, these let performance scale with additional computation without increasing model size or context length, and the number of loop iterations can be adjusted at inference to trade performance against test-time compute. The paper is available at arXiv:2605.18797.

Key facts

arXiv:2605.18797
Looped Transformer training becomes unstable as the number of loop iterations increases
Instability stems from gradient oscillation and residual explosion
Fully Looped Transformer introduces two parameter-free modifications
Fully Looped Architecture distributes inter-loop signals across all layers
Attention Injection reuses existing attention mechanisms
Loop iterations can be adjusted at inference
No increase in parameter count or context length
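
The basic looping idea behind these facts can be illustrated with a toy sketch. This is not the paper's method: it shows only plain weight-tied looping, and does not implement the Fully Looped Architecture or Attention Injection, whose details are not given here. The names `make_layer`, `apply_layer`, `looped_forward`, and `count_parameters` are hypothetical, and each "layer" is a small residual update standing in for a full Transformer block. The sketch shows that the loop count is a runtime argument and that the parameter count does not depend on it.

```python
import math
import random


def make_layer(dim, rng):
    """One toy layer: a dim x dim weight matrix (stand-in for a Transformer block)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, hidden):
    """Residual update: hidden + tanh(W @ hidden)."""
    return [
        h + math.tanh(sum(w * x for w, x in zip(row, hidden)))
        for h, row in zip(hidden, weights)
    ]


def looped_forward(layers, hidden, n_loops):
    """Apply the same stack of layers n_loops times; weights are shared across loops."""
    for _ in range(n_loops):
        for weights in layers:
            hidden = apply_layer(weights, hidden)
    return hidden


def count_parameters(layers):
    """Total number of weights; depends only on the layers, not on the loop count."""
    return sum(len(row) for weights in layers for row in weights)


rng = random.Random(0)
layers = [make_layer(4, rng) for _ in range(2)]
x = [1.0, -0.5, 0.25, 0.0]

# The loop count is chosen per call, so it can differ between training and inference.
out_2 = looped_forward(layers, x, n_loops=2)
out_8 = looped_forward(layers, x, n_loops=8)
n_params = count_parameters(layers)  # 2 layers * 4 * 4 weights = 32 for any n_loops
```

In this toy setup, raising `n_loops` spends more computation on the same 32 weights. The residual term is added on every pass, which gives a rough intuition for why residual magnitudes can grow with loop depth.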
