Memory Tokens Prove Essential for Universal Transformer Reasoning
A recent study posted on arXiv (2604.21999) finds that learned memory tokens are essential for a single-block Universal Transformer with Adaptive Computation Time (ACT) to solve Sudoku-Extreme, a benchmark for combinatorial reasoning. Across every configuration tested (three seeds, multiple token counts, two initialization schemes, and both ACT and fixed-depth processing), no run without memory tokens reached nontrivial performance. The optimal token count shows a sharp lower threshold: T=0 consistently fails, T=4 is borderline, and T=8 reliably solves the 81-cell puzzles. Performance plateaus from T=8 to T=32 (57.4% ± 0.7% exact-match accuracy) before collapsing at T=64, which the authors attribute to attention dilution. The study also uncovered a router initialization trap that caused more than 70% of training runs to fail: both the default zero-bias initialization (halting probability p ≈ 0.5) and Graves' positive-bias initialization (p ≈ 0.73) cause the model to halt after roughly 2 steps, too early for iterative reasoning.
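The memory-token mechanism described above can be sketched as follows, assuming the common design in which T learned vectors are prepended to the input sequence and refined alongside it by the same shared block at every recurrent step. All names, dimensions, and the single-head attention here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class UniversalBlockWithMemory:
    """One shared block applied recurrently (Universal Transformer style);
    T learned memory tokens are prepended to the 81 puzzle-cell tokens."""
    def __init__(self, d_model=64, n_memory=8, seed=0):
        rng = np.random.default_rng(seed)
        self.memory = rng.normal(0, 0.02, (n_memory, d_model))  # learned params
        self.w_qkv = rng.normal(0, 0.02, (d_model, 3 * d_model))
        self.d = d_model

    def step(self, h):
        # Single-head self-attention over the joint [memory; cells] sequence,
        # so memory slots can read from and write to every cell each step.
        q, k, v = np.split(h @ self.w_qkv, 3, axis=-1)
        attn = softmax(q @ k.T / np.sqrt(self.d))
        return h + attn @ v  # residual update

    def forward(self, cells, n_steps=4):
        h = np.concatenate([self.memory, cells], axis=0)
        for _ in range(n_steps):
            h = self.step(h)  # identical weights every step
        return h[len(self.memory):]  # drop memory slots, return cell states

# 81 Sudoku-cell embeddings refined with T=8 memory tokens
block = UniversalBlockWithMemory(d_model=64, n_memory=8)
out = block.forward(np.zeros((81, 64)))
print(out.shape)  # (81, 64)
```

Because the memory slots participate in attention at every step, they give the recurrent block a persistent scratchpad that ordinary cell tokens, which must also encode the puzzle state, do not provide.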
Key facts
- Memory tokens are empirically necessary for Universal Transformer performance on Sudoku-Extreme.
- Optimal token count shows a sharp lower threshold: T=0 fails, T=4 borderline, T=8 succeeds.
- Stable plateau from T=8 to T=32 yields 57.4% ± 0.7% exact-match accuracy.
- Performance collapses at T=64 due to attention dilution.
- Router initialization trap causes >70% of training runs to fail.
- Default zero-bias initialization (p ~ 0.5) and Graves' positive bias (p ~ 0.73) both cause early halting.
- Study used 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing.
- Research published on arXiv with identifier 2604.21999.
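The early-halting figures in the bullets above can be reproduced with a back-of-the-envelope calculation: the router's initial halting probability is the sigmoid of its bias, and if a per-step halting probability p stays roughly constant, the expected halting step is about 1/p (a geometric distribution). This is a sanity check on the reported numbers, not code from the paper; the +1.0 bias for the Graves-style scheme is an assumption inferred from sigmoid(1.0) ≈ 0.73:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Default zero-bias init: router starts at p = sigmoid(0) = 0.5
p_zero = sigmoid(0.0)
# Graves-style positive bias (assumed +1.0 here, giving p ≈ 0.73)
p_graves = sigmoid(1.0)

# With per-step halt probability p, the expected halting step is ~1/p
print(f"zero bias:   p = {p_zero:.2f}, expected halt step ≈ {1 / p_zero:.1f}")
print(f"Graves bias: p = {p_graves:.2f}, expected halt step ≈ {1 / p_graves:.1f}")
```

Both initializations put the expected halting step near 2 or below, which matches the reported failure mode: the model halts before it can iterate long enough to reason over the puzzle.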