Memory Tokens Prove Essential for Universal Transformer Reasoning
A recent study posted on arXiv (2604.21999) finds that learned memory tokens are essential for a single-block Universal Transformer with Adaptive Computation Time (ACT) to solve Sudoku-Extreme, a benchmark for combinatorial reasoning. Across every configuration tested (three seeds, multiple token counts, two initialization schemes, and both ACT and fixed-depth processing), no run without memory tokens reached nontrivial performance. The optimal token count shows a sharp lower threshold: T=0 consistently fails, T=4 is borderline, and T=8 reliably solves the 81-cell puzzles. Performance plateaus from T=8 to T=32 (57.4% ± 0.7% exact-match accuracy) before collapsing at T=64, which the authors attribute to attention dilution. The study also uncovered a router initialization trap that caused more than 70% of training runs to fail: both the default zero-bias initialization (halting probability p ≈ 0.5) and Graves' positive-bias initialization (p ≈ 0.73) cause the model to halt after roughly 2 steps, too early for iterative reasoning.
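The memory-token mechanism described above can be sketched as follows, assuming the common design in which T learned vectors are prepended to the input sequence and refined alongside it by the same shared block at every recurrent step. All names, dimensions, and the single-head attention here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class UniversalBlockWithMemory:
    """One shared block applied recurrently (Universal Transformer style);
    T learned memory tokens are prepended to the 81 puzzle-cell tokens."""
    def __init__(self, d_model=64, n_memory=8, seed=0):
        rng = np.random.default_rng(seed)
        self.memory = rng.normal(0, 0.02, (n_memory, d_model))  # learned params
        self.w_qkv = rng.normal(0, 0.02, (d_model, 3 * d_model))
        self.d = d_model

    def step(self, h):
        # Single-head self-attention over the joint [memory; cells] sequence,
        # so memory slots can read from and write to every cell each step.
        q, k, v = np.split(h @ self.w_qkv, 3, axis=-1)
        attn = softmax(q @ k.T / np.sqrt(self.d))
        return h + attn @ v  # residual update

    def forward(self, cells, n_steps=4):
        h = np.concatenate([self.memory, cells], axis=0)
        for _ in range(n_steps):
            h = self.step(h)  # identical weights every step
        return h[len(self.memory):]  # drop memory slots, return cell states

# 81 Sudoku-cell embeddings refined with T=8 memory tokens
block = UniversalBlockWithMemory(d_model=64, n_memory=8)
out = block.forward(np.zeros((81, 64)))
print(out.shape)  # (81, 64)
```

Because the memory slots participate in attention at every step, they give the recurrent block a persistent scratchpad that ordinary cell tokens, which must also encode the puzzle state, do not provide.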
Key facts
- Memory tokens are empirically necessary for Universal Transformer performance on Sudoku-Extreme.
- Optimal token count shows a sharp lower threshold: T=0 fails, T=4 borderline, T=8 succeeds.
- Stable plateau from T=8 to T=32 yields 57.4% ± 0.7% exact-match accuracy.
- Performance collapses at T=64 due to attention dilution.
- Router initialization trap causes >70% of training runs to fail.
- Default zero-bias initialization (p ~ 0.5) and Graves' positive bias (p ~ 0.73) both cause early halting.
- Study used 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing.
- Research published on arXiv with identifier 2604.21999.
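The early-halting figures in the bullets above can be reproduced with a back-of-the-envelope calculation: the router's initial halting probability is the sigmoid of its bias, and if a per-step halting probability p stays roughly constant, the expected halting step is about 1/p (a geometric distribution). This is a sanity check on the reported numbers, not code from the paper; the +1.0 bias for the Graves-style scheme is an assumption inferred from sigmoid(1.0) ≈ 0.73:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Default zero-bias init: router starts at p = sigmoid(0) = 0.5
p_zero = sigmoid(0.0)
# Graves-style positive bias (assumed +1.0 here, giving p ≈ 0.73)
p_graves = sigmoid(1.0)

# With per-step halt probability p, the expected halting step is ~1/p
print(f"zero bias:   p = {p_zero:.2f}, expected halt step ≈ {1 / p_zero:.1f}")
print(f"Graves bias: p = {p_graves:.2f}, expected halt step ≈ {1 / p_graves:.1f}")
```

Both initializations put the expected halting step near 2 or below, which matches the reported failure mode: the model halts before it can iterate long enough to reason over the puzzle.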