Transformers Build Internal World Models Aligned with Sudoku's Constraint Structure
A new study posted on arXiv (2605.18847) reports that transformers trained on sequential reasoning traces develop internal world models that mirror the constraint algebra of the domain rather than its surface presentation. The researchers trained an 8-layer transformer on Sudoku solving traces and analyzed its internals mechanistically. They found that the model organizes information around rows, columns, and boxes, the structural units of Sudoku's constraints, rather than representing the board cell by cell. They also identified a 'naked-single circuit': dedicated neurons in the final MLP layer that detect when only one digit remains possible for a cell and reliably promote that digit. The results suggest that emergent world models in transformers are shaped by the underlying task structure.
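For readers unfamiliar with the terminology, the naked-single rule itself can be sketched in a few lines of conventional solver code. This is only an illustration of the Sudoku rule and of bookkeeping by row, column, and box; it is not the model's learned circuit and not code from the study. The names `unit_state`, `naked_singles`, and `example` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Grid = List[List[int]]  # 9x9 board, 0 marks an empty cell
DIGITS: Set[int] = set(range(1, 10))


def box_index(r: int, c: int) -> int:
    """Index (0-8) of the 3x3 box that contains cell (r, c)."""
    return 3 * (r // 3) + c // 3


def unit_state(grid: Grid) -> Tuple[List[Set[int]], List[Set[int]], List[Set[int]]]:
    """Digits already placed in each row, column, and box.

    This stores the board per constraint unit rather than per cell.
    """
    rows = [set() for _ in range(9)]
    cols = [set() for _ in range(9)]
    boxes = [set() for _ in range(9)]
    for r in range(9):
        for c in range(9):
            d = grid[r][c]
            if d:
                rows[r].add(d)
                cols[c].add(d)
                boxes[box_index(r, c)].add(d)
    return rows, cols, boxes


def naked_singles(grid: Grid) -> Dict[Tuple[int, int], int]:
    """Map each empty cell that has exactly one candidate to that digit."""
    rows, cols, boxes = unit_state(grid)
    found: Dict[Tuple[int, int], int] = {}
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                # A digit is a candidate if no unit containing the cell uses it.
                candidates = DIGITS - rows[r] - cols[c] - boxes[box_index(r, c)]
                if len(candidates) == 1:
                    found[(r, c)] = next(iter(candidates))
    return found


# Toy board (not a full puzzle): the first row is filled except its last cell.
example: Grid = [[1, 2, 3, 4, 5, 6, 7, 8, 0]] + [[0] * 9 for _ in range(8)]
print(naked_singles(example))  # {(0, 8): 9}
```

Here the candidate set for a cell is derived entirely from the three units that contain it, which is the kind of unit-level organization the study attributes to the model's internal representation.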
Key facts
- Study published on arXiv with ID 2605.18847
- 8-layer transformer trained on Sudoku solving traces
- Model builds a world model organized around substructures: rows, columns, and boxes
- Naked-single circuit found in final MLP layer
- Dedicated neurons detect when only one digit remains possible for a cell and promote it
- World model geometry shaped by constraint algebra, not by the surface presentation of the board
- Mechanistic analysis of internal computation performed
Entities
Institutions
- arXiv