Token Correspondence Improves World Model Consistency

ai-technology · 2026-05-20

A new approach to next-frame prediction in transformer-based world models targets temporal inconsistencies such as object duplication and disappearance. It formulates prediction as structured probabilistic inference with latent token correspondence variables, so that each next-frame token is explained either as a copy of a token from the previous frame or as a newly generated token. The method achieves state-of-the-art performance on four benchmarks, including a return of 72.5% and a score of 35.6% on Craftax-classic, surpassing the previous bests of 67.4% and 27.9%, respectively. The source code has been released.

Key facts

Transformer-based world models suffer from temporal inconsistency in long-horizon rollouts.
Issues include object duplication, disappearance, and transmutation.
Existing approaches treat next-frame prediction as token generation without temporal correspondence.
The new method models next-frame prediction as structured probabilistic inference with latent token correspondence variables.
Each next-frame token is explained either by copying a token from the previous frame or by generating a new token.
Achieves state-of-the-art results on four challenging benchmarks.
Craftax-classic: 72.5% return and 35.6% score (previous bests: 67.4% and 27.9%).
Source code is released.
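
The copy-or-generate idea in the key facts can be illustrated with a small mixture model. This is only a sketch under stated assumptions, not the paper's confirmed formulation: it assumes the latent correspondence for one next-frame token is a categorical variable over "copy previous token i" plus one extra "generate new" option, and that the token distribution is obtained by marginalizing over that variable. The names `softmax`, `next_token_distribution`, `corr_logits`, and `gen_logits` are hypothetical and do not come from the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(prev_tokens, corr_logits, gen_logits):
    """Marginal distribution over one next-frame token (illustrative only).

    prev_tokens: token ids of the previous frame, length N.
    corr_logits: N + 1 scores. Entry i < N scores "copy previous token i";
                 the last entry scores "generate a new token".
    gen_logits:  scores over the vocabulary, used only in the generate case.
    """
    if len(corr_logits) != len(prev_tokens) + 1:
        raise ValueError("corr_logits needs one entry per previous token plus one")

    corr = softmax(corr_logits)  # assumed posterior over the correspondence variable
    gen = softmax(gen_logits)    # assumed distribution for newly generated tokens

    # Generate branch: probability of "new" times the generative distribution.
    probs = [corr[-1] * g for g in gen]

    # Copy branch: each previous token contributes its correspondence
    # probability to the vocabulary entry it would be copied into.
    for i, tok in enumerate(prev_tokens):
        probs[tok] += corr[i]
    return probs


# Usage: two previous tokens (ids 2 and 0), a vocabulary of size 4,
# and uniform scores everywhere.
example = next_token_distribution([2, 0], [0.0, 0.0, 0.0], [0.0] * 4)
```

With uniform scores, each of the three correspondence options gets probability 1/3, so token ids 0 and 2 each receive 1/3 from copying plus 1/12 from generation, while ids 1 and 3 receive only 1/12. The point of the sketch is that a strong copy probability ties a next-frame token to a specific previous-frame token, which is the kind of correspondence that plain token generation lacks.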

Entities

—

Sources

arXiv cs.AI — 2026-05-19