Token Correspondence Improves World Model Consistency
A new approach to next-frame prediction in transformer-based world models addresses temporal inconsistencies such as object duplication and disappearance. It formulates prediction as structured probabilistic inference with latent token correspondence variables, so that each next-frame token is explained either by copying a token from the previous frame or by generating a new one. The method achieves state-of-the-art performance on four benchmarks, including a return of 72.5% and a score of 35.6% on Craftax-classic, surpassing the previous bests of 67.4% and 27.9%, respectively. Source code is released.
Key facts
- Transformer-based world models suffer from temporal inconsistency in long-horizon rollouts.
- Issues include object duplication, disappearance, and transmutation.
- Existing approaches treat next-frame prediction as token generation without temporal correspondence.
- The new method models next-frame prediction as structured probabilistic inference with latent token correspondence variables.
- Each next-frame token is explained either by copying a token from the previous frame or by generating a new token.
- Achieves state-of-the-art on 4 challenging benchmarks.
- Craftax-classic: 72.5% return and 35.6% score (previous best 67.4% and 27.9%).
- Source code is released.
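The copy-or-generate idea in the facts above can be illustrated with a small mixture model. This is only a minimal sketch under stated assumptions, not the released implementation: the function name `next_token_marginal`, the array shapes, and the choice to treat the correspondence variable as a softmax over N copy sources plus one "generate" option are all hypothetical. It marginalizes the latent correspondence to obtain a distribution over token ids for each next-frame slot.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def next_token_marginal(prev_tokens, corr_logits, gen_logits):
    """Marginal token distribution for each next-frame slot (hypothetical sketch).

    prev_tokens: (N,) int token ids of the previous frame.
    corr_logits: (M, N + 1) scores per next-frame slot; the first N columns
                 mean "copy previous token j", the last column means "generate".
    gen_logits:  (M, V) scores of the generation head over a vocabulary of size V.
    Returns an (M, V) array whose rows sum to 1.
    """
    vocab_size = gen_logits.shape[1]
    corr = softmax(corr_logits)                  # p(z_i) over copy sources + generate
    copy_probs, gen_prob = corr[:, :-1], corr[:, -1:]
    one_hot_prev = np.eye(vocab_size)[prev_tokens]   # (N, V): copying is deterministic
    copy_part = copy_probs @ one_hot_prev            # sum_j p(z_i = j) * [prev_j == v]
    gen_part = gen_prob * softmax(gen_logits)        # p(z_i = generate) * g_i(v)
    return copy_part + gen_part

# Toy usage: 3 previous tokens, 2 next-frame slots, vocabulary of 5 ids.
prev_tokens = np.array([2, 0, 2])
probs = next_token_marginal(prev_tokens, np.zeros((2, 4)), np.zeros((2, 5)))
print(probs)
```

With uniform scores, most of the probability mass lands on token ids that already appear in the previous frame, which is the intuition behind using correspondence to discourage objects from spuriously appearing, vanishing, or changing identity.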