Mull-Tokens: Modality-Agnostic Latent Thinking for Multimodal Reasoning
Mull-Tokens is a new approach that introduces modality-agnostic latent tokens, allowing models to reason across text and images without relying on specialist tools or costly image generation. The tokens are pre-trained to hold intermediate information in either modality, so the model can reason free-form toward the correct answer. Inspired by latent reasoning frameworks, training first supervises the tokens with interleaved text-image traces, then fine-tunes using only final answers. The approach is evaluated on four challenging spatial reasoning benchmarks and is presented as more scalable and robust than existing multimodal models, which the paper describes as brittle.
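The two training stages described above can be pictured as a change in which sequence positions contribute to the loss. The sketch below is a toy illustration only, not the paper's implementation. The `<mull>` placeholder string, the function names, the fixed number of latent slots, and the choice to also supervise the answer in stage 1 are all assumptions made for this example.

```python
from typing import List

# Hypothetical placeholder for one latent slot; the real token name is not given in the source.
LATENT_TOKEN = "<mull>"


def build_sequence(question: List[str], num_latent: int, answer: List[str]) -> List[str]:
    """Lay out a sequence as: question tokens, latent slots, answer tokens."""
    return question + [LATENT_TOKEN] * num_latent + answer


def loss_mask(q_len: int, num_latent: int, a_len: int, stage: int) -> List[int]:
    """Return 1 for positions that contribute to the loss, 0 otherwise.

    Stage 1 (assumed): latent slots are supervised by targets taken from
    interleaved text-image traces, and the answer is supervised as well.
    Stage 2 (assumed): only the final answer is supervised, so the latent
    slots are free to hold whatever intermediate information helps.
    """
    if stage == 1:
        latent_part = [1] * num_latent
    elif stage == 2:
        latent_part = [0] * num_latent
    else:
        raise ValueError("stage must be 1 or 2")
    # Question tokens are treated as the prompt and never supervised here.
    return [0] * q_len + latent_part + [1] * a_len


question = ["where", "is", "the", "cup"]
answer = ["left"]
sequence = build_sequence(question, num_latent=3, answer=answer)
stage1_mask = loss_mask(len(question), 3, len(answer), stage=1)
stage2_mask = loss_mask(len(question), 3, len(answer), stage=2)
```

In this toy layout, the only difference between the two stages is whether the latent slots are masked out of the loss, which mirrors the summary's description of moving from trace supervision to answer-only fine-tuning.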
Key facts
- Mull-Tokens are modality-agnostic latent tokens pre-trained to hold intermediate information in text or image modalities.
- The method avoids reliance on specialist tools, costly image generation, or handcrafted reasoning data.
- Training first uses supervision from interleaved text-image traces, then fine-tunes with no supervision of the intermediate tokens, using only final answers.
- Evaluated on four challenging spatial reasoning benchmarks.
- The approach is inspired by latent reasoning frameworks.
- Existing multimodal models are described as brittle and not scalable.
- The work is published on arXiv with ID 2512.10941.
- The paper explores best practices for training Mull-Tokens.
Entities
Institutions
- arXiv