Mull-Tokens: Modality-Agnostic Latent Thinking for Multimodal Reasoning

ai-technology · 2026-05-01

Mull-Tokens is a new approach that introduces modality-agnostic latent tokens, allowing models to reason across text and images without relying on specialist tools or costly image generation. The tokens are pre-trained to hold intermediate information in either modality, so the model can reason in free form toward the correct answer. Drawing on latent reasoning frameworks, training first supervises the tokens with interleaved text-image traces, then fine-tunes using only final answers. The approach is evaluated on four challenging spatial reasoning benchmarks and is presented as more scalable and robust than existing multimodal models, which are described as brittle.

Key facts

Mull-Tokens are modality-agnostic latent tokens pre-trained to hold intermediate information in text or image modalities.
The method avoids reliance on specialist tools, costly image generation, or handcrafted reasoning data.
Training first uses supervision from interleaved text-image traces, then fine-tunes without trace supervision, using only final answers.
Evaluated on four challenging spatial reasoning benchmarks.
The approach is inspired by latent reasoning frameworks.
Existing multimodal models are described as brittle and not scalable.
The work is published on arXiv with ID 2512.10941.
The paper explores best practices for training Mull-Tokens.
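
The two-stage training described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `Example` container, the mean-squared-error trace loss, the `trace_weight` parameter, and the function names are all hypothetical, chosen only to show how a trace-supervised stage might differ from an answer-only stage.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


@dataclass
class Example:
    """Hypothetical container for one training example (an assumption, not from the paper)."""
    latent_states: List[Vector]            # model states at the latent token slots
    trace_targets: Optional[List[Vector]]  # embeddings of text/image trace steps, if available
    answer_logprob: float                  # log-probability assigned to the correct final answer


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def stage1_loss(ex: Example, trace_weight: float = 1.0) -> float:
    """Stage 1 (assumed form): answer loss plus supervision of latent slots by trace steps."""
    if ex.trace_targets is None:
        raise ValueError("stage 1 needs interleaved trace targets")
    answer_loss = -ex.answer_logprob
    slot_losses = [mse(s, t) for s, t in zip(ex.latent_states, ex.trace_targets)]
    trace_loss = sum(slot_losses) / len(slot_losses)
    return answer_loss + trace_weight * trace_loss


def stage2_loss(ex: Example) -> float:
    """Stage 2 (assumed form): only the final answer is supervised; latent slots are free."""
    return -ex.answer_logprob


example = Example(
    latent_states=[[1.0, 0.0], [0.0, 1.0]],
    trace_targets=[[1.0, 0.0], [0.0, 0.0]],
    answer_logprob=-0.5,
)
loss_with_traces = stage1_loss(example)
loss_answer_only = stage2_loss(example)
```

In this sketch the latent slots carry no modality label: a trace target could be a text embedding or an image embedding, which is one way to read "modality-agnostic". In the second stage the slots receive no direct target at all, so whatever they hold is shaped only by the final-answer loss.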
