ARTFEED — Contemporary Art Intelligence

Mull-Tokens: Modality-Agnostic Latent Thinking for Multimodal Reasoning

ai-technology · 2026-05-01

Mull-Tokens is a new approach that introduces modality-agnostic latent tokens, which let models reason across text and images without relying on specialist tools or costly image generation. The tokens are pre-trained to hold intermediate information in either modality, so the model can reason in free form toward the correct answer. Inspired by latent reasoning frameworks, training first uses supervision from interleaved text-image traces, then fine-tunes using only final answers. The approach is tested on four challenging spatial reasoning benchmarks and is presented as more scalable and robust than existing multimodal models, which the paper describes as brittle.

Key facts

  • Mull-Tokens are modality-agnostic latent tokens pre-trained to hold intermediate information in text or image modalities.
  • The method avoids reliance on specialist tools, costly image generation, or handcrafted reasoning data.
  • Training first uses supervision from interleaved text-image traces, then fine-tunes with no supervision on the intermediate tokens, using only final answers.
  • Evaluated on four challenging spatial reasoning benchmarks.
  • The approach is inspired by latent reasoning frameworks.
  • Existing multimodal models are described as brittle and not scalable.
  • The work is published on arXiv with ID 2512.10941.
  • The paper explores best practices for training Mull-Tokens.
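
The two-stage training described above can be illustrated with a minimal sketch of which token positions receive a training signal in each stage. This is not the paper's implementation: the segment labels, the `Segment` type, and the `loss_mask` helper are hypothetical names introduced here, and supervising the answer during stage 1 is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical segment labels; the paper's actual token layout may differ.
QUESTION, LATENT, ANSWER = "question", "latent", "answer"


@dataclass
class Segment:
    kind: str    # one of QUESTION, LATENT, ANSWER
    length: int  # number of token positions in this segment


def loss_mask(segments: List[Segment], stage: int) -> List[int]:
    """Return 1 for positions that receive a training signal, else 0.

    Stage 1: latent slots are supervised from interleaved text-image traces
             (the answer is also supervised here, which is an assumption).
    Stage 2: only the final answer is supervised; latent slots are left free.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    supervised = {ANSWER} if stage == 2 else {LATENT, ANSWER}
    mask: List[int] = []
    for seg in segments:
        mask.extend([1 if seg.kind in supervised else 0] * seg.length)
    return mask


# A toy sequence: 3 question tokens, 2 latent slots, 1 answer token.
example = [Segment(QUESTION, 3), Segment(LATENT, 2), Segment(ANSWER, 1)]
stage1_mask = loss_mask(example, stage=1)
stage2_mask = loss_mask(example, stage=2)
```

In this sketch the only difference between the stages is the mask: once the latent slots are no longer supervised, whatever they hold is shaped only by the final-answer signal.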

Entities

Institutions

  • arXiv

Sources