InterSketch: Interleaved Visual-Textual Reasoning Model

ai-technology · 2026-05-27

InterSketch is a new AI model that interleaves visual sketches with textual chain-of-thought to strengthen visual reasoning. It targets the shallow, text-centric reasoning of current vision-language models: rather than reasoning in text alone, it generates intermediate visual sketches with external tools and integrates them into the textual reasoning. A self-correcting mechanism and a stepwise reward are used to improve long-horizon visual understanding. A cold-start stage trains on a synthesized, high-quality interleaved visual-textual chain-of-thought (VT-CoT) dataset that incorporates a reflection mechanism. The paper is available on arXiv under ID 2605.26520.

Key facts

InterSketch is an interleaved reasoning model that targets the shallow, text-centric reasoning of current vision-language models.
It generates intermediate visual sketches using external tools.
The model interleaves visual sketches with textual reasoning.
It uses a self-correcting mechanism and stepwise reward.
The cold-start stage uses a synthesized interleaved VT-CoT dataset.
The dataset includes a reflection mechanism.
The paper is on arXiv with ID 2605.26520.
The model aims to improve long-horizon visual understanding.
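
To make the interleaving idea concrete, here is a minimal toy sketch in Python of a reasoning loop that alternates text and sketch steps, scores each step individually, and retries a step whose score is too low. It is not the paper's implementation: every name (Step, Trace, run_interleaved, toy_propose, toy_score), the threshold, and the retry rule are assumptions made for illustration, and the proposer and scorer are stand-ins for a real model, drawing tool, and reward function.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    kind: str            # "text" or "sketch" (hypothetical step types)
    content: str         # placeholder for a thought or a drawn sketch
    reward: float = 0.0  # per-step score, assigned by the scorer


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)

    def total_reward(self) -> float:
        # Sum of the per-step scores of all accepted steps.
        return sum(step.reward for step in self.steps)


def run_interleaved(
    propose: Callable[[List[Step], int], Step],
    score: Callable[[Step], float],
    max_steps: int = 4,
    threshold: float = 0.5,
    max_retries: int = 2,
) -> Trace:
    """Build a trace step by step, scoring each step as it is produced.

    A step scoring below `threshold` is proposed again, up to
    `max_retries` times (an assumed, simplified form of self-correction).
    The last attempt is kept even if it is still below the threshold.
    """
    trace = Trace()
    for _ in range(max_steps):
        for attempt in range(max_retries + 1):
            step = propose(trace.steps, attempt)
            step.reward = score(step)
            if step.reward >= threshold:
                break
        trace.steps.append(step)
    return trace


def toy_propose(history: List[Step], attempt: int) -> Step:
    # Alternate text and sketch steps; the first sketch attempt is
    # deliberately poor so that the retry path is exercised.
    kind = "text" if len(history) % 2 == 0 else "sketch"
    quality = "bad" if kind == "sketch" and attempt == 0 else "ok"
    return Step(kind=kind, content=f"{quality} {kind} #{len(history)}")


def toy_score(step: Step) -> float:
    # Stand-in scorer: penalize steps marked "bad".
    return 0.0 if step.content.startswith("bad") else 1.0


if __name__ == "__main__":
    trace = run_interleaved(toy_propose, toy_score)
    for step in trace.steps:
        print(step.kind, step.content, step.reward)
```

Scoring each step rather than only the final answer is what lets the loop notice and redo a poor sketch before later steps build on it; with max_retries=0 the poor sketches would be kept and the total score would drop.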
