ARTFEED — Contemporary Art Intelligence

InterSketch: Interleaved Visual-Textual Reasoning Model

ai-technology · 2026-05-27

InterSketch is a new AI model that aims to strengthen visual reasoning by interleaving visual sketches with textual chain-of-thought. It targets the shallow, text-centric reasoning of current vision-language models: it generates intermediate visual sketches with external tools and integrates them into its textual reasoning. A self-correcting mechanism and a stepwise reward are used to improve long-horizon visual understanding. A cold-start stage uses a synthesized, high-quality interleaved visual-textual chain-of-thought (VT-CoT) dataset with a reflection mechanism. The paper is available on arXiv under ID 2605.26520.
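To make the idea of interleaving concrete, here is a minimal Python sketch of a reasoning loop that alternates text and sketch steps, scores each step, and retries weak ones. It is an illustration only, not the paper's implementation: the summary does not say how the stepwise reward or the self-correcting mechanism are actually applied, and every name here (Step, run_interleaved, toy_propose, toy_score, the 0.5 threshold) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    kind: str            # "text" or "sketch"
    content: str
    reward: float = 0.0  # per-step score, filled in by the loop


def run_interleaved(
    propose: Callable[[List[Step], int], Step],
    score: Callable[[List[Step], Step], float],
    max_steps: int = 4,
    threshold: float = 0.5,
    max_retries: int = 2,
) -> List[Step]:
    """Build a trace one step at a time, scoring each step and retrying weak ones."""
    trace: List[Step] = []
    for _ in range(max_steps):
        for attempt in range(max_retries + 1):
            step = propose(trace, attempt)
            step.reward = score(trace, step)  # stepwise reward (assumed form)
            if step.reward >= threshold:
                break                         # step accepted
        trace.append(step)                    # keep the last attempt either way
    return trace


# Toy stand-ins (hypothetical): alternate text and sketch steps. The first
# attempt at every sketch is deliberately poor so the retry path is exercised.
def toy_propose(trace: List[Step], attempt: int) -> Step:
    if len(trace) % 2 == 0:
        return Step("text", f"thought {len(trace)}")
    quality = "rough" if attempt == 0 else "clean"
    return Step("sketch", f"{quality} sketch {len(trace)}")


def toy_score(trace: List[Step], step: Step) -> float:
    return 0.2 if step.content.startswith("rough") else 0.9


trace = run_interleaved(toy_propose, toy_score)
for s in trace:
    print(s.kind, s.content, s.reward)
```

In a real system, toy_propose would be replaced by a vision-language model that can call an external drawing tool, and toy_score by whatever per-step signal is available; the loop structure is the only part this sketch is meant to show.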

Key facts

  • InterSketch is a model for interleaved visual-textual reasoning, aimed at the shallow, text-centric reasoning of current vision-language models.
  • It generates intermediate visual sketches using external tools.
  • The model interleaves visual sketches with textual reasoning.
  • It uses a self-correcting mechanism and a stepwise reward.
  • The cold-start stage uses a synthesized interleaved VT-CoT dataset.
  • The dataset includes a reflection mechanism.
  • The paper is on arXiv with ID 2605.26520.
  • The model aims to improve long-horizon visual understanding.

Entities

Institutions

  • arXiv
