InterSketch: Interleaved Visual-Textual Reasoning Model
InterSketch is a model designed to improve visual reasoning by interleaving visual sketches with textual chain-of-thought. It targets the shallow, text-centric reasoning of current vision-language models: rather than reasoning in text alone, it generates intermediate visual sketches using external tools and integrates them into its textual reasoning. A self-correcting mechanism and a stepwise reward are used to improve long-horizon visual understanding. A cold-start stage uses a synthesized, high-quality interleaved visual-textual chain-of-thought (VT-CoT) dataset that incorporates a reflection mechanism. The paper is available on arXiv under ID 2605.26520.
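The interleaving described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: every name here (`Step`, `run_interleaved`, `toy_draw`, `toy_check`, `total_reward`) is hypothetical, sketches are represented by plain strings rather than images, and the rule of one reward point per step that passes a check is an illustrative stand-in for whatever stepwise reward the paper defines.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One entry of an interleaved trace (hypothetical structure)."""
    thought: str                  # textual reasoning for this step
    sketch: Optional[str] = None  # stand-in for a sketch from an external tool
    corrected: bool = False       # True if the sketch was redrawn after a failed check
    reward: float = 0.0           # per-step reward (assumed rule, see below)


def run_interleaved(
    thoughts: List[str],
    draw: Callable[[str], str],
    check: Callable[[str], bool],
    max_retries: int = 1,
) -> List[Step]:
    """Pair each textual thought with a sketch, redrawing sketches that fail a check."""
    trace: List[Step] = []
    for thought in thoughts:
        sketch = draw(thought)
        corrected = False
        retries = 0
        while not check(sketch) and retries < max_retries:
            # Self-correction (assumed form): revise the request and redraw.
            sketch = draw(thought + " (revised)")
            corrected = True
            retries += 1
        # Assumed stepwise reward: 1.0 if the final sketch passes the check, else 0.0.
        reward = 1.0 if check(sketch) else 0.0
        trace.append(Step(thought, sketch, corrected, reward))
    return trace


def total_reward(trace: List[Step]) -> float:
    """Sum the per-step rewards over the whole trace."""
    return sum(step.reward for step in trace)


def toy_draw(prompt: str) -> str:
    # Toy external tool: returns a text tag instead of an image.
    return f"<sketch:{prompt}>"


def toy_check(sketch: str) -> bool:
    # Toy validity check: reject "unclear" sketches unless they were revised.
    return "unclear" not in sketch or "(revised)" in sketch


trace = run_interleaved(["mark the triangle", "unclear region"], toy_draw, toy_check)
for step in trace:
    print(step)
print("total reward:", total_reward(trace))
```

In this toy run the second step fails its first check, is redrawn once, and then passes, so both steps receive a reward. Scoring each step separately, rather than only the final answer, is what makes the reward stepwise in this sketch.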
Key facts
- InterSketch is a model that brings interleaved visual-textual reasoning to vision-language models.
- It generates intermediate visual sketches using external tools.
- The model interleaves visual sketches with textual reasoning.
- It uses a self-correcting mechanism and stepwise reward.
- The cold-start stage uses a synthesized interleaved VT-CoT dataset.
- The dataset includes a reflection mechanism.
- The paper is on arXiv with ID 2605.26520.
- The model aims to improve long-horizon visual understanding.
Entities
Institutions
- arXiv