SketchVLM: VLMs Generate Editable SVG Overlays to Explain Visual Reasoning
Researchers have developed SketchVLM, a training-free framework that lets vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 generate editable SVG overlays on images. This allows VLMs to articulate their reasoning visually, by pointing, labeling, and drawing, rather than being confined to text-only outputs. Across seven benchmarks spanning visual reasoning tasks (such as maze navigation and object counting) and drawing tasks (such as part labeling and shape drawing), SketchVLM improved accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x compared with traditional image-editing and fine-tuned sketching baselines. The annotations track the model's stated answers closely, and the framework is model-agnostic and runs in a single-turn generation mode.
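To make the "non-destructive, editable overlay" idea concrete, here is a minimal Python sketch of how annotation markup can be layered over an image inside an SVG document without altering the underlying pixels. The `compose_overlay` helper and the annotation strings are illustrative assumptions, not SketchVLM's actual pipeline; the key property shown is that the image is referenced as-is while every shape and label remains an individually editable SVG element.

```python
def compose_overlay(image_href: str, annotations_svg: str,
                    width: int, height: int) -> str:
    """Layer annotation elements over the original image; the image
    pixels are untouched and each overlay shape stays editable."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">\n'
        f'  <image href="{image_href}" width="{width}" height="{height}"/>\n'
        f'  <g id="annotations">\n    {annotations_svg}\n  </g>\n'
        f'</svg>'
    )

# An annotation layer a VLM might emit: a pointer dot plus a text label.
layer = ('<circle cx="120" cy="80" r="6" fill="red"/>'
         '<text x="130" y="84" font-size="14" fill="red">exit</text>')
print(compose_overlay("scene.png", layer, 640, 480))
```

Because the annotations live in their own `<g>` group, they can be moved, restyled, or deleted in any SVG editor, which is what distinguishes this approach from baseline methods that edit image pixels directly.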
Key facts
- SketchVLM is a training-free, model-agnostic framework for VLMs.
- It produces non-destructive, editable SVG overlays on input images.
- Tested on VLMs including Gemini-3-Pro and GPT-5.
- Evaluated across seven benchmarks spanning visual reasoning and drawing, including maze navigation, ball-drop trajectory prediction, object counting, part labeling, connecting-the-dots, and drawing shapes around objects.
- Improved visual reasoning task accuracy by up to +28.5 percentage points.
- Improved annotation quality by up to 1.48x relative to baselines.
- Annotations are more faithful to the model's stated answer.
- Single-turn generation process (a parsing sketch follows this list).
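The source does not specify how the SVG is recovered from the model's single-turn response, so the following is a plausible post-processing sketch under the assumption that the model interleaves its textual answer with an SVG fragment in one reply. The `extract_svg` helper is hypothetical; it only uses the Python standard library to locate the fragment and confirm it is well-formed before rendering or editing.

```python
import re
import xml.etree.ElementTree as ET

def extract_svg(response: str) -> str | None:
    """Pull the first <svg>...</svg> fragment out of a single-turn model
    response and verify it parses as XML; return None if absent/broken."""
    match = re.search(r"<svg\b.*?</svg>", response, flags=re.DOTALL)
    if match is None:
        return None
    fragment = match.group(0)
    try:
        ET.fromstring(fragment)  # raises ParseError on malformed markup
    except ET.ParseError:
        return None
    return fragment

reply = ('The exit is at the top-right corner. '
         '<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64">'
         '<path d="M4 60 L60 4" stroke="blue" fill="none"/></svg>')
print(extract_svg(reply))
```

Keeping the textual answer and the overlay in one turn is what lets the annotation be checked against the model's stated answer, the faithfulness property the benchmarks measure.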