GeoWorld-VLM Enhances Spatial Reasoning in Vision-Language Models

ai-technology · 2026-05-20

GeoWorld-VLM is a distillation framework that improves spatial reasoning in Vision-Language Models (VLMs) by transferring geometric structure from frozen, camera-conditioned video world models. It fine-tunes only the image encoder and the multimodal projector, aligning post-projector image features with the world model's intermediate representations while the main backbone stays frozen. Given images, a prompt, and a camera path, the world-model teacher turns static visual input into synthetic multi-view spatial signals. The approach targets a known weakness of VLMs, which often fail at basic spatial relations such as "left of" or "behind" because 3D information is lost during feature processing. The paper is available on arXiv under ID 2605.16713.
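To make the alignment idea concrete, here is a minimal sketch of a feature-alignment objective. The summary does not state which loss the paper uses, so the cosine-distance form below is an assumption, as are the function names and the premise that student and teacher features have already been mapped to the same dimension and token count.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unaligned
    return dot / (norm_u * norm_v)


def alignment_loss(student_tokens, teacher_tokens):
    """Mean cosine distance between paired token features.

    student_tokens: post-projector image features from the VLM (hypothetical layout:
                    one list of floats per token).
    teacher_tokens: intermediate world-model features, assumed to be already
                    resampled to the same token count and dimension.
    """
    if not student_tokens or len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher need the same non-zero token count")
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_tokens, teacher_tokens)
    ]
    return sum(distances) / len(distances)


# Toy example: the first token pair is parallel, the second is orthogonal.
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[2.0, 0.0], [1.0, 0.0]]
loss = alignment_loss(student, teacher)
```

In a real training loop the teacher features would come from the frozen world model and would carry no gradient; only the student side would be updated.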

Key facts

GeoWorld-VLM is a VLM-side distillation framework.
It transfers geometric structure from frozen camera-conditioned video world models into VLMs.
It fine-tunes only the image encoder and multimodal projector.
The approach aligns post-projector image features with intermediate world-model representations.
The main backbone remains frozen.
The world-model teacher converts static visual input into synthetic multi-view spatial signals.
VLMs often fail at spatial relations like left of, on, behind, and between.
The paper is available on arXiv with ID 2605.16713.
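The "fine-tune only the image encoder and projector" setup can be sketched as a split of parameter names into trainable and frozen groups. The module prefixes below (`image_encoder.`, `projector.`, `language_model.`, `world_model.`) are hypothetical names chosen for illustration, not identifiers from the paper.

```python
# Hypothetical module prefixes; a real VLM checkpoint may name these differently.
TRAINABLE_PREFIXES = ("image_encoder.", "projector.")


def is_trainable(param_name, trainable_prefixes=TRAINABLE_PREFIXES):
    """Return True if the parameter belongs to a module that gets fine-tuned."""
    return param_name.startswith(trainable_prefixes)


def split_parameters(param_names):
    """Partition parameter names into (trainable, frozen) lists, preserving order."""
    trainable = [name for name in param_names if is_trainable(name)]
    frozen = [name for name in param_names if not is_trainable(name)]
    return trainable, frozen


param_names = [
    "image_encoder.blocks.0.weight",
    "projector.linear.weight",
    "language_model.layers.0.weight",  # main backbone: stays frozen
    "world_model.blocks.0.weight",     # teacher: stays frozen
]
trainable, frozen = split_parameters(param_names)
```

In a framework such as PyTorch, the same split would typically be applied by setting `requires_grad` to `False` on every frozen parameter and passing only the trainable ones to the optimizer.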
