ARTFEED — Contemporary Art Intelligence

Qwen3-VL-Seg: Open-World Referring Segmentation with Vision-Language Grounding

ai-technology · 2026-05-11

Qwen3-VL-Seg is a parameter-efficient framework for open-world referring segmentation, the task of grounding unconstrained language expressions to pixel-level regions. Existing multimodal large language models (MLLMs) excel at open-world visual grounding but are limited to sparse bounding-box coordinates, which are insufficient for dense pixel-level prediction. Current MLLM-based segmentation methods either generate sparse contour coordinates, struggling with continuous object boundaries, or rely on external segmenters such as SAM, which adds architectural and deployment overhead. Qwen3-VL-Seg instead treats the MLLM-predicted box as a semantically grounded structural prior and translates it into a pixel-level mask through a lightweight box-guided mask decoder that combines multi-scale spatial feature injection with spatial-semantic queries, improving efficiency while keeping the architecture simple.
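
The decoder's mechanics can be pictured with a short sketch. The PyTorch code below is illustrative only, not the paper's implementation: every name (BoxGuidedMaskDecoder, rasterize_box) and design choice (adding a rasterized box prior to each feature scale, learned queries cross-attending to the injected features) is an assumption about how a box-guided mask decoder with multi-scale spatial feature injection and spatial-semantic queries might be wired together.

    import torch
    import torch.nn as nn

    class BoxGuidedMaskDecoder(nn.Module):
        # Hypothetical sketch: a rasterized box prior is injected into every
        # feature scale, and learned spatial-semantic queries cross-attend to
        # the injected features before being decoded into mask logits.
        def __init__(self, dim=256, num_queries=8):
            super().__init__()
            self.box_embed = nn.Conv2d(1, dim, kernel_size=1)   # lifts the box prior to feature width
            self.queries = nn.Parameter(torch.randn(num_queries, dim))
            self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.mask_proj = nn.Linear(dim, dim)

        @staticmethod
        def rasterize_box(box, h, w):
            # box: (x0, y0, x1, y1), normalized to [0, 1] -> binary prior map
            x0, y0, x1, y1 = (box * torch.tensor([w, h, w, h])).long().tolist()
            prior = torch.zeros(1, 1, h, w)
            prior[..., y0:y1, x0:x1] = 1.0
            return prior

        def forward(self, feats, box):
            # feats: list of (1, dim, H_i, W_i) multi-scale vision features
            injected = []
            for f in feats:
                prior = self.rasterize_box(box, f.shape[-2], f.shape[-1])
                injected.append(f + self.box_embed(prior))      # multi-scale spatial feature injection
            # flatten every scale into one token sequence for cross-attention
            tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in injected], dim=1)
            q = self.queries.unsqueeze(0)                       # (1, num_queries, dim)
            q, _ = self.cross_attn(q, tokens, tokens)           # spatial-semantic queries
            fine = injected[0]                                  # finest scale, (1, dim, H0, W0)
            logits = torch.einsum("bqc,bchw->bqhw", self.mask_proj(q), fine)
            return logits.max(dim=1, keepdim=True).values       # fuse queries into one mask

    # Toy usage: two feature scales and one MLLM-predicted box.
    decoder = BoxGuidedMaskDecoder()
    feats = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32)]
    box = torch.tensor([0.2, 0.3, 0.7, 0.9])                    # normalized xyxy
    mask = torch.sigmoid(decoder(feats, box))                   # (1, 1, 64, 64) soft mask

The key point the sketch tries to capture is that no external segmenter is needed: the box prior and the shared vision features alone drive a small decoder head, which is where the claimed efficiency gains would come from.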

Key facts

  • Qwen3-VL-Seg is a parameter-efficient framework for open-world referring segmentation.
  • It grounds unconstrained language expressions to pixel-level regions.
  • Existing MLLMs are limited to sparse bounding-box coordinates.
  • Current MLLM-based segmentation methods struggle with continuous object boundaries or rely on external models like SAM.
  • Qwen3-VL-Seg uses an MLLM-predicted box as a structural prior.
  • A lightweight box-guided mask decoder combines multi-scale spatial feature injection and spatial-semantic queries.
  • The framework reduces architectural and deployment overhead.
  • The paper is available on arXiv with ID 2605.07141.

Entities

Institutions

  • arXiv
