ARTFEED — Contemporary Art Intelligence

X2SAM: Unified Segmentation Model for Images and Videos

ai-technology · 2026-05-06

Researchers have introduced X2SAM, a unified segmentation multimodal large language model (MLLM) that extends any-segmentation capabilities from images to videos. The model couples an LLM with a Mask Memory module to produce temporally consistent video masks, and it accepts both visual and textual prompts. This addresses a shortcoming of existing segmentation MLLMs, which are typically tailored to either images or videos and struggle with intricate conversational directives. X2SAM supports generic, open-vocabulary, referring, reasoning, and grounded conversational segmentation in both modalities.

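The paper's Mask Memory module is not described in implementation detail here, but the core idea of carrying mask information across frames for temporal consistency can be sketched. The following is a hypothetical toy illustration, assuming a memory that stores recent per-frame mask logits and blends their average into the current frame's prediction; the class name, parameters, and fusion rule are illustrative assumptions, not the authors' method.

```python
import numpy as np

class MaskMemory:
    """Toy sketch of a mask-memory mechanism (hypothetical; not the
    actual X2SAM implementation). Stores mask logits from recent
    frames and blends their average into the current frame's
    prediction to encourage temporally consistent masks."""

    def __init__(self, capacity=4, momentum=0.7):
        self.capacity = capacity  # number of past frames to retain
        self.momentum = momentum  # weight given to the memory average
        self.memory = []          # list of past per-pixel mask logits

    def update(self, mask_logits):
        """Push the latest frame's mask logits, evicting the oldest."""
        self.memory.append(mask_logits)
        if len(self.memory) > self.capacity:
            self.memory.pop(0)

    def fuse(self, current_logits):
        """Blend current logits with the memory average; pass through
        unchanged when the memory is empty (first frame)."""
        if not self.memory:
            return current_logits
        mem_avg = np.mean(self.memory, axis=0)
        return self.momentum * mem_avg + (1 - self.momentum) * current_logits

# Usage: run per-frame logits through the memory, then threshold.
memory = MaskMemory()
frame_logits = [np.random.randn(64, 64) for _ in range(5)]
masks = []
for logits in frame_logits:
    fused = memory.fuse(logits)
    memory.update(fused)
    masks.append(fused > 0)  # binary mask for this frame
```

In the real model the memory would operate on learned mask embeddings conditioned by the LLM's prompt interpretation rather than raw logits; this sketch only shows the recurrence pattern that makes consecutive frame masks agree.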
Key facts

  • X2SAM is a unified segmentation MLLM for images and videos
  • It uses an LLM with a Mask Memory module
  • Supports textual and visual prompts
  • Enables temporally consistent video mask generation
  • Addresses limitations of existing segmentation MLLMs
  • Supports generic, open-vocabulary, referring, reasoning, and grounded conversational segmentation
  • Published on arXiv under ID 2605.00891 (cross-listed)
