ARTFEED — Contemporary Art Intelligence

X2SAM: Unified Segmentation Model for Images and Videos

ai-technology · 2026-05-06

Researchers have introduced X2SAM, a unified segmentation multimodal large language model (MLLM) that extends any-segmentation capabilities from images to videos. The model couples an LLM with a Mask Memory module to produce temporally consistent video masks, and it accepts both visual and textual prompts. This addresses a shortcoming of existing segmentation MLLMs, which are typically tailored to either images or videos and struggle with intricate conversational directives. X2SAM supports generic, open-vocabulary, referring, reasoning, and grounded conversational segmentation in both modalities.

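The paper's Mask Memory module is not described in implementation detail here, but the core idea of carrying mask information across frames for temporal consistency can be sketched. The following is a hypothetical toy illustration, assuming a memory that stores recent per-frame mask logits and blends their average into the current frame's prediction; the class name, parameters, and fusion rule are illustrative assumptions, not the authors' method.

```python
import numpy as np

class MaskMemory:
    """Toy sketch of a mask-memory mechanism (hypothetical; not the
    actual X2SAM implementation). Stores mask logits from recent
    frames and blends their average into the current frame's
    prediction to encourage temporally consistent masks."""

    def __init__(self, capacity=4, momentum=0.7):
        self.capacity = capacity  # number of past frames to retain
        self.momentum = momentum  # weight given to the memory average
        self.memory = []          # list of past per-pixel mask logits

    def update(self, mask_logits):
        """Push the latest frame's mask logits, evicting the oldest."""
        self.memory.append(mask_logits)
        if len(self.memory) > self.capacity:
            self.memory.pop(0)

    def fuse(self, current_logits):
        """Blend current logits with the memory average; pass through
        unchanged when the memory is empty (first frame)."""
        if not self.memory:
            return current_logits
        mem_avg = np.mean(self.memory, axis=0)
        return self.momentum * mem_avg + (1 - self.momentum) * current_logits

# Usage: run per-frame logits through the memory, then threshold.
memory = MaskMemory()
frame_logits = [np.random.randn(64, 64) for _ in range(5)]
masks = []
for logits in frame_logits:
    fused = memory.fuse(logits)
    memory.update(fused)
    masks.append(fused > 0)  # binary mask for this frame
```

In the real model the memory would operate on learned mask embeddings conditioned by the LLM's prompt interpretation rather than raw logits; this sketch only shows the recurrence pattern that makes consecutive frame masks agree.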
Key facts

  • X2SAM is a unified segmentation MLLM for images and videos
  • It uses an LLM with a Mask Memory module
  • Supports textual and visual prompts
  • Enables temporally consistent video mask generation
  • Addresses limitations of existing segmentation MLLMs
  • Supports generic, open-vocabulary, referring, reasoning, and grounded conversational segmentation
  • Published on arXiv under ID 2605.00891 (cross-listed)
