Decomposed Framework Improves Open-Vocabulary Segmentation

ai-technology · 2026-05-18

Decomposed Vision-Language Alignment, a recently proposed framework, targets fine-grained open-vocabulary segmentation by factorizing textual prompts into separate concept and attribute tokens, so that each semantic unit gets its own cross-modal interaction. A Feature-Gated Cross-Attention module produces attribute-specific gating maps that are fused multiplicatively, reinforcing compositional semantics. At the scoring stage, per-token similarities are aggregated in log-space for stable, interpretable matching. The method can be integrated into existing transformer-based segmentation models and improves generalization to previously unseen object-attribute pairs. The paper is available on arXiv (2605.15942).
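To make the idea of attribute-specific gating with multiplicative fusion concrete, here is a minimal sketch in plain Python. It covers only the gating step, not the cross-attention itself, and it is not the paper's implementation: the sigmoid gate, the dot-product scoring, and the names `gating_map` and `gated_fusion` are all assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def gating_map(pixel_features, attribute_embedding):
    """Return one gate value in (0, 1) per spatial location for one attribute token.

    Assumption: the gate is a sigmoid over a dot product. The paper may use a
    different parameterization.
    """
    return [sigmoid(dot(f, attribute_embedding)) for f in pixel_features]


def gated_fusion(pixel_features, attribute_embeddings):
    """Scale each location's features by the product of all attribute gates."""
    gates = [gating_map(pixel_features, a) for a in attribute_embeddings]
    fused = []
    for i, feature in enumerate(pixel_features):
        g = 1.0
        for gate in gates:
            g *= gate[i]  # multiplicative fusion across attributes
        fused.append([g * x for x in feature])
    return fused


# Toy data (hypothetical): two locations, one attribute token.
features = [[2.0, 0.0], [-2.0, 0.0]]
red = [3.0, 0.0]
fused = gated_fusion(features, [red])
```

In this toy setup the first location agrees with the attribute embedding and keeps nearly all of its feature magnitude, while the second is suppressed toward zero. Because the gates multiply, a location has to satisfy every attribute to pass through at full strength.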

Key facts

The framework factorizes textual prompts into concept and attribute tokens.
A Feature-Gated Cross-Attention module generates attribute-specific gating maps.
Per-token similarities are aggregated in log-space for compositional matching.
The method integrates into existing transformer-based segmentation architectures.
It improves generalization to unseen combinations of object categories and attributes.
The paper is published on arXiv with ID 2605.15942.
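
The log-space aggregation listed above can be sketched as a sum of per-token log-similarities, which equals the log of their product while avoiding numerical underflow. This is an illustrative assumption rather than the paper's exact scoring rule: the cosine similarity, its rescaling to [0, 1], the `eps` floor, and the function names are hypothetical.

```python
import math


def cosine(a, b) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def token_similarities(region_embedding, token_embeddings, eps=1e-6):
    """Per-token similarity in [eps, 1], rescaled from cosine similarity."""
    sims = []
    for token in token_embeddings:
        sim = (cosine(region_embedding, token) + 1.0) / 2.0
        sims.append(max(sim, eps))  # floor keeps the log finite
    return sims


def log_space_score(region_embedding, token_embeddings, eps=1e-6):
    """Sum of log-similarities, i.e. the log of the product over tokens."""
    sims = token_similarities(region_embedding, token_embeddings, eps)
    return sum(math.log(s) for s in sims)


# Toy prompt "red car" (hypothetical embeddings): one concept, one attribute.
car, red = [1.0, 0.0], [0.0, 1.0]
red_car_region = [1.0, 1.0]    # agrees with both tokens
blue_car_region = [1.0, -1.0]  # agrees with "car", disagrees with "red"

red_car_score = log_space_score(red_car_region, [car, red])
blue_car_score = log_space_score(blue_car_region, [car, red])
```

Each token contributes its own additive term, so a low score can be traced to the specific concept or attribute that failed to match. Here the region that contradicts "red" scores well below the region that matches both tokens.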
