ARTFEED — Contemporary Art Intelligence

Decomposed Framework Improves Open-Vocabulary Segmentation

ai-technology · 2026-05-18

A recently proposed framework, Decomposed Vision-Language Alignment, targets fine-grained open-vocabulary segmentation by factorizing each textual prompt into separate concept and attribute tokens, so that every semantic unit gets its own cross-modal interaction with the visual features. A Feature-Gated Cross-Attention module produces attribute-specific gating maps for multiplicative fusion, which reinforces compositional semantics. At the scoring stage, per-token similarities are aggregated in log-space for stable, interpretable matching. The method can be integrated into existing transformer-based segmentation models and improves generalization to previously unseen object-attribute pairs. The paper is available on arXiv (2605.15942).
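To make the gating idea concrete, here is a minimal NumPy sketch of attribute-specific gates applied multiplicatively to a concept response. It is an illustration under stated assumptions, not the paper's implementation: the function name `gated_fusion`, the sigmoid gates, the scaled dot-product scoring, and all shapes are hypothetical choices, and the real module is a learned cross-attention layer whose details are not given here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(pixel_feats, concept_emb, attr_embs):
    """Hypothetical sketch of multiplicative, attribute-gated fusion.

    pixel_feats: (N, D) visual features, one row per pixel or patch
    concept_emb: (D,)   embedding of the concept token (e.g. "car")
    attr_embs:   (A, D) embeddings of the attribute tokens (e.g. "red")
    Returns the fused response (N,), the concept response (N,), and gates (A, N).
    """
    d = pixel_feats.shape[1]
    # One gating map per attribute token, squashed into (0, 1).
    gates = sigmoid(attr_embs @ pixel_feats.T / np.sqrt(d))
    # Response of each pixel to the concept token alone.
    concept_map = sigmoid(pixel_feats @ concept_emb / np.sqrt(d))
    # Multiplicative fusion: a pixel scores high only if the concept
    # and every attribute gate agree.
    fused = concept_map * gates.prod(axis=0)
    return fused, concept_map, gates

rng = np.random.default_rng(0)
pixel_feats = rng.normal(size=(6, 4))   # 6 pixels, 4-dim features (toy sizes)
concept_emb = rng.normal(size=4)
attr_embs = rng.normal(size=(2, 4))     # two attribute tokens
fused, concept_map, gates = gated_fusion(pixel_feats, concept_emb, attr_embs)
```

Because each gate lies strictly between 0 and 1, the fused response can never exceed the concept response, so an attribute can only suppress pixels and never create a match on its own.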

Key facts

  • The framework factorizes textual prompts into concept and attribute tokens.
  • A Feature-Gated Cross-Attention module generates attribute-specific gating maps.
  • Per-token similarities are aggregated in log-space for compositional matching.
  • The method integrates into existing transformer-based segmentation architectures.
  • It improves generalization to unseen combinations of object categories and attributes.
  • The paper is published on arXiv with ID 2605.15942.
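The log-space aggregation mentioned above can be sketched as follows. This is a toy example under assumptions rather than the paper's scoring function: it treats per-token similarities as values in (0, 1], sums their logarithms (equivalent to the log of their product), and uses a hypothetical helper `aggregate_log_space` with an assumed clipping constant `eps`.

```python
import numpy as np

def aggregate_log_space(similarities, eps=1e-6):
    """Combine per-token similarity maps by summing their logs.

    similarities: (T, H, W) array, one map per token, values in (0, 1].
    Returns an (H, W) log-score map. Clipping at eps (an assumed constant)
    keeps log() finite when a similarity is zero.
    """
    sims = np.clip(np.asarray(similarities, dtype=np.float64), eps, 1.0)
    return np.log(sims).sum(axis=0)

# Toy 2x2 maps for a prompt such as "red car" (values are made up).
concept = np.array([[0.9, 0.8],
                    [0.1, 0.9]])      # similarity to the concept token
attribute = np.array([[0.9, 0.1],
                      [0.9, 0.2]])    # similarity to the attribute token

log_score = aggregate_log_space(np.stack([concept, attribute]))
score = np.exp(log_score)             # back to product form for inspection
best = np.unravel_index(np.argmax(log_score), log_score.shape)
```

In this toy setup only the top-left location matches both tokens well, so it receives the highest combined score. Summing logs instead of multiplying raw scores avoids underflow when many tokens are combined, and each token's additive contribution to the total can be read off directly.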

Entities

Institutions

  • arXiv
