Decomposed Framework Improves Open-Vocabulary Segmentation
Decomposed Vision-Language Alignment, a recently proposed framework, targets fine-grained open-vocabulary segmentation by factorizing each textual prompt into separate concept and attribute tokens, so that every semantic unit gets its own cross-modal interaction. It introduces a Feature-Gated Cross-Attention module that produces attribute-specific gating maps for multiplicative fusion, which reinforces compositional semantics. At the scoring stage, per-token similarities are aggregated in log-space for stable and interpretable matching. The method can be integrated into existing transformer-based segmentation models and improves generalization to previously unseen object-attribute pairs. The paper is available on arXiv (2605.15942).
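To make the idea of an attribute-specific gating map concrete, here is a minimal sketch of multiplicative gating. It is an illustration only, not the paper's module: the summary does not describe the actual Feature-Gated Cross-Attention design, so the sigmoid gate, the scaled dot product, and the names `gating_map` and `gated_fusion` are all assumptions.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gating_map(pixel_feats, attr_embed):
    """Hypothetical gate: one value in (0, 1) per pixel, computed as the
    sigmoid of the scaled dot product between the pixel feature and the
    attribute token embedding."""
    scale = math.sqrt(len(attr_embed))
    return [
        sigmoid(sum(f * a for f, a in zip(feat, attr_embed)) / scale)
        for feat in pixel_feats
    ]

def gated_fusion(pixel_feats, attr_embed):
    """Multiplicative fusion (assumed form): scale each pixel feature
    vector by its attribute-specific gate."""
    gates = gating_map(pixel_feats, attr_embed)
    return [[g * f for f in feat] for g, feat in zip(gates, pixel_feats)]

# Toy data: two "pixels" with 2-D features and one attribute embedding.
pixel_feats = [[2.0, 0.0], [-2.0, 0.0]]
attr_embed = [1.0, 0.0]
gates = gating_map(pixel_feats, attr_embed)
fused = gated_fusion(pixel_feats, attr_embed)
```

In this toy setup the first pixel aligns with the attribute and keeps most of its feature magnitude, while the second is suppressed. A real model would presumably operate on batched tensors and learn the projections that produce the gate.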
Key facts
- The framework factorizes textual prompts into concept and attribute tokens.
- A Feature-Gated Cross-Attention module generates attribute-specific gating maps.
- Per-token similarities are aggregated in log-space for compositional matching.
- The method integrates into existing transformer-based segmentation architectures.
- It improves generalization to unseen combinations of object categories and attributes.
- The paper is published on arXiv with ID 2605.15942.
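The log-space aggregation listed among the key facts can be sketched as follows. This is a hedged illustration, not the paper's scoring function: it assumes each per-token similarity has already been mapped into (0, 1], and the helper names `log_space_score` and `rank_regions`, the `eps` floor, and the toy numbers are made up for the example. Summing logs equals the log of the product, so a region must match every token to score well, and each token's contribution stays separately inspectable.

```python
import math

def log_space_score(token_sims, eps=1e-8):
    """Sum of log-similarities over prompt tokens (assumed in (0, 1]).
    The eps floor avoids log(0) for a completely unmatched token."""
    return sum(math.log(max(s, eps)) for s in token_sims)

def rank_regions(region_sims):
    """Order candidate regions from best to worst aggregated score."""
    scores = {name: log_space_score(sims) for name, sims in region_sims.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy similarities for the prompt "red car": [concept "car", attribute "red"].
region_sims = {
    "red_car": [0.9, 0.8],
    "blue_car": [0.9, 0.1],
    "red_wall": [0.2, 0.9],
}
ranking = rank_regions(region_sims)
```

Here the region that matches both the concept and the attribute ranks first, while a strong match on only one token is penalized by the low log-similarity of the other.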
Entities
Institutions
- arXiv