AnchorDiff: Training-Free Concept Grounding for MM-DiTs
AnchorDiff is a training-free grounding method for Multi-Modal Diffusion Transformers (MM-DiTs). It targets concept leakage, which occurs when attention-based techniques produce overlapping activations for visually similar concepts. The method separates semantic localization from structural refinement: it selects a high-confidence anchor from concept-to-image attention maps and propagates it as a one-hot seed over a hybrid graph built from image-to-image self-attention. The graph uses output-space similarity for thorough within-object propagation and a row-wise attention gate to suppress cross-object connections. The authors also introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and distinct masks for precise evaluation. The paper is available on arXiv as 2605.26460.
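The anchor-then-propagate idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: the summary gives no formulas, so several choices here are assumptions made for illustration. These are cosine similarity as the "output-space similarity", a per-row quantile threshold as the "row-wise attention gate", and a random-walk-with-restart update as the propagation rule. All function names and parameters (`select_anchor`, `hybrid_graph`, `propagate`, `gate_quantile`, `alpha`) are hypothetical, and the six-token input is toy data.

```python
import numpy as np

def select_anchor(concept_attn):
    """Return the index of the image token with the highest concept attention."""
    return int(np.argmax(concept_attn))

def hybrid_graph(self_attn, outputs, gate_quantile=0.5):
    """Row-stochastic graph: output-space similarity, gated row-wise by self-attention.

    Assumption: cosine similarity and a per-row quantile gate stand in for
    the paper's (unspecified here) similarity and gating functions.
    """
    n = outputs.shape[0]
    unit = outputs / (np.linalg.norm(outputs, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(unit @ unit.T, 0.0, None)  # non-negative cosine similarity
    # Keep edge (i, j) only if self_attn[i, j] reaches row i's quantile threshold.
    thresh = np.quantile(self_attn, gate_quantile, axis=1, keepdims=True)
    gate = self_attn >= thresh
    weights = sim * gate + 1e-8 * np.eye(n)  # tiny self-loop keeps row sums positive
    return weights / weights.sum(axis=1, keepdims=True)

def propagate(graph, anchor, alpha=0.85, steps=50):
    """Spread a one-hot seed over the graph (random walk with restart, assumed)."""
    seed = np.zeros(graph.shape[0])
    seed[anchor] = 1.0
    score = seed.copy()
    for _ in range(steps):
        score = alpha * (graph @ score) + (1.0 - alpha) * seed
    return score

# Toy scene: tokens 0-2 belong to object A, tokens 3-5 to a look-alike object B.
concept_attn = np.array([0.5, 0.9, 0.4, 0.6, 0.7, 0.35])  # leaks onto object B
outputs = np.array([[1.0, 0.10], [1.0, 0.00], [1.0, 0.20],
                    [1.0, 0.15], [1.0, 0.05], [1.0, 0.10]])  # near-identical features
self_attn = np.full((6, 6), 0.1 / 3)  # weak cross-object attention
self_attn[:3, :3] = 0.3               # strong attention within object A
self_attn[3:, 3:] = 0.3               # strong attention within object B

# Thresholding the raw concept attention marks tokens on both objects.
naive_mask = concept_attn >= 0.5 * concept_attn.max()

anchor = select_anchor(concept_attn)
score = propagate(hybrid_graph(self_attn, outputs), anchor)
mask = score >= 0.5 * score.max()

print("naive mask:     ", naive_mask.astype(int))
print("propagated mask:", mask.astype(int))
```

In this toy setup the two objects have nearly identical output features, so similarity alone would connect them. The gate removes the cross-object edges, and the propagated mask stays on the object that contains the anchor.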
Key facts
- AnchorDiff is a training-free grounding method for MM-DiTs.
- It addresses concept leakage in attention-based methods.
- The method selects a high-confidence anchor from concept-to-image attention maps.
- It propagates the anchor as a one-hot seed over a hybrid graph from self-attention.
- The graph uses output-space similarity for within-object propagation.
- A row-wise attention gate suppresses cross-object connections.
- The Multi-Concept Confusion Dataset contains images with multiple visually similar concepts and distinct masks for evaluation.
- The paper is on arXiv (2605.26460).
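A mask-based evaluation of the kind the dataset supports could be scored as in the sketch below. The summary does not say which metrics the authors report, so intersection-over-union (IoU) is an assumption here, and `leakage` is a hypothetical helper that measures how much of a predicted mask lands on a different concept's ground-truth mask. The masks are toy data.

```python
import numpy as np

def iou(pred, target):
    """Intersection-over-union between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, target).sum() / union)

def leakage(pred, other_target):
    """Fraction of predicted positives that fall on another concept's mask."""
    pred = np.asarray(pred, dtype=bool)
    other_target = np.asarray(other_target, dtype=bool)
    total = pred.sum()
    if total == 0:
        return 0.0  # nothing predicted, so nothing leaked
    return float(np.logical_and(pred, other_target).sum() / total)

# Toy ground truth: concept A covers tokens 0-2, look-alike concept B covers 3-5.
gt_a = [1, 1, 1, 0, 0, 0]
gt_b = [0, 0, 0, 1, 1, 1]

pred_leaky = [1, 1, 0, 1, 1, 0]  # prediction for A that spills onto B
pred_clean = [1, 1, 1, 0, 0, 0]  # prediction for A confined to A

iou_leaky, leak_leaky = iou(pred_leaky, gt_a), leakage(pred_leaky, gt_b)
iou_clean, leak_clean = iou(pred_clean, gt_a), leakage(pred_clean, gt_b)

print(f"leaky: IoU={iou_leaky:.2f}, leakage={leak_leaky:.2f}")
print(f"clean: IoU={iou_clean:.2f}, leakage={leak_clean:.2f}")
```

Reporting a leakage score next to IoU separates two failure modes that IoU alone blends together: missing part of the target object, and spilling onto a visually similar one.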
Entities
Institutions
- arXiv