AnchorDiff: Training-Free Concept Grounding for MM-DiTs

ai-technology · 2026-05-27

AnchorDiff is a training-free concept grounding method for Multi-Modal Diffusion Transformers (MM-DiTs). It targets concept leakage, which occurs when attention-based grounding techniques produce overlapping activations for visually similar concepts. The method decouples semantic localization from structural refinement: it selects a high-confidence anchor from the concept-to-image attention maps, then propagates that anchor as a one-hot seed over a hybrid graph built from image-to-image self-attention. The graph uses output-space similarity to propagate thoroughly within an object, and a row-wise attention gate to suppress cross-object connections. The researchers also introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and distinct masks for precise evaluation. The paper is available on arXiv as 2605.26460.

Key facts

AnchorDiff is a training-free grounding method for MM-DiTs.
It addresses concept leakage in attention-based methods.
The method selects a high-confidence anchor from concept-to-image attention maps.
It propagates the anchor as a one-hot seed over a hybrid graph built from image-to-image self-attention.
The graph uses output-space similarity for within-object propagation.
A row-wise attention gate suppresses cross-object connections.
The Multi-Concept Confusion Dataset contains images with multiple visually similar concepts and masks.
The paper is on arXiv (2605.26460).
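
The anchor-then-propagate idea described above can be illustrated with a small toy sketch. This is not the paper's implementation. The function name `ground_concept`, the row-mean gating rule, the random-walk-with-restart update, and the parameters `alpha`, `steps`, and `tau` are all assumptions chosen for illustration; the summary specifies only that an anchor is picked from concept-to-image attention and spread over a graph that combines output-space similarity with a row-wise attention gate.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def ground_concept(concept_attn, self_attn, feats, alpha=0.8, steps=20, tau=0.3):
    """Toy anchor-and-propagate grounding over n image tokens.

    concept_attn: n concept-to-image attention scores for one concept.
    self_attn:    n x n image-to-image self-attention (one row per token).
    feats:        n output-space feature vectors (one per token).
    """
    n = len(concept_attn)

    # 1. Semantic localization: take the single most confident token as anchor.
    anchor = max(range(n), key=lambda i: concept_attn[i])

    # 2. Hybrid graph: edge weight = output-space similarity, kept only where
    #    self-attention passes a row-wise gate (ASSUMPTION: gate = row mean).
    graph = []
    for i in range(n):
        row_mean = sum(self_attn[i]) / n
        row = [
            max(cosine(feats[i], feats[j]), 0.0) if self_attn[i][j] >= row_mean else 0.0
            for j in range(n)
        ]
        total = sum(row)
        graph.append([w / total if total > 0 else 0.0 for w in row])

    # 3. Structural refinement: spread a one-hot seed from the anchor
    #    (ASSUMPTION: random walk with restart, restart weight 1 - alpha).
    seed = [1.0 if i == anchor else 0.0 for i in range(n)]
    score = seed[:]
    for _ in range(steps):
        score = [
            alpha * sum(graph[i][j] * score[j] for j in range(n)) + (1 - alpha) * seed[i]
            for i in range(n)
        ]

    # Binarize relative to the peak score (ASSUMPTION: threshold tau).
    peak = max(score)
    mask = [s >= tau * peak for s in score]
    return anchor, score, mask


# Toy scene: tokens 0-2 belong to object A, tokens 3-5 to a similar object B.
concept_attn = [0.30, 0.90, 0.35, 0.28, 0.32, 0.25]  # leaks onto object B
self_attn = [
    [0.30, 0.30, 0.30, 0.04, 0.03, 0.03],
    [0.30, 0.30, 0.30, 0.03, 0.04, 0.03],
    [0.30, 0.30, 0.30, 0.03, 0.03, 0.04],
    [0.04, 0.03, 0.03, 0.30, 0.30, 0.30],
    [0.03, 0.04, 0.03, 0.30, 0.30, 0.30],
    [0.03, 0.03, 0.04, 0.30, 0.30, 0.30],
]
feats = [
    [1.00, 0.20], [0.90, 0.30], [1.00, 0.25],  # object A
    [0.80, 0.60], [0.70, 0.70], [0.75, 0.65],  # object B, similar features
]

anchor, score, mask = ground_concept(concept_attn, self_attn, feats)
print("anchor:", anchor)
print("mask:  ", mask)
```

In this toy scene the raw concept attention puts noticeable weight on object B, and the output features of the two objects are similar enough that similarity alone would connect them. The row-wise gate removes the cross-object edges, so the seed placed at the anchor stays within object A.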
