Vision-Aligned Text Supervision Enhances Portrait Generation

ai-technology · 2026-05-22

Human portrait generation with text-to-image diffusion models faces a trilemma among text-image alignment, photorealism, and aesthetics. To address it, researchers have introduced a feature supervision paradigm for multimodal diffusion transformers (MM-DiT). The method uses a lightweight cross-modal alignment mechanism to derive multi-granularity, vision-aligned text representations from SigLIP 2, and uses them to supervise the image branch during training. Because the supervision applies only at training time, it adds no inference cost. According to the authors, the strategy mitigates the overfitting and degradation of pre-trained priors typically associated with supervised fine-tuning (SFT). The work is described in a paper on arXiv (2605.20640).
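As a rough illustration of training-time feature supervision, the sketch below adds an auxiliary alignment term to a denoising loss. It is not the paper's implementation: the function names, the mean pooling, the cosine-distance objective, and the `weight` value are all assumptions made for illustration, and the image-branch features are assumed to already share the dimensionality of the text targets (a real model would likely need a projection layer). The feature vectors are plain Python lists standing in for tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon avoids division by zero


def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def alignment_loss(image_tokens, text_targets):
    """Hypothetical auxiliary loss: mean cosine distance between pooled
    image-branch features and each text target.

    `text_targets` holds one vector per granularity (for example, a
    caption-level vector and a phrase-level vector); this grouping is an
    assumption, not something the source specifies.
    """
    pooled = mean_pool(image_tokens)
    distances = [1.0 - cosine_similarity(pooled, t) for t in text_targets]
    return sum(distances) / len(distances)


def training_loss(denoise_loss, image_tokens, text_targets, weight=0.5):
    """Denoising loss plus the weighted alignment term (weight is assumed)."""
    return denoise_loss + weight * alignment_loss(image_tokens, text_targets)
```

In this sketch the alignment term appears only in `training_loss`, so nothing extra is computed when sampling, which mirrors the claim that supervision during training adds no inference cost.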

Key facts

Text-to-image diffusion models face a trilemma in portrait generation: alignment, photorealism, and aesthetics.
Supervised fine-tuning (SFT) can improve photorealism but tends to overfit and degrade alignment or aesthetics.
The proposed method uses a lightweight cross-modal alignment mechanism to derive multi-granularity, vision-aligned text representations from SigLIP 2.
Supervision is applied to the image branch of MM-DiT during training.
The method adds no inference overhead.
The method preserves the base model's generalization.
The paper is available on arXiv (2605.20640).
The approach is designed for multimodal diffusion transformers (MM-DiT).
