ViTC-UNet: Hybrid Model for Medical Image Segmentation

ai-technology · 2026-05-20

Researchers have introduced ViTC-UNet, a hybrid architecture that combines Vision Transformers (ViTs) with a UNet for domain-adaptive semantic segmentation in biomedical imaging. ViTs show performance gaps on sparse, fine-structured, and low signal-to-noise targets; ViTC-UNet addresses these by conditioning a UNet on frozen pre-trained ViT representations via learnable tokens and a two-way attention decoder. The approach integrates the global visual priors of ViTs with the local inductive bias and high-resolution decoding of UNets, without fine-tuning the ViT end to end. ViTC-UNet is reported to outperform baselines on semantic segmentation across MRI and CT modalities. The paper is available on arXiv (2605.16393).
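To make the conditioning idea concrete, here is a minimal NumPy sketch of how learnable tokens and two-way attention could pass information from frozen ViT features into a UNet feature map. It is an illustration under stated assumptions, not the paper's implementation: the function names (cross_attention, two_way_condition), the tensor shapes, the single attention head, the absence of learned projections, and the order of the three attention steps are all hypothetical choices made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (a simplification; a real model would use learned Q/K/V weights).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ keys_values   # (n_queries, d)

def two_way_condition(unet_feats, vit_feats, tokens):
    # unet_feats: (H, W, d) UNet feature map (trainable branch)
    # vit_feats:  (N, d)    frozen ViT patch embeddings (never updated)
    # tokens:     (T, d)    learnable tokens
    h, w, d = unet_feats.shape
    flat = unet_feats.reshape(h * w, d)
    # Assumed step 1: tokens gather global context from the frozen ViT.
    tokens = tokens + cross_attention(tokens, vit_feats)
    # Assumed step 2 (tokens -> image): tokens read the UNet features.
    tokens = tokens + cross_attention(tokens, flat)
    # Assumed step 3 (image -> tokens): each UNet position reads the tokens.
    flat = flat + cross_attention(flat, tokens)
    return flat.reshape(h, w, d), tokens

rng = np.random.default_rng(0)
unet_feats = rng.standard_normal((8, 8, 16))   # toy 8x8 feature map
vit_feats = rng.standard_normal((196, 16))     # toy 14x14 ViT patch grid
tokens = rng.standard_normal((4, 16))          # four toy tokens

conditioned, updated_tokens = two_way_condition(unet_feats, vit_feats, tokens)
print(conditioned.shape, updated_tokens.shape)
```

In this sketch the ViT features are only ever read as keys and values, which mirrors the frozen-backbone idea: gradients in a trained version would flow to the tokens and the UNet, not to the ViT. The conditioned map keeps the UNet's spatial resolution, so it could feed the usual high-resolution decoder path.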

Key facts

ViTC-UNet conditions a UNet on frozen pre-trained ViT representations
Uses learnable tokens and a two-way attention decoder
Combines ViT global priors with UNet local inductive bias
Avoids end-to-end ViT fine-tuning in cross-domain settings
Outperforms baselines on MRI and CT semantic segmentation
Addresses performance gap for sparse, fine-structured targets
Published on arXiv with ID 2605.16393
Targets biomedical image analysis
