ViTC-UNet: Hybrid Model for Medical Image Segmentation
Researchers have introduced ViTC-UNet, a hybrid architecture that combines Vision Transformers (ViTs) with a UNet for domain-adaptive semantic segmentation in biomedical imaging. The model targets the performance gap of ViTs on sparse, fine-structured, and low signal-to-noise targets by conditioning a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This design pairs the global visual priors of ViTs with the local inductive bias and high-resolution decoding of UNets, and it avoids end-to-end fine-tuning of the ViT in cross-domain settings. The paper reports that ViTC-UNet outperforms baselines on semantic segmentation across MRI and CT modalities. The paper is available on arXiv (2605.16393).
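The conditioning idea can be illustrated with a small NumPy sketch. This summary does not describe the paper's actual layer layout, so everything below is an assumption made for illustration: the function names, the feature sizes, the single-head attention, and the order of the three attention steps are all hypothetical. The frozen ViT is represented only by a fixed array of patch features that is read but never updated.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_attention(d_query, d_context, d_attn, rng):
    # Hypothetical projection weights: queries and keys share d_attn,
    # values map back to the query width so a residual add is possible.
    return (
        rng.normal(scale=0.1, size=(d_query, d_attn)),
        rng.normal(scale=0.1, size=(d_context, d_attn)),
        rng.normal(scale=0.1, size=(d_context, d_query)),
    )

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head scaled dot-product attention: queries read from context.
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def two_way_block(tokens, vit_feats, unet_feats, params):
    # Assumed ordering, not taken from the paper.
    # 1) Learnable tokens gather global context from the frozen ViT features.
    tokens = tokens + cross_attention(tokens, vit_feats, *params["tok_from_vit"])
    # 2) UNet features read that context from the tokens.
    unet_feats = unet_feats + cross_attention(unet_feats, tokens, *params["unet_from_tok"])
    # 3) Tokens read back from the UNet features (the second direction).
    tokens = tokens + cross_attention(tokens, unet_feats, *params["tok_from_unet"])
    return tokens, unet_feats

rng = np.random.default_rng(0)
n_tokens, n_patches, n_pixels = 4, 49, 64        # illustrative sizes
d_tok, d_vit, d_unet, d_attn = 24, 32, 16, 8     # illustrative widths

vit_feats = rng.normal(size=(n_patches, d_vit))    # stand-in for frozen ViT output
unet_feats = rng.normal(size=(n_pixels, d_unet))   # stand-in for a flattened UNet feature map
tokens = rng.normal(scale=0.02, size=(n_tokens, d_tok))  # learnable in a real model

params = {
    "tok_from_vit": init_attention(d_tok, d_vit, d_attn, rng),
    "unet_from_tok": init_attention(d_unet, d_tok, d_attn, rng),
    "tok_from_unet": init_attention(d_tok, d_unet, d_attn, rng),
}

tokens_out, unet_out = two_way_block(tokens, vit_feats, unet_feats, params)
print(tokens_out.shape, unet_out.shape)
```

In this sketch the small set of tokens is the only path between the ViT features and the UNet features, so in a trained version only the tokens, the attention weights, and the UNet would need gradients while the ViT stays fixed. That is one plausible reading of how conditioning avoids end-to-end fine-tuning, not a statement of the authors' implementation.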
Key facts
- ViTC-UNet conditions a UNet on frozen pre-trained ViT representations
- Uses learnable tokens and a two-way attention decoder
- Combines ViT global priors with UNet local inductive bias
- Avoids end-to-end ViT fine-tuning in cross-domain settings
- Outperforms baselines on MRI and CT semantic segmentation
- Addresses performance gap for sparse, fine-structured targets
- Published on arXiv with ID 2605.16393
- Targets biomedical image analysis
Entities
Institutions
- arXiv