ARTFEED — Contemporary Art Intelligence

ViTC-UNet: Hybrid Model for Medical Image Segmentation

ai-technology · 2026-05-20

Researchers have introduced ViTC-UNet, an architecture that combines Vision Transformers (ViTs) with a UNet for domain-adaptive semantic segmentation in biomedical imaging. ViTs tend to underperform on sparse, fine-structured, and low signal-to-noise targets. ViTC-UNet addresses this gap by conditioning a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. The design pairs the global visual priors of ViTs with the local inductive bias and high-resolution decoding of UNets, and it avoids end-to-end fine-tuning of the ViT. According to the paper, ViTC-UNet outperforms baselines on semantic segmentation of MRI and CT images. The paper is available on arXiv (2605.16393).
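The conditioning step described above can be pictured with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the order of the three attention passes, the residual updates, and all names and sizes (`cross_attention`, `condition_unet`, `n_tokens`, the 16-dimensional features) are assumptions chosen for illustration, and random arrays stand in for real ViT and UNet features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head cross-attention: each query row reads from the context rows.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def condition_unet(unet_features, tokens, vit_features, params):
    # Assumed pass 1: tokens gather global context from the frozen ViT features.
    tokens = tokens + cross_attention(tokens, vit_features, *params["tok_from_vit"])
    # Assumed pass 2: tokens read local detail from the UNet features.
    tokens = tokens + cross_attention(tokens, unet_features, *params["tok_from_unet"])
    # Assumed pass 3: UNet features read back from the tokens (the "other way").
    unet_features = unet_features + cross_attention(
        unet_features, tokens, *params["unet_from_tok"]
    )
    return unet_features, tokens

rng = np.random.default_rng(0)
dim = 16          # hypothetical shared feature width
n_patches = 196   # e.g. a 14 x 14 ViT patch grid
n_tokens = 8      # hypothetical number of learnable tokens
h, w = 32, 32     # hypothetical UNet feature-map size

def make_weights():
    # Three projection matrices (query, key, value), scaled for stability.
    return [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)]

params = {
    "tok_from_vit": make_weights(),
    "tok_from_unet": make_weights(),
    "unet_from_tok": make_weights(),
}

vit_features = rng.standard_normal((n_patches, dim))   # stand-in for frozen ViT output
tokens = rng.standard_normal((n_tokens, dim))          # stand-in for learnable tokens
unet_features = rng.standard_normal((h * w, dim))      # flattened UNet feature map

conditioned, tokens_out = condition_unet(unet_features, tokens, vit_features, params)
print(conditioned.shape, tokens_out.shape)
```

In a trained version of this sketch, only the tokens, the attention projections, and the UNet would be updated, while the ViT features would come from a backbone whose weights stay fixed. That matches the article's point about avoiding end-to-end ViT fine-tuning.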

Key facts

  • ViTC-UNet conditions a UNet on frozen pre-trained ViT representations
  • Uses learnable tokens and a two-way attention decoder
  • Combines ViT global priors with UNet local inductive bias
  • Avoids end-to-end ViT fine-tuning in cross-domain settings
  • Outperforms baselines on MRI and CT semantic segmentation
  • Addresses the ViT performance gap on sparse, fine-structured, and low signal-to-noise targets
  • Published on arXiv with ID 2605.16393
  • Targets biomedical image analysis

Entities

Institutions

  • arXiv

Sources