ARTFEED — Contemporary Art Intelligence

Adaptive Patch Transformers Speed Up Vision Models

ai-technology · 2026-04-25

Researchers propose Adaptive Patch Transformers (APT), a method for accelerating Vision Transformers (ViTs) by mixing multiple patch sizes within a single image. APT assigns larger patches to homogeneous regions and smaller patches to visually complex ones, reducing the total number of input tokens. The method yields a 40% throughput increase on ViT-L and 50% on ViT-H while maintaining downstream performance. It can also be applied to already fine-tuned ViTs, converging in as little as one epoch, and it cuts training and inference time on high-resolution dense visual tasks, including visual QA, object detection, and semantic segmentation, by up to 30%.

Key facts

  • APT uses multiple patch sizes within the same image.
  • Larger patches are allocated to homogeneous areas, smaller patches to complex ones.
  • APT increases throughput by 40% on ViT-L and 50% on ViT-H.
  • Can be applied to previously fine-tuned ViTs, converging in one epoch.
  • Reduces training and inference time by up to 30% in dense visual tasks.
  • Tasks include visual QA, object detection, and semantic segmentation.
  • APT addresses the inefficiency of standard ViTs, which tokenize every image with uniformly sized patches regardless of content.
  • The method maintains downstream performance.
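The core idea behind the facts above, content-dependent patch sizing, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual method: it uses a simple pixel-variance threshold (a hypothetical criterion; APT's real allocation rule is not described here) to decide whether a coarse block stays as one large patch or is subdivided into smaller ones, then compares the resulting token count against a uniform-patch baseline.

```python
import numpy as np

def adaptive_patches(img, large=32, small=16, var_thresh=0.01):
    """Split an image into mixed-size patches: keep one large patch where
    pixel variance is low (homogeneous region), subdivide into small
    patches where variance is high (complex region).

    Returns a list of (y, x, size) tuples, one per token.
    Note: variance thresholding is an illustrative stand-in for whatever
    allocation criterion APT actually uses."""
    H, W = img.shape[:2]
    patches = []
    for y in range(0, H, large):
        for x in range(0, W, large):
            block = img[y:y + large, x:x + large]
            if block.var() < var_thresh:
                patches.append((y, x, large))              # one token
            else:
                for dy in range(0, large, small):          # subdivide:
                    for dx in range(0, large, small):      # four small tokens
                        patches.append((y + dy, x + dx, small))
    return patches

# Toy image: flat (homogeneous) background with one textured quadrant.
rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[:32, :32] = rng.random((32, 32))       # "complex" region

tokens = adaptive_patches(img)
uniform = (64 // 16) ** 2                  # baseline: uniform 16x16 patches
print(len(tokens), "adaptive tokens vs", uniform, "uniform tokens")
# → 7 adaptive tokens vs 16 uniform tokens
```

Fewer tokens entering the transformer is what drives the reported throughput gains: self-attention cost grows with token count, so collapsing homogeneous regions into single large patches directly reduces compute.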
