Adaptive Patch Transformers Speed Up Vision Models
Researchers propose Adaptive Patch Transformers (APT) to accelerate Vision Transformers (ViTs) by using multiple patch sizes within a single image. APT allocates larger patches to homogeneous image regions and smaller patches to complex ones, reducing the total number of input tokens. The method increases throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and it can be applied to previously fine-tuned ViTs, converging in as little as one epoch. APT also accelerates high-resolution dense visual tasks such as visual question answering, object detection, and semantic segmentation by up to 30%.
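The exact token-allocation rule APT uses is not reproduced here, but a minimal sketch of the underlying idea is shown below, assuming a simple per-block variance heuristic: keep a large patch where pixels are nearly uniform, subdivide into small patches where they are not. The function name `adaptive_patchify`, the variance threshold, and the specific patch sizes are illustrative assumptions, not APT's actual implementation.

```python
# Illustrative sketch only: not the APT paper's published routine.
# Heuristic: one big token per flat region, several small tokens per
# detailed region, so the total token count drops below a uniform grid's.
import numpy as np

def adaptive_patchify(img: np.ndarray, base: int = 32, small: int = 16,
                      var_threshold: float = 1e-3):
    """Split an HxWxC image (dims assumed multiples of `base`) into
    mixed-size patches. Returns a list of (y, x, size) descriptors."""
    H, W, _ = img.shape
    patches = []
    for y in range(0, H, base):
        for x in range(0, W, base):
            block = img[y:y + base, x:x + base]
            if block.var() < var_threshold:      # homogeneous: one big token
                patches.append((y, x, base))
            else:                                # complex: several small tokens
                for dy in range(0, base, small):
                    for dx in range(0, base, small):
                        patches.append((y + dy, x + dx, small))
    return patches

# A uniform 16x16 grid on a 224x224 image yields 196 tokens; here the
# flat top half collapses into 32x32 patches, cutting the token count.
img = np.random.rand(224, 224, 3).astype(np.float32)
img[:128] = 0.5                                  # make the top half homogeneous
tokens = adaptive_patchify(img)
print(len(tokens), "tokens vs", (224 // 16) ** 2, "uniform tokens")
```

In a full model, each variable-size patch would then be embedded into the shared token dimension before entering the transformer, so the attention layers simply see a shorter sequence.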
Key facts
- APT uses multiple patch sizes within the same image.
- Larger patches are allocated to homogeneous areas, smaller patches to complex ones.
- APT increases throughput by 40% on ViT-L and 50% on ViT-H.
- Can be applied to previously fine-tuned ViTs, converging in as little as one epoch.
- Reduces training and inference time by up to 30% in dense visual tasks.
- Tasks include visual QA, object detection, and semantic segmentation.
- APT addresses the inefficiency of uniform patch sizes in standard ViTs, which spend equal compute on every image region regardless of content.
- The method maintains downstream performance.