Adaptive Patch Transformers Speed Up Vision Models
Researchers propose Adaptive Patch Transformers (APT) to accelerate Vision Transformers (ViTs) by using multiple patch sizes within a single image. APT allocates larger patches to homogeneous image regions and smaller patches to complex ones, reducing the total number of input tokens. The method increases throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and it can be applied to previously fine-tuned ViTs, converging in as little as one epoch. APT also accelerates high-resolution dense visual tasks such as visual question answering, object detection, and semantic segmentation by up to 30%.
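The exact token-allocation rule APT uses is not reproduced here, but a minimal sketch of the underlying idea is shown below, assuming a simple per-block variance heuristic: keep a large patch where pixels are nearly uniform, subdivide into small patches where they are not. The function name `adaptive_patchify`, the variance threshold, and the specific patch sizes are illustrative assumptions, not APT's actual implementation.

```python
# Illustrative sketch only: not the APT paper's published routine.
# Heuristic: one big token per flat region, several small tokens per
# detailed region, so the total token count drops below a uniform grid's.
import numpy as np

def adaptive_patchify(img: np.ndarray, base: int = 32, small: int = 16,
                      var_threshold: float = 1e-3):
    """Split an HxWxC image (dims assumed multiples of `base`) into
    mixed-size patches. Returns a list of (y, x, size) descriptors."""
    H, W, _ = img.shape
    patches = []
    for y in range(0, H, base):
        for x in range(0, W, base):
            block = img[y:y + base, x:x + base]
            if block.var() < var_threshold:      # homogeneous: one big token
                patches.append((y, x, base))
            else:                                # complex: several small tokens
                for dy in range(0, base, small):
                    for dx in range(0, base, small):
                        patches.append((y + dy, x + dx, small))
    return patches

# A uniform 16x16 grid on a 224x224 image yields 196 tokens; here the
# flat top half collapses into 32x32 patches, cutting the token count.
img = np.random.rand(224, 224, 3).astype(np.float32)
img[:128] = 0.5                                  # make the top half homogeneous
tokens = adaptive_patchify(img)
print(len(tokens), "tokens vs", (224 // 16) ** 2, "uniform tokens")
```

In a full model, each variable-size patch would then be embedded into the shared token dimension before entering the transformer, so the attention layers simply see a shorter sequence.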
Key facts
- APT uses multiple patch sizes within the same image.
- Larger patches are allocated to homogeneous areas, smaller patches to complex ones.
- APT increases throughput by 40% on ViT-L and 50% on ViT-H.
- Can be applied to previously fine-tuned ViTs, converging in as little as one epoch.
- Reduces training and inference time by up to 30% in dense visual tasks.
- Tasks include visual QA, object detection, and semantic segmentation.
- APT addresses the inefficiency of uniform patch sizes in standard ViTs, which spend equal compute on every image region regardless of content.
- The method maintains downstream performance.