JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

ai-technology · 2026-05-27

Researchers have introduced JetViT, a family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while delivering higher inference efficiency on high-resolution images. The core contribution is Post-Training Attention Search, an acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying redundant full-attention blocks and replacing them with linear-attention or window-attention blocks. The framework inherits the MLP and attention weights of the base model and explores the design space in three steps: optimizing the linear-attention block design, finding the best combination of linear-attention and window-attention blocks, and identifying and retaining the critical full-attention blocks. The paper is available on arXiv under ID 2605.26636.
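To see why replacing full-attention blocks matters most at high resolution, a rough cost model helps. The sketch below is not taken from the paper: the function names, the 16-pixel patch size, the 196-token window, and the per-head dimension of 64 are all illustrative assumptions. It uses the standard asymptotic costs of the token-mixing step only (full attention grows with the square of the token count, while window and linear attention grow linearly).

```python
def num_tokens(height, width, patch=16):
    """Number of patch tokens for an image (assumed 16x16 patches, no class token)."""
    return (height // patch) * (width // patch)


def attention_cost(n, d, kind, window=196):
    """Approximate multiply-accumulates per head for the token-mixing step.

    n: number of tokens, d: per-head dimension (assumed values, not from the paper).
    """
    if kind == "full":
        # Q K^T costs n*n*d, then scores @ V costs another n*n*d.
        return 2 * n * n * d
    if kind == "window":
        # Each token attends only to the tokens in its local window.
        return 2 * n * min(window, n) * d
    if kind == "linear":
        # K^T V costs n*d*d, then Q @ (K^T V) costs another n*d*d.
        return 2 * n * d * d
    raise ValueError(f"unknown attention kind: {kind}")


if __name__ == "__main__":
    d = 64  # assumed per-head dimension
    for side in (224, 512, 1024):
        n = num_tokens(side, side)
        full = attention_cost(n, d, "full")
        print(
            f"{side}x{side}: {n} tokens, "
            f"full/window = {full / attention_cost(n, d, 'window'):.1f}, "
            f"full/linear = {full / attention_cost(n, d, 'linear'):.1f}"
        )
```

Under these assumptions the gap between full attention and the cheaper alternatives widens as the image grows, which is consistent with the article's focus on high-resolution inputs. Real speedups also depend on the MLP layers, memory traffic, and kernel implementations, none of which this model covers.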

Key facts

JetViT is a family of hybrid-architecture Vision Transformer models.
It matches the accuracy of state-of-the-art full-attention vision foundation models.
It achieves higher inference efficiency on high-resolution images.
The core approach is Post-Training Attention Search.
Post-Training Attention Search converts full-attention ViTs to hybrid-attention variants.
It replaces redundant full-attention blocks with linear or window-attention blocks.
The framework inherits MLP and attention weights from the base model.
The search has three key steps: optimizing the linear-attention block design, finding the best combination of linear and window attention, and identifying critical full-attention blocks.
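The article does not describe how the search itself is carried out, so the following is only one simple way the last two steps could be framed, not the procedure from the paper. It assumes hypothetical per-block scores (`linear_drop`, `window_drop`) giving the measured accuracy drop when a block's full attention is swapped for each alternative, and a hypothetical budget `keep_full` for how many blocks may stay full attention. The function name `plan_hybrid_layout` is likewise invented for illustration.

```python
def plan_hybrid_layout(linear_drop, window_drop, keep_full):
    """Return a per-block list of 'full', 'linear', or 'window'.

    linear_drop[i], window_drop[i]: hypothetical accuracy drop when block i's
    full attention is replaced by that alternative (lower is better).
    keep_full: number of blocks allowed to remain full attention.
    """
    if len(linear_drop) != len(window_drop):
        raise ValueError("score lists must have the same length")
    n = len(linear_drop)

    # The price of replacing a block is the drop of its best cheap alternative.
    best_drop = [min(l, w) for l, w in zip(linear_drop, window_drop)]

    # Blocks that are most costly to replace are treated as critical.
    ranked = sorted(range(n), key=lambda i: best_drop[i], reverse=True)
    critical = set(ranked[:keep_full])

    layout = []
    for i in range(n):
        if i in critical:
            layout.append("full")
        elif linear_drop[i] <= window_drop[i]:
            layout.append("linear")
        else:
            layout.append("window")
    return layout


if __name__ == "__main__":
    # Made-up scores for a four-block model, purely for demonstration.
    print(plan_hybrid_layout([0.1, 0.5, 2.0, 0.2], [0.3, 0.2, 1.5, 0.4], keep_full=1))
```

In a weight-inheriting setup like the one described, each replaced block would reuse the base model's MLP and attention weights. This sketch only decides the layout and leaves that step out.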
