ARTFEED — Contemporary Art Intelligence

Vision Transformers Align with Human Attention at No Cost

ai-technology · 2026-04-24

Researchers from an undisclosed institution have fine-tuned Google's ViT-B/16 Vision Transformer on human saliency fixation maps to bring its attention into closer alignment with human vision. The study, published on arXiv, reports that fine-tuning on human attention data induces three hallmark human-like biases: a shift away from the model's human-atypical preference for large objects toward the small-object preference seen in human viewers, an amplified preference for animate subjects, and fewer extremes of attention entropy. A Bayesian parity analysis confirms that this alignment does not degrade classification performance on ImageNet. The work narrows the cognitive gap between ViTs and human visual processing, suggesting that human-aligned attention, and the interpretability it affords, can be gained without sacrificing accuracy.
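A minimal sketch of what such alignment fine-tuning could look like in PyTorch, assuming a model that exposes its last-layer CLS-token attention over image patches. The paper's actual objective, loss weighting, and attention-extraction mechanism are not specified here; the KL-divergence loss, the `lam` weight, and the two-output `model` interface are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def saliency_alignment_loss(cls_attn, fixation_map, eps=1e-8):
    """KL divergence between the ViT's CLS-token attention over patches
    and a human fixation map pooled down to the same patch grid.

    cls_attn:     (B, P) attention weights; P = 196 for ViT-B/16 at 224 px.
    fixation_map: (B, H, W) human saliency, e.g. 224 x 224.
    """
    _, P = cls_attn.shape
    g = int(P ** 0.5)  # 14 x 14 patch grid for ViT-B/16
    # Pool the fixation map to the patch grid and normalize both sides
    # into probability distributions over patches.
    fix = F.adaptive_avg_pool2d(fixation_map.unsqueeze(1), g).flatten(1)
    fix = fix / (fix.sum(dim=1, keepdim=True) + eps)
    attn = cls_attn / (cls_attn.sum(dim=1, keepdim=True) + eps)
    return F.kl_div((attn + eps).log(), fix, reduction="batchmean")

def attention_entropy(cls_attn, eps=1e-8):
    """Shannon entropy of each image's attention distribution; the paper
    reports fewer extreme values of this quantity after fine-tuning."""
    p = cls_attn / (cls_attn.sum(dim=1, keepdim=True) + eps)
    return -(p * (p + eps).log()).sum(dim=1)

def training_step(model, images, labels, fixations, lam=0.5):
    """One joint fine-tuning step: standard classification loss plus a
    weighted attention-alignment term (both names and the weighting are
    hypothetical, not the paper's setup)."""
    logits, cls_attn = model(images)
    loss = F.cross_entropy(logits, labels)
    loss = loss + lam * saliency_alignment_loss(cls_attn, fixations)
    return loss, attention_entropy(cls_attn).mean()
```

Keeping the cross-entropy term in the objective is what lets the attention distribution move toward human fixations without the classifier drifting, which is the property the parity analysis then checks.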

Key facts

  • ViT-B/16 fine-tuned on human saliency fixation maps
  • Five saliency metrics improved significantly
  • Three human-like biases induced: small-object preference, animacy preference, tempered attention-entropy extremes
  • Bayesian parity analysis shows no cost to classification performance (illustrated after this list)
  • ImageNet used for performance evaluation
  • arXiv paper ID: 2604.20027
  • Published in April 2026
  • Google's ViT-B/16 architecture used
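The parity claim can be pictured with a simple Bayesian accuracy comparison. The sketch below is a generic stand-in, not the paper's analysis: it places Beta(1, 1) priors on both models' ImageNet accuracies and estimates the posterior probability that the two differ by less than a practical-equivalence margin. The counts, the margin, and the function name are all hypothetical.

```python
import numpy as np

def bayesian_parity(correct_a, n_a, correct_b, n_b,
                    rope=0.005, samples=100_000, seed=0):
    """Posterior probability that two models' accuracies lie within
    `rope` (region of practical equivalence) of each other, under
    independent Beta(1, 1) priors. A generic sketch, not the paper's
    exact parity test."""
    rng = np.random.default_rng(seed)
    acc_a = rng.beta(1 + correct_a, 1 + n_a - correct_a, samples)
    acc_b = rng.beta(1 + correct_b, 1 + n_b - correct_b, samples)
    return float(np.mean(np.abs(acc_a - acc_b) < rope))

# Hypothetical counts on a 50,000-image validation set:
print(bayesian_parity(40_500, 50_000, 40_450, 50_000))
# Prints a high probability, i.e. evidence of practical parity.
```

A high posterior probability under this kind of test is the sense in which "no cost" is a statistical claim about equivalence rather than merely a failure to detect a difference.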

Entities

Institutions

  • Google
  • arXiv

Sources