ARTFEED — Contemporary Art Intelligence

Vision Transformers Align with Human Attention at No Cost

ai-technology · 2026-04-24

Researchers from an undisclosed institution have fine-tuned Google's ViT-B/16 Vision Transformer on human saliency fixation maps to bring its attention into closer alignment with human vision. The study, published on arXiv, reports that fine-tuning on human attention data induces three hallmark human-like biases: a shift away from the model's human-atypical preference for large objects toward the small-object preference seen in human viewers, an amplified preference for animate subjects, and fewer extremes of attention entropy. A Bayesian parity analysis confirms that this alignment does not degrade classification performance on ImageNet. The work narrows the cognitive gap between ViTs and human visual processing, suggesting that human-aligned attention, and the interpretability it affords, can be gained without sacrificing accuracy.
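A minimal sketch of what such alignment fine-tuning could look like in PyTorch, assuming a model that exposes its last-layer CLS-token attention over image patches. The paper's actual objective, loss weighting, and attention-extraction mechanism are not specified here; the KL-divergence loss, the `lam` weight, and the two-output `model` interface are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def saliency_alignment_loss(cls_attn, fixation_map, eps=1e-8):
    """KL divergence between the ViT's CLS-token attention over patches
    and a human fixation map pooled down to the same patch grid.

    cls_attn:     (B, P) attention weights; P = 196 for ViT-B/16 at 224 px.
    fixation_map: (B, H, W) human saliency, e.g. 224 x 224.
    """
    _, P = cls_attn.shape
    g = int(P ** 0.5)  # 14 x 14 patch grid for ViT-B/16
    # Pool the fixation map to the patch grid and normalize both sides
    # into probability distributions over patches.
    fix = F.adaptive_avg_pool2d(fixation_map.unsqueeze(1), g).flatten(1)
    fix = fix / (fix.sum(dim=1, keepdim=True) + eps)
    attn = cls_attn / (cls_attn.sum(dim=1, keepdim=True) + eps)
    return F.kl_div((attn + eps).log(), fix, reduction="batchmean")

def attention_entropy(cls_attn, eps=1e-8):
    """Shannon entropy of each image's attention distribution; the paper
    reports fewer extreme values of this quantity after fine-tuning."""
    p = cls_attn / (cls_attn.sum(dim=1, keepdim=True) + eps)
    return -(p * (p + eps).log()).sum(dim=1)

def training_step(model, images, labels, fixations, lam=0.5):
    """One joint fine-tuning step: standard classification loss plus a
    weighted attention-alignment term (both names and the weighting are
    hypothetical, not the paper's setup)."""
    logits, cls_attn = model(images)
    loss = F.cross_entropy(logits, labels)
    loss = loss + lam * saliency_alignment_loss(cls_attn, fixations)
    return loss, attention_entropy(cls_attn).mean()
```

Keeping the cross-entropy term in the objective is what lets the attention distribution move toward human fixations without the classifier drifting, which is the property the parity analysis then checks.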

Key facts

  • ViT-B/16 fine-tuned on human saliency fixation maps
  • Five saliency metrics improved significantly
  • Three human-like biases induced: small-object preference, animacy preference, tempered attention-entropy extremes
  • Bayesian parity analysis shows no cost to classification performance (illustrated after this list)
  • ImageNet used for performance evaluation
  • arXiv paper ID: 2604.20027
  • Published in April 2026
  • Google's ViT-B/16 architecture used
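The parity claim can be pictured with a simple Bayesian accuracy comparison. The sketch below is a generic stand-in, not the paper's analysis: it places Beta(1, 1) priors on both models' ImageNet accuracies and estimates the posterior probability that the two differ by less than a practical-equivalence margin. The counts, the margin, and the function name are all hypothetical.

```python
import numpy as np

def bayesian_parity(correct_a, n_a, correct_b, n_b,
                    rope=0.005, samples=100_000, seed=0):
    """Posterior probability that two models' accuracies lie within
    `rope` (region of practical equivalence) of each other, under
    independent Beta(1, 1) priors. A generic sketch, not the paper's
    exact parity test."""
    rng = np.random.default_rng(seed)
    acc_a = rng.beta(1 + correct_a, 1 + n_a - correct_a, samples)
    acc_b = rng.beta(1 + correct_b, 1 + n_b - correct_b, samples)
    return float(np.mean(np.abs(acc_a - acc_b) < rope))

# Hypothetical counts on a 50,000-image validation set:
print(bayesian_parity(40_500, 50_000, 40_450, 50_000))
# Prints a high probability, i.e. evidence of practical parity.
```

A high posterior probability under this kind of test is the sense in which "no cost" is a statistical claim about equivalence rather than merely a failure to detect a difference.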

Entities

Institutions

  • Google
  • arXiv

Sources