ARTFEED — Contemporary Art Intelligence

TextTeacher: Language Model Boosts Vision Accuracy by Up to 2.7 Points

ai-technology · 2026-05-23

A new method called TextTeacher uses language model embeddings to improve image classification without changing the model used at inference. The approach, introduced in a paper on arXiv (2605.22098), adds a lightweight auxiliary objective during training that injects semantic anchors from a frozen text encoder. On ImageNet with standard ViT backbones, the paper reports accuracy improvements of up to 2.7 percentage points, and transfer gains average +1.0 percentage point. Under the same compute budget, TextTeacher outperforms vision knowledge distillation.

Key facts

  • TextTeacher is a new auxiliary objective for image classification
  • It uses a pre-trained, frozen text encoder and a lightweight projection
  • Semantic anchors are produced from image captions
  • Inference-time model remains unchanged
  • On ImageNet with ViT, accuracy improves by up to +2.7 percentage points
  • Average transfer gain is +1.0 percentage point
  • Outperforms vision knowledge distillation under the same compute budget
  • Paper published on arXiv with ID 2605.22098
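
To make the training-only nature of the objective concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the cosine-alignment form of the auxiliary term, the weighting factor aux_weight, and every function name are hypothetical, and text_anchor simply stands in for a caption embedding from the frozen text encoder.

```python
import math

def softmax_cross_entropy(logits, label):
    """Standard classification loss for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def project(features, weight):
    """Lightweight linear projection; weight has one row per output dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weight]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def training_loss(logits, label, image_features, proj_weight, text_anchor,
                  aux_weight=0.5):
    """Classification loss plus an alignment term, used during training only.

    text_anchor is treated as a constant because the text encoder is frozen.
    The cosine form and aux_weight are assumptions, not taken from the paper.
    """
    cls_loss = softmax_cross_entropy(logits, label)
    projected = project(image_features, proj_weight)
    align_loss = 1.0 - cosine_similarity(projected, text_anchor)
    return cls_loss + aux_weight * align_loss

def predict(logits):
    """Inference reads only the classifier logits; projection and anchor are unused."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy example: 3 classes, 4-dim image features projected to a 2-dim text space.
logits = [2.0, 0.5, -1.0]
image_features = [1.0, 0.0, 2.0, 1.0]
proj_weight = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]]

aligned = training_loss(logits, 0, image_features, proj_weight, [1.0, 2.0])
misaligned = training_loss(logits, 0, image_features, proj_weight, [-1.0, -2.0])
print(aligned, misaligned, predict(logits))
```

In this toy setup the loss is lower when the projected image features point in the same direction as the text anchor, while predict never touches the projection or the anchor, which mirrors the claim that the inference-time model is unchanged.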

Entities

Institutions

  • arXiv