TextTeacher: Language Model Boosts Vision Accuracy by 2.7 Points

ai-technology · 2026-05-23

A new method called TextTeacher uses language model embeddings to improve image classification while leaving the inference-time model unchanged. The approach, introduced in a paper on arXiv (2605.22098), adds a lightweight auxiliary objective during training that injects semantic anchors from a frozen, pre-trained text encoder. On ImageNet with standard ViT backbones, accuracy improves by up to 2.7 percentage points, and transfer gains are consistent, averaging +1.0 percentage point. Under the same compute budget, TextTeacher outperforms vision knowledge distillation.

Key facts

TextTeacher is a new auxiliary objective for image classification
It uses a pre-trained, frozen text encoder and a lightweight projection
Semantic anchors are produced from image captions
Inference-time model remains unchanged
On ImageNet with ViT, accuracy improves by up to +2.7 percentage points
Average transfer gain is +1.0 percentage point
Outperforms vision knowledge distillation under the same compute budget
Paper published on arXiv with ID 2605.22098
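
To make the idea concrete, here is a minimal sketch of what a training-only auxiliary objective of this kind could look like. It is not the paper's implementation: the cosine-distance form of the auxiliary term, the weight aux_weight, and all function names are assumptions chosen for illustration. The sketch takes the text anchor as a precomputed vector (standing in for the frozen text encoder's embedding of a caption) and uses a plain linear map for the lightweight projection.

```python
import math


def cross_entropy(logits, label):
    """Standard softmax cross-entropy for one example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def project(features, weights):
    """Hypothetical lightweight projection: a linear map into the text space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def training_loss(logits, label, image_features, projection, text_anchor,
                  aux_weight=0.5):
    """Classification loss plus an assumed alignment term.

    text_anchor stands in for the frozen text encoder's caption embedding;
    it is treated as a constant and is never updated.
    """
    ce = cross_entropy(logits, label)
    aligned = project(image_features, projection)
    aux = 1.0 - cosine_similarity(aligned, text_anchor)  # assumed form
    return ce + aux_weight * aux


def predict(logits):
    """Inference uses only the classifier logits: no text encoder, no projection."""
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    features = [1.0, 0.0]
    print(training_loss(logits, 0, features, identity, [2.0, 0.0]))
    print(training_loss(logits, 0, features, identity, [0.0, 1.0]))
    print(predict(logits))
```

In this sketch the auxiliary term vanishes when the projected image feature points in the same direction as the text anchor and grows as the two diverge. Because predict reads only the logits, the projection and the text anchor can be discarded after training, which is the property the article describes as an unchanged inference-time model.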
