Latent Imagination Module Improves Text-Only VLM Calibration
Vision-language models (VLMs) exhibit a significant failure mode when used with text-only inputs: without the vision component, accuracy drops notably and calibration degrades severely, even though the text descriptions retain the essential semantic information. Prompted with text alone, the model's behavior also diverges from that of its original language backbone. To address this, the authors introduce the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from text and feeds them into a frozen VLM backbone, with no need for pixel-level image generation. LIM improves accuracy and reduces calibration error across a range of text-only benchmarks, suggesting that latent modality completion can effectively bridge multimodal training and text-only deployment.
Key facts
- Vision-language models (VLMs) suffer accuracy drops and miscalibration on text-only inputs.
- The failure is not solely due to missing semantic information.
- Adding a visual signal through generated images partially restores accuracy and calibration.
- The Latent Imagination Module (LIM) is a lightweight cross-attention module.
- LIM predicts imagined latent embeddings from textual input.
- LIM feeds these embeddings into a frozen VLM backbone without pixel-level image synthesis (see the sketch after this list).
- LIM improves accuracy and reduces calibration error across text-only benchmarks and unseen tasks.
- The study is published on arXiv with ID 2605.12517.
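The facts above outline LIM's mechanism: a lightweight cross-attention module maps text-token embeddings to imagined latent embeddings that stand in for vision-encoder outputs inside a frozen backbone. The sketch below shows one plausible PyTorch realization under that reading, using learned latent query tokens; the class name, dimensions, and hyperparameters (`text_dim`, `vision_dim`, `num_latents`, `num_heads`) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class LatentImaginationModule(nn.Module):
    """Sketch of a LIM-style module: learned latent queries cross-attend
    to text-token embeddings, producing "imagined" latent embeddings that
    replace vision-encoder outputs in a frozen VLM. All hyperparameters
    here are illustrative assumptions, not the paper's specification."""

    def __init__(self, text_dim: int = 768, vision_dim: int = 1024,
                 num_latents: int = 32, num_heads: int = 8):
        super().__init__()
        # Learned queries that stand in for the missing image tokens.
        self.latents = nn.Parameter(torch.randn(num_latents, vision_dim) * 0.02)
        # Project text features into the vision embedding space.
        self.text_proj = nn.Linear(text_dim, vision_dim)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=vision_dim, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vision_dim)

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (batch, seq_len, text_dim) from the text encoder.
        batch = text_embeds.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        keys = self.text_proj(text_embeds)
        imagined, _ = self.cross_attn(queries, keys, keys)
        # (batch, num_latents, vision_dim): imagined embeddings to feed
        # into the frozen VLM in place of real image features.
        return self.norm(imagined + queries)


# Smoke test with a stand-in for text-encoder output.
lim = LatentImaginationModule()
text_embeds = torch.randn(2, 16, 768)
print(lim(text_embeds).shape)  # torch.Size([2, 32, 1024])
```

Because the backbone stays frozen, only the LIM parameters (latent queries, projection, attention, and norm) would be trained, which is consistent with the module being described as lightweight.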