Variational Adapter Improves Cross-Modal Similarity in Vision-Language Models
A new method, Variational Adapter for Cross-modal Similarity Representation (VACSR), targets false negatives in vision-language models. Current image-text matching datasets often lack fine-grained annotations, so inherently continuous similarity is collapsed into binary match/non-match labels, which impairs generalization. VACSR reformulates the task as variational inference: it constructs a latent space for cross-modal similarity and uses regularization to cope with flawed annotations. The approach is detailed in a paper on arXiv (2605.30968).
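To make the false-negative problem concrete, here is a minimal sketch with made-up toy embeddings. It is not taken from the paper; the embeddings, the `threshold` value, and the helper name `cosine_similarity_matrix` are all assumptions chosen for illustration. It shows how binary labels (only the annotated pair counts as a match) can mark a pair as negative even though its continuous similarity is high.

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Pairwise cosine similarity between rows of two embedding matrices."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return image_emb @ text_emb.T

# Toy embeddings (hypothetical): items 0 and 1 describe nearly the same scene.
image_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
text_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

similarity = cosine_similarity_matrix(image_emb, text_emb)

# Binary annotation: only the paired caption (the diagonal) is a positive.
binary_labels = np.eye(3)

# Pairs labeled negative yet highly similar are false-negative candidates.
threshold = 0.9  # hypothetical cutoff, chosen only for this toy example
false_negative_candidates = (binary_labels == 0) & (similarity > threshold)

print(np.argwhere(false_negative_candidates))
```

In this toy batch, image 0 with caption 1 (and the reverse) is labeled a non-match while being almost identical in embedding space, which is the kind of pair a binary objective would wrongly push apart.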
Key facts
- VACSR stands for Variational Adapter for Cross-modal Similarity Representation
- It addresses false negatives in image-text matching
- Current datasets lack fine-grained cross-modal annotations
- The method uses variational inference to model similarity
- It constructs a latent space for cross-modal similarity
- Regularization is used to allocate uncertainty where annotations are flawed
- The paper is available on arXiv with ID 2605.30968
- The approach aims to improve generalization in cross-modal tasks
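The variational ingredients listed above can be sketched generically as follows. This is not the VACSR architecture, which the summary does not describe; it is a standard Gaussian latent-variable head under stated assumptions: linear projections `w_mu` and `w_logvar` (hypothetical names), a standard normal prior, and a KL term as the regularizer. The predicted variance is one way a model could express uncertainty about an ambiguous pair.

```python
import numpy as np

def variational_similarity(image_emb, text_emb, w_mu, w_logvar, rng):
    """Map an image-text pair to a Gaussian over a latent similarity variable."""
    pair = np.concatenate([image_emb, text_emb])
    mu = w_mu @ pair          # mean of the latent similarity
    logvar = w_logvar @ pair  # log-variance, i.e. the modeled uncertainty
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterization trick
    return z, mu, logvar

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Hypothetical sizes: 4-dimensional embeddings, 2-dimensional latent space.
rng = np.random.default_rng(0)
image_emb = rng.standard_normal(4)
text_emb = rng.standard_normal(4)
w_mu = 0.1 * rng.standard_normal((2, 8))
w_logvar = 0.1 * rng.standard_normal((2, 8))

z, mu, logvar = variational_similarity(image_emb, text_emb, w_mu, w_logvar, rng)
kl = kl_to_standard_normal(mu, logvar)
print(z.shape, float(kl))
```

In a training loop, a term like `kl` would typically be added to a matching loss computed from `z`, so that the latent distribution stays close to the prior unless the data justify otherwise.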
Entities
Institutions
- arXiv