SpecPL: Spectral Prompt Learning for Vision-Language Models
SpecPL is a prompt learning technique for vision-language models (VLMs) that tackles modality asymmetry by disentangling spectral granularity. Existing methods typically optimize text tokens against a static visual encoder, overlooking fine-grained spectral structure. SpecPL instead employs a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency components. A frozen Visual Semantic Bank anchors text representations to the low-frequency invariants, reducing overfitting, while counterfactual granule training permutes high-frequency signals across samples, forcing the model to separate visual granularity from semantic invariance. The method is described in a paper available on arXiv, identified by ID 2605.04504.
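The low/high-frequency split at the heart of the method can be illustrated with a simple frequency-domain decomposition. This is a hypothetical sketch: the paper operates on frozen VAE latents, whereas here a plain FFT low-pass filter on a 2-D signal stands in for that step, and the `cutoff` parameter is an assumption of this illustration, not something specified by SpecPL.

```python
import numpy as np

def spectral_split(x: np.ndarray, cutoff: float = 0.25):
    """Split a 2-D signal into low- and high-frequency bands.

    Hypothetical stand-in for SpecPL's decomposition: the paper splits
    frozen-VAE latents, not raw pixels. `cutoff` is the fraction of the
    Nyquist radius retained as the "low-frequency" (semantic) band.
    """
    F = np.fft.fftshift(np.fft.fft2(x))      # centered 2-D spectrum
    h, w = x.shape
    yy, xx = np.mgrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    mask = radius <= cutoff * min(h, w) / 2  # circular low-pass mask
    low = np.fft.ifft2(np.fft.ifftshift(F * mask)).real
    high = x - low                           # residual = high-frequency band
    return low, high

# The two bands sum back to the original signal by construction.
img = np.random.default_rng(0).standard_normal((32, 32))
low, high = spectral_split(img)
assert np.allclose(low + high, img)
```

Because the high band is defined as the residual, the decomposition is exact: any information removed from the low-frequency band is preserved in the high-frequency one.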
Key facts
- SpecPL stands for Disentangling Spectral Granularity for Prompt Learning.
- It addresses modality asymmetry in VLM prompt learning.
- Uses a frozen VAE to decompose visual signals.
- Separates signals into low-frequency (semantic) and high-frequency (granular) bands.
- Employs a frozen Visual Semantic Bank for low-frequency anchoring.
- Counterfactual granule training permutes high-frequency signals.
- Paper available on arXiv with ID 2605.04504.
- Published on arXiv as a cross-listed submission.
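The counterfactual granule training mentioned above can be sketched as a batch-wise recombination: each sample keeps its own low-frequency (semantic) band but receives another sample's high-frequency band, so the label follows semantics while granularity varies. The function name, shapes, and permutation scheme here are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def counterfactual_granules(low: np.ndarray, high: np.ndarray, rng=None):
    """Recombine low-frequency bands with permuted high-frequency bands.

    Hypothetical sketch of counterfactual granule training: inputs are
    per-sample band tensors of shape (batch, H, W). Each output sample
    pairs its own semantic (low) band with a randomly chosen sample's
    granular (high) band, pushing a model trained on the result to treat
    high-frequency detail as nuisance variation.
    """
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(low.shape[0])     # shuffle high bands across the batch
    return low + high[perm], perm

# Toy check: zero semantic bands make the permutation directly visible.
batch_low = np.zeros((4, 8, 8))
batch_high = np.arange(4).reshape(4, 1, 1) * np.ones((4, 8, 8))
mixed, perm = counterfactual_granules(batch_low, batch_high)
assert all(np.allclose(mixed[i], batch_high[perm[i]]) for i in range(4))
```

In a real training loop the mixed samples would be re-encoded and supervised with the labels of their low-frequency sources, which is what enforces the granularity/invariance separation the summary describes.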
Entities
Institutions
- arXiv