SpecPL: Spectral Prompt Learning for Vision-Language Models
SpecPL is a prompt learning technique for vision-language models (VLMs) that tackles modality asymmetry by disentangling spectral granularity. Existing methods typically optimize text tokens against a static visual encoder, overlooking fine-grained spectral structure. SpecPL instead employs a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency components. A frozen Visual Semantic Bank anchors text representations to the low-frequency invariants, reducing overfitting, while counterfactual granule training permutes high-frequency signals across samples, forcing the model to separate visual granularity from semantic invariance. The method is described in a paper available on arXiv, identified by ID 2605.04504.
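The low/high-frequency split at the heart of the method can be illustrated with a simple frequency-domain decomposition. This is a hypothetical sketch: the paper operates on frozen VAE latents, whereas here a plain FFT low-pass filter on a 2-D signal stands in for that step, and the `cutoff` parameter is an assumption of this illustration, not something specified by SpecPL.

```python
import numpy as np

def spectral_split(x: np.ndarray, cutoff: float = 0.25):
    """Split a 2-D signal into low- and high-frequency bands.

    Hypothetical stand-in for SpecPL's decomposition: the paper splits
    frozen-VAE latents, not raw pixels. `cutoff` is the fraction of the
    Nyquist radius retained as the "low-frequency" (semantic) band.
    """
    F = np.fft.fftshift(np.fft.fft2(x))      # centered 2-D spectrum
    h, w = x.shape
    yy, xx = np.mgrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    mask = radius <= cutoff * min(h, w) / 2  # circular low-pass mask
    low = np.fft.ifft2(np.fft.ifftshift(F * mask)).real
    high = x - low                           # residual = high-frequency band
    return low, high

# The two bands sum back to the original signal by construction.
img = np.random.default_rng(0).standard_normal((32, 32))
low, high = spectral_split(img)
assert np.allclose(low + high, img)
```

Because the high band is defined as the residual, the decomposition is exact: any information removed from the low-frequency band is preserved in the high-frequency one.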
Key facts
- SpecPL stands for Disentangling Spectral Granularity for Prompt Learning.
- It addresses modality asymmetry in VLM prompt learning.
- Uses a frozen VAE to decompose visual signals.
- Separates signals into low-frequency (semantic) and high-frequency (granular) bands.
- Employs a frozen Visual Semantic Bank for low-frequency anchoring.
- Counterfactual granule training permutes high-frequency signals.
- Paper available on arXiv with ID 2605.04504.
- Published on arXiv as a cross-listed submission.
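The counterfactual granule training mentioned above can be sketched as a batch-wise recombination: each sample keeps its own low-frequency (semantic) band but receives another sample's high-frequency band, so the label follows semantics while granularity varies. The function name, shapes, and permutation scheme here are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def counterfactual_granules(low: np.ndarray, high: np.ndarray, rng=None):
    """Recombine low-frequency bands with permuted high-frequency bands.

    Hypothetical sketch of counterfactual granule training: inputs are
    per-sample band tensors of shape (batch, H, W). Each output sample
    pairs its own semantic (low) band with a randomly chosen sample's
    granular (high) band, pushing a model trained on the result to treat
    high-frequency detail as nuisance variation.
    """
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(low.shape[0])     # shuffle high bands across the batch
    return low + high[perm], perm

# Toy check: zero semantic bands make the permutation directly visible.
batch_low = np.zeros((4, 8, 8))
batch_high = np.arange(4).reshape(4, 1, 1) * np.ones((4, 8, 8))
mixed, perm = counterfactual_granules(batch_low, batch_high)
assert all(np.allclose(mixed[i], batch_high[perm[i]]) for i in range(4))
```

In a real training loop the mixed samples would be re-encoded and supervised with the labels of their low-frequency sources, which is what enforces the granularity/invariance separation the summary describes.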
Entities
Institutions
- arXiv