LRA-EE: Early Exit Bypasses Quantization Collapse in CLIP
A recent study on arXiv (2605.26415) identifies a failure mode in quantized CLIP models, termed Quantization-Induced Representation Collapse (QIRC). In the INT8 CLIP ViT-B/32 model, activation noise accumulates across transformer layers and degrades cosine alignment during zero-shot retrieval: the noise-to-signal ratio rises from under 10% in the early layers to 52% by Layer 11. To address this, the authors propose LRA-EE (Layer-wise Representation-Aware Early Exit), which combines Spatio-Semantic Aggregation, a learned multi-feature gate, and a Layer-adaptive Confidence Threshold to exit before the noise-dominated deeper layers.
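To make the noise-to-signal figure concrete, here is a minimal NumPy sketch of how such a ratio could be measured for a single activation vector. It rests on assumptions not stated in the summary: the ratio is taken as the norm of the quantization error divided by the norm of the full-precision activation, and quantization is simulated with symmetric per-tensor INT8 rounding. The function names are hypothetical and the paper's exact definition may differ.

```python
import numpy as np

def fake_quant_int8(x: np.ndarray) -> np.ndarray:
    """Simulate symmetric per-tensor INT8 quantization (quantize, then dequantize)."""
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return x.copy()  # nothing to quantize
    scale = max_abs / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

def noise_to_signal(reference: np.ndarray, quantized: np.ndarray) -> float:
    """Assumed definition: ||quantized - reference|| / ||reference||."""
    return float(np.linalg.norm(quantized - reference) / np.linalg.norm(reference))

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in for one layer's activation; a real measurement would compare
# FP32 and INT8 activations captured from the same layer of the model.
rng = np.random.default_rng(0)
x = rng.standard_normal(512)
x_q = fake_quant_int8(x)

nsr = noise_to_signal(x, x_q)
cos = cosine(x, x_q)
print(f"noise-to-signal: {nsr:.4f}, cosine alignment: {cos:.6f}")
```

A single quantization step on synthetic data gives only a small ratio. The effect described in the study concerns error compounding across successive layers, which this one-step sketch does not reproduce.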
Key facts
- arXiv:2605.26415v1
- INT8 quantization introduces a failure mode in CLIP that is distinct from the failure modes of quantized CNN classifiers
- Activation noise perturbs multimodal embedding direction
- The paper characterizes this failure mode as Quantization-Induced Representation Collapse (QIRC)
- Noise-to-signal ratio grows from below 10% to 52% at Layer 11 in INT8 CLIP ViT-B/32
- LRA-EE (Layer-wise Representation-Aware Early Exit) is proposed
- Spatio-Semantic Aggregation replaces the immature shallow-layer [CLS] token with a global patch-token average
- Learned multi-feature gate uses confidence, top-2 margin, and spatial-activation variance
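
The last two points can be sketched in a few lines of NumPy. This is an illustrative reading of the summary, not the authors' implementation: the token layout (row 0 is [CLS]), the logit scale of 100, the logistic form of the gate, and the names `aggregate_patches`, `gate_features`, and `should_exit` are all assumptions. In practice the gate weights would be learned, and the threshold would be set per layer.

```python
import numpy as np

def aggregate_patches(tokens: np.ndarray) -> np.ndarray:
    """Average the patch tokens (rows 1..N) instead of using [CLS] (row 0).

    tokens has shape (1 + num_patches, dim). Returns a unit-norm embedding.
    """
    pooled = tokens[1:].mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def gate_features(tokens: np.ndarray, text_embeds: np.ndarray,
                  logit_scale: float = 100.0) -> np.ndarray:
    """Build the three gate inputs: confidence, top-2 margin, spatial variance.

    text_embeds has shape (num_prompts, dim) with unit-norm rows and
    at least two prompts. logit_scale is an assumed temperature.
    """
    image = aggregate_patches(tokens)
    logits = logit_scale * (text_embeds @ image)
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    probs = exp / exp.sum()
    second, first = np.sort(probs)[-2:]          # two largest probabilities
    spatial_var = float(tokens[1:].var(axis=0).mean())
    return np.array([first, first - second, spatial_var])

def should_exit(features: np.ndarray, weights: np.ndarray,
                bias: float, threshold: float) -> bool:
    """Logistic gate: exit at this layer if the gate score reaches the threshold."""
    score = 1.0 / (1.0 + np.exp(-(features @ weights + bias)))
    return bool(score >= threshold)

# Synthetic stand-ins for one intermediate layer's tokens and three text prompts.
rng = np.random.default_rng(0)
layer_tokens = rng.standard_normal((50, 512))
prompts = rng.standard_normal((3, 512))
prompts /= np.linalg.norm(prompts, axis=1, keepdims=True)

feats = gate_features(layer_tokens, prompts)
# Placeholder gate parameters; real values would come from training.
decision = should_exit(feats, weights=np.array([2.0, 2.0, -1.0]), bias=-3.0, threshold=0.5)
print(feats, decision)
```

Calling `should_exit` once per layer, with a threshold that depends on the layer index, would give the early-exit loop: inference stops at the first layer whose gate fires.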
Entities
Institutions
- arXiv