Research Paper Analyzes Collapse in Training-Free Token Reduction Methods for Vision Transformers
A new research paper published on arXiv (ID: 2604.16745v1) investigates why training-free token reduction methods for Vision Transformers collapse suddenly at high compression rates. The study examines ToMe, ToFu, PiToMe, and MCTF, all of which show a similar cliff-like failure pattern despite using different scoring mechanisms.

The researchers developed a diagnostic framework with two analytical tools: ranking consistency (ρ_s) and off-diagonal correlation (ρ_off). The framework attributes the collapse to two factors. The first is a signal-agnostic error amplifier inherent to layer-wise reduction, which predicts convex Pareto curves and critical reduction ratios proportional to 1/L. The second is a shared dependence on pairwise similarity signals, whose ranking consistency drops from ρ_s=0.88 to 0.27 in deeper network layers. According to the paper, pairwise rankings are inherently unstable because they are subject to O(N_p^2) joint perturbations, whereas unary signals are more stable because they face only O(N_p) perturbations, which behave according to the Central Limit Theorem.

From this diagnosis, the researchers derived three design principles and built CATIS, a system based on unary signals, as a constructive validation. The findings help explain the limitations of current token reduction approaches for Vision Transformers.
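This summary does not reproduce the paper's exact definition of ρ_s. As a rough illustration only, the sketch below assumes ranking consistency can be measured as a Spearman rank correlation between scores computed before and after a perturbation, and applies it to a pairwise signal (cosine similarity over all token pairs) and a unary signal (one score per token). The function names, the Gaussian noise model, and the use of the L2 norm as the unary score are all hypothetical choices made for this example; they are not the authors' implementation and the output is not expected to match the reported values.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation of two 1-D score vectors (ties assumed negligible)."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def pairwise_scores(x):
    """Cosine similarity of every unordered token pair: N(N-1)/2 scores."""
    normed = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = normed @ normed.T
    upper = np.triu_indices(x.shape[0], k=1)  # skip the diagonal and mirrored pairs
    return sim[upper]

def unary_scores(x):
    """One score per token: N scores. The L2 norm is a hypothetical stand-in."""
    return np.linalg.norm(x, axis=1)

def ranking_consistency(score_fn, x, noise_std, rng):
    """Rank correlation between scores on clean and noise-perturbed tokens."""
    noisy = x + noise_std * rng.standard_normal(x.shape)
    return spearman(score_fn(x), score_fn(noisy))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((64, 32))  # 64 synthetic tokens, 32 dimensions
    for name, fn in [("pairwise", pairwise_scores), ("unary", unary_scores)]:
        print(name, ranking_consistency(fn, tokens, noise_std=0.3, rng=rng))
```

Running the same measurement on real token embeddings taken from successive layers, rather than on synthetic noise, would be closer to the layer-wise comparison the paper describes.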
Key facts
- Research paper published on arXiv with ID 2604.16745v1
- Analyzes training-free token reduction methods for Vision Transformers
- Examines ToMe, ToFu, PiToMe, and MCTF methods
- All methods show similar cliff-like collapse at high compression
- Introduces a diagnostic framework with two tools: ranking consistency (ρ_s) and off-diagonal correlation (ρ_off)
- Identifies signal-agnostic error amplifier in layer-wise reduction
- Pairwise similarity signals degrade from ρ_s=0.88 to 0.27 in deep layers
- Presents CATIS, a system built on unary signals, as constructive validation of the diagnosis
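The 1/L scaling of the critical reduction ratio can be made more concrete with a simple compounding model. This is not the paper's derivation: it assumes that L is the number of layers, that each layer removes the same fraction of the tokens it receives, and that collapse is tied to a fixed overall retained fraction. Under those assumptions, the per-layer ratio needed to reach a given overall budget shrinks roughly as 1/L. The function names are hypothetical.

```python
import math

def retained_fraction(per_layer_ratio, num_layers):
    """Fraction of tokens left after every layer drops the same fraction."""
    return (1.0 - per_layer_ratio) ** num_layers

def per_layer_ratio_for(target_retained, num_layers):
    """Per-layer drop ratio that leaves `target_retained` of the tokens overall."""
    return 1.0 - target_retained ** (1.0 / num_layers)

if __name__ == "__main__":
    target = 0.25  # assumed overall budget: keep a quarter of the tokens
    for num_layers in (12, 24, 48):
        ratio = per_layer_ratio_for(target, num_layers)
        # ratio * num_layers approaches -ln(target) as depth grows,
        # so the ratio itself scales approximately as 1 / num_layers.
        print(num_layers, ratio, ratio * num_layers, -math.log(target))
```

In this toy model, doubling the depth roughly halves the per-layer ratio at which a fixed overall budget is reached, which is one intuition for why deeper models would hit the cliff at smaller per-layer ratios.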
Entities
Institutions
- arXiv