RotateK: Rotation-Based Key Channel Pruning for Efficient VLM Inference
Researchers have introduced RotateK, a rotation-based framework for structured Key channel pruning that aims to relieve KV cache pressure during Vision-Language Model (VLM) inference. VLMs encode a single image into thousands of tokens, so the KV cache consumes significant memory. Existing token pruning techniques discard visual content outright, which hurts fine-grained perception tasks. RotateK instead exploits feature sparsity to compress the channel dimension, so more visual tokens fit within a fixed KV cache budget. It applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, which allows accurate pruning under a lightweight, hardware-friendly head-wise structure. The approach thereby combines the expressiveness of unstructured token-wise pruning with the hardware-friendliness of head-wise methods. Full details are in arXiv:2605.19218.
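The rotate-then-prune idea can be illustrated with a minimal NumPy sketch for a single attention head. This is not the paper's implementation: the function names `pca_rotation` and `prune_key_channels` are hypothetical, the rotation is computed in one batch rather than online, and it assumes uncentered PCA (an eigendecomposition of the Key second-moment matrix). The toy Keys are deliberately low-rank, so keeping only the leading rotated channels reproduces the attention scores almost exactly.

```python
import numpy as np


def pca_rotation(keys: np.ndarray) -> np.ndarray:
    """Return an orthogonal (d, d) matrix whose columns are the principal
    directions of `keys`, ordered from largest to smallest energy.
    Assumption: uncentered PCA on the Key second-moment matrix."""
    gram = keys.T @ keys                     # (d, d) second-moment matrix
    _, eigvecs = np.linalg.eigh(gram)        # eigenvalues come back ascending
    return eigvecs[:, ::-1]                  # reverse columns to descending order


def prune_key_channels(queries: np.ndarray, keys: np.ndarray, keep: int):
    """Rotate queries and keys into the shared basis, then keep the same
    leading `keep` channels for every token (a head-wise structured prune)."""
    rot = pca_rotation(keys)
    q_small = (queries @ rot)[:, :keep]
    k_small = (keys @ rot)[:, :keep]
    return q_small, k_small


# Toy data (hypothetical sizes): 512 key tokens, head dimension 64, true rank 8.
rng = np.random.default_rng(0)
n_tokens, head_dim, rank = 512, 64, 8
keys = rng.standard_normal((n_tokens, rank)) @ rng.standard_normal((rank, head_dim))
queries = rng.standard_normal((4, head_dim))

q_small, k_small = prune_key_channels(queries, keys, keep=rank)
full_scores = queries @ keys.T               # scores with all 64 channels
pruned_scores = q_small @ k_small.T          # scores with 8 rotated channels
max_error = float(np.abs(full_scores - pruned_scores).max())
```

Because the rotation is orthogonal, applying it to both queries and keys leaves the attention scores unchanged before truncation; any error comes only from the dropped channels. Real Key activations are not exactly low-rank, so the error there depends on how much energy falls outside the retained subspace.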
Key facts
- RotateK is a rotation-based structured Key channel pruning framework for VLMs.
- VLMs suffer KV cache pressure because a single image encodes into thousands of tokens.
- Token pruning permanently discards visual content, harming fine-grained perception tasks.
- RotateK compresses the channel dimension to preserve more visual tokens at the same memory cost.
- It uses an online PCA-based rotation to align channel importance into a shared subspace.
- The method enables accurate pruning under a lightweight, hardware-friendly head-wise structure.
- Prior key channel pruning methods faced a trade-off between expressiveness and hardware-friendliness.
- The paper is available on arXiv with ID 2605.19218.
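To make the memory trade-off concrete, the following back-of-the-envelope sketch counts how many tokens fit in a fixed per-head KV cache budget when only the Key channels are pruned. All numbers are hypothetical and not taken from the paper: a head dimension of 128, a budget sized for 1,000 unpruned tokens, and 64 retained Key channels. The helper `tokens_within_budget` is an illustrative name, and the calculation ignores any overhead for storing the rotation itself.

```python
def tokens_within_budget(budget_elems: int, key_channels: int, value_channels: int) -> int:
    """Number of tokens whose Key and Value vectors fit in `budget_elems`
    cache elements for one attention head."""
    return budget_elems // (key_channels + value_channels)


head_dim = 128                                   # hypothetical head dimension
baseline_tokens = 1000                           # tokens that fit without pruning
budget = baseline_tokens * (head_dim + head_dim) # Keys plus Values, in elements

# Keep 64 of 128 Key channels; Values are left untouched in this sketch.
pruned_tokens = tokens_within_budget(budget, key_channels=64, value_channels=head_dim)
```

Under these assumptions the same budget holds 1,333 tokens instead of 1,000, which is the sense in which channel pruning preserves more visual tokens at the same memory cost.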
Entities
Institutions
- arXiv