RotateK: Rotation-Based Key Channel Pruning for Efficient VLM Inference

ai-technology · 2026-05-20

Researchers have introduced RotateK, a rotation-based framework for structured Key channel pruning that aims to relieve KV cache pressure during Vision-Language Model (VLM) inference. VLMs encode a single image into thousands of tokens, which makes the KV cache a major memory cost. Existing token pruning techniques permanently discard visual content, which hurts fine-grained perception tasks. RotateK instead exploits feature sparsity to compress the channel dimension, so more visual tokens fit within a fixed KV cache budget. It applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, which allows accurate pruning under a lightweight, hardware-friendly head-wise structure. The approach targets the trade-off that prior Key channel pruning methods faced between the expressiveness of unstructured token-wise pruning and the hardware-friendliness of head-wise methods. Full details are in arXiv:2605.19218.
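
The rotate-then-prune idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the summary does not describe how the online PCA is computed or how pruning is scheduled, so the sketch assumes a one-shot PCA over the cached keys of a single attention head, synthetic low-rank key data, and hypothetical function names (`pca_rotation`, `prune_key_channels`). It shows only the underlying linear algebra: an orthogonal rotation leaves the query-key scores unchanged, and after rotation most of the key energy sits in the leading channels, so the trailing channels can be dropped for every token of the head at once.

```python
import numpy as np

def pca_rotation(keys):
    """Orthogonal matrix whose columns are principal directions of `keys`,
    ordered from highest to lowest energy (assumed one-shot PCA, not online)."""
    second_moment = keys.T @ keys / keys.shape[0]      # (d, d), uncentered
    eigvals, eigvecs = np.linalg.eigh(second_moment)   # eigenvalues ascending
    order = np.argsort(eigvals)[::-1]                  # reorder to descending
    return eigvecs[:, order]

def prune_key_channels(keys, queries, keep):
    """Rotate keys and queries with the same matrix, keep the first `keep` channels."""
    rot = pca_rotation(keys)
    return keys @ rot[:, :keep], queries @ rot[:, :keep], rot

# Synthetic data (an assumption): keys lie near an 8-dimensional subspace.
rng = np.random.default_rng(0)
n_tokens, head_dim, keep = 1024, 64, 16
latent = rng.standard_normal((n_tokens, 8))
mixing = rng.standard_normal((8, head_dim))
keys = latent @ mixing + 0.01 * rng.standard_normal((n_tokens, head_dim))
queries = rng.standard_normal((4, head_dim))

exact = queries @ keys.T                               # full attention logits

# Rotation followed by head-wise channel pruning.
k_small, q_small, rot = prune_key_channels(keys, queries, keep)
rot_err = np.linalg.norm(exact - q_small @ k_small.T) / np.linalg.norm(exact)

# Baseline: drop the same number of channels without rotating first.
naive = queries[:, :keep] @ keys[:, :keep].T
naive_err = np.linalg.norm(exact - naive) / np.linalg.norm(exact)

print(f"relative logit error with rotation:    {rot_err:.4f}")
print(f"relative logit error without rotation: {naive_err:.4f}")
```

Because the same channel indices are kept for every token in the head, the pruned keys remain a dense `(n_tokens, keep)` array, which is the kind of regular layout that head-wise pruning is meant to preserve.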

Key facts

RotateK is a rotation-based structured Key channel pruning framework for VLMs.
VLMs suffer KV cache pressure because a single image encodes into thousands of tokens.
Token pruning permanently discards visual content, harming fine-grained perception tasks.
RotateK compresses the channel dimension to preserve more visual tokens at the same memory cost.
It uses an online PCA-based rotation to align channel importance into a shared subspace.
The method enables accurate pruning under a lightweight, hardware-friendly head-wise structure.
Prior Key channel pruning methods faced a trade-off between expressiveness and hardware-friendliness.
The paper is available on arXiv with ID 2605.19218.
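
The claim that channel compression preserves more tokens at the same memory cost can be made concrete with a small back-of-the-envelope calculation. Every number below is a hypothetical configuration chosen for illustration, not a figure from the paper, and the sketch assumes that only Key channels are pruned, that Values stay at full width, and that the storage for the rotation matrices themselves is negligible.

```python
def tokens_within_budget(budget_bytes, head_dim, kept_key_channels,
                         n_layers, n_kv_heads, bytes_per_value=2):
    """Number of tokens whose KV cache fits in `budget_bytes`.

    Per token, each KV head in each layer stores `kept_key_channels` Key values
    and `head_dim` Value values (Values are assumed unpruned).
    """
    per_token = n_layers * n_kv_heads * (kept_key_channels + head_dim) * bytes_per_value
    return budget_bytes // per_token

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16, 1 GiB budget.
budget = 2**30
full = tokens_within_budget(budget, head_dim=128, kept_key_channels=128,
                            n_layers=32, n_kv_heads=8)
pruned = tokens_within_budget(budget, head_dim=128, kept_key_channels=32,
                              n_layers=32, n_kv_heads=8)

print(full, pruned)  # 8192 13107
```

Under these assumptions, keeping 32 of 128 Key channels lets roughly 1.6 times as many tokens fit in the same budget, since the per-token cost falls from 2 × 128 to 32 + 128 stored values per head.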
