ARTFEED — Contemporary Art Intelligence

RotateK: Rotation-Based Key Channel Pruning for Efficient VLM Inference

ai-technology · 2026-05-20

Researchers have introduced RotateK, a rotation-based framework for structured Key channel pruning that aims to relieve KV cache pressure during inference in Vision-Language Models (VLMs). These models encode a single image into thousands of tokens, so the KV cache consumes significant memory. Existing token pruning techniques permanently discard visual content, which hurts fine-grained perception tasks. RotateK instead exploits feature sparsity to compress the channel dimension, so that more visual tokens fit within a fixed KV cache budget. It uses an online PCA-based rotation to align token-dependent channel importance into a shared low-dimensional subspace, which enables accurate pruning under a lightweight, hardware-friendly head-wise structure. The approach aims to combine the expressiveness of unstructured token-wise pruning with the robustness and hardware-friendliness of head-wise methods. Full details are in arXiv:2605.19218.
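The rotate-then-prune idea can be illustrated with a small NumPy sketch. This is not the paper's implementation: it assumes a single attention head, computes the PCA rotation offline from a batch of synthetic keys rather than online, and uses hypothetical helper names (pca_rotation, prune_key_channels, approx_scores). It relies only on the fact that an orthogonal rotation applied to both queries and keys leaves attention scores unchanged, so dropping the low-variance rotated channels approximates the original scores.

```python
import numpy as np

def pca_rotation(keys):
    """Orthogonal matrix whose columns are principal directions of `keys`,
    sorted by decreasing energy. (Hypothetical helper, offline PCA.)"""
    gram = keys.T @ keys                      # (head_dim, head_dim) second-moment matrix
    eigvals, eigvecs = np.linalg.eigh(gram)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    return eigvecs[:, order]

def prune_key_channels(keys, rotation, keep):
    """Rotate keys and keep only the first `keep` channels for the whole head."""
    return (keys @ rotation)[:, :keep]

def approx_scores(query, pruned_keys, rotation):
    """Attention logits using the same rotation and truncation on the query."""
    keep = pruned_keys.shape[1]
    q_rot = (query @ rotation)[:keep]
    return pruned_keys @ q_rot

# Synthetic data (assumption): keys that are close to low-rank, plus small noise.
rng = np.random.default_rng(0)
num_tokens, head_dim, rank = 512, 64, 8
basis = rng.standard_normal((rank, head_dim))
keys = rng.standard_normal((num_tokens, rank)) @ basis
keys += 0.01 * rng.standard_normal((num_tokens, head_dim))
query = rng.standard_normal(head_dim)

rotation = pca_rotation(keys)
exact = keys @ query                                   # full 64-channel logits
pruned = prune_key_channels(keys, rotation, keep=16)   # cache 16 channels instead of 64
approx = approx_scores(query, pruned, rotation)
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"cached key shape: {pruned.shape}, relative logit error: {rel_err:.4f}")
```

Because every token in the head shares one rotation and one set of kept channels, the pruned cache stays a dense matrix, which is what makes a head-wise structure convenient for hardware. How well this works on real keys depends on how concentrated their energy is after rotation.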

Key facts

  • RotateK is a rotation-based structured Key channel pruning framework for VLMs.
  • VLMs suffer KV cache pressure because a single image encodes into thousands of tokens.
  • Token pruning permanently discards visual content, harming fine-grained perception tasks.
  • RotateK compresses the channel dimension to preserve more visual tokens at the same memory cost.
  • It uses an online PCA-based rotation to align channel importance into a shared subspace.
  • The method enables accurate pruning under a lightweight, hardware-friendly head-wise structure.
  • Prior Key channel pruning methods faced a trade-off between expressiveness and hardware-friendliness.
  • The paper is available on arXiv with ID 2605.19218.
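To make the memory trade-off concrete, the following back-of-the-envelope sketch shows how shrinking the Key channel dimension lets more tokens fit in a fixed KV cache budget. All model dimensions here (32 layers, 8 KV heads, head dimension 128, 2-byte elements, a 1 GiB budget) and the function names are illustrative assumptions, not figures from the paper, and the calculation ignores any overhead for storing rotation matrices.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, key_dim, value_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers and KV heads."""
    return num_layers * num_kv_heads * (key_dim + value_dim) * bytes_per_elem

def max_tokens(budget_bytes, **shape):
    """Largest whole number of tokens that fits in the budget."""
    return budget_bytes // kv_bytes_per_token(**shape)

# Hypothetical model shape and budget (assumptions for illustration only).
shape = dict(num_layers=32, num_kv_heads=8, value_dim=128)
budget = 1 * 1024**3  # 1 GiB

dense = max_tokens(budget, key_dim=128, **shape)   # unpruned keys
pruned = max_tokens(budget, key_dim=64, **shape)   # half of the key channels kept

print(dense)   # 8192
print(pruned)  # 10922
```

Under these assumptions, halving the Key channels raises capacity by about a third rather than doubling it, because the Value half of the cache is untouched.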

Entities

Institutions

  • arXiv
