OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance
OmniDrop is a training-free, layer-wise token pruning framework for omni-modal large language models. Rather than pruning at the input stage, it removes audiovisual tokens inside the LLM's decoder layers, tackling the token explosion caused by high-resolution audio and video inputs. Text queries guide the pruning, making it modality-agnostic and task-adaptive: early layers preserve sufficient omni-modal fusion before more aggressive token removal occurs in the deeper layers. This design addresses the shortcomings of existing methods that rely on audio-video similarity or temporal co-occurrence. The work is described in arXiv paper 2605.14458.
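To make the mechanism concrete, here is a minimal sketch of query-guided, layer-wise pruning. It is an illustration under stated assumptions, not the paper's implementation: the function names (`keep_ratio`, `prune_tokens`), the linear keep-ratio schedule, and the cosine-similarity scoring against the mean text-query embedding are all hypothetical choices standing in for whatever guidance signal and schedule OmniDrop actually uses.

```python
import numpy as np

def keep_ratio(layer_idx, num_layers, start=1.0, end=0.3):
    """Hypothetical schedule: keep all audiovisual tokens in the
    earliest layers (preserving omni-modal fusion) and linearly
    decay toward more aggressive pruning in deeper layers."""
    frac = layer_idx / max(num_layers - 1, 1)
    return start + frac * (end - start)

def prune_tokens(av_tokens, query_tokens, layer_idx, num_layers):
    """Score each audiovisual token by cosine similarity to the
    mean text-query embedding, then keep the top-scoring fraction
    for this layer. Shapes: av_tokens (N, d), query_tokens (M, d)."""
    q = query_tokens.mean(axis=0)
    q = q / (np.linalg.norm(q) + 1e-8)
    t = av_tokens / (np.linalg.norm(av_tokens, axis=1, keepdims=True) + 1e-8)
    scores = t @ q                           # query relevance per token
    k = max(1, int(round(keep_ratio(layer_idx, num_layers) * len(av_tokens))))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k, original order preserved
    return av_tokens[keep], keep
```

Because the keep ratio depends on the layer index, a shallow layer passes nearly all audiovisual tokens through, while a deep layer retains only the tokens most relevant to the text query.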
Key facts
- OmniDrop is a training-free, layer-wise token pruning framework.
- It prunes audiovisual tokens within the LLM decoder layers, not at input level.
- Text queries guide modality-agnostic and task-adaptive pruning.
- Addresses token explosion from high-resolution audio and video inputs.
- Designed for real-time applications and long-form reasoning.
- Proposed in arXiv paper 2605.14458.
- Overcomes unreliable assumptions of existing omni-modal token compression methods.
- Early layers preserve omni-modal information fusion before deeper layer pruning.
Entities
Institutions
- arXiv