OmniRefine: Training-Free Compression for Omni-LLMs
OmniRefine is a training-free, two-stage framework for efficiently compressing audio-visual tokens in omnimodal large language models (Omni-LLMs). It targets the high inference cost of long video streams and dense audio sequences while preserving cross-modal alignment. The first stage, Correspondence-Preserving Chunk Refinement, uses frame-audio similarity and dynamic programming to refine native chunk boundaries into coherent cross-modal units. The second stage, Modality-Aware Cooperative Compression, compresses video and audio tokens jointly. Together, the two stages improve inference efficiency while maintaining performance, avoiding the degradation in audio-video reasoning that fixed or native compression units can cause.
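The first stage could be sketched as a dynamic program over aligned frame and audio embeddings that chooses chunk boundaries maximizing within-chunk frame-audio coherence. This is a minimal illustration, not the paper's actual algorithm: the scoring function, embedding pooling, and all names (`refine_chunks`, `chunk_score`) are assumptions.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def chunk_score(frames, audio, s, e):
    # coherence of chunk [s, e): similarity between mean-pooled
    # frame embeddings and mean-pooled audio embeddings (an assumed proxy
    # for the paper's frame-audio similarity)
    return cosine(frames[s:e].mean(axis=0), audio[s:e].mean(axis=0))

def refine_chunks(frames, audio, num_chunks):
    """Partition T aligned steps into num_chunks contiguous chunks,
    maximizing total within-chunk frame-audio coherence via DP."""
    T = len(frames)
    NEG = float("-inf")
    # best[k][t]: best score for splitting the first t steps into k chunks
    best = [[NEG] * (T + 1) for _ in range(num_chunks + 1)]
    back = [[0] * (T + 1) for _ in range(num_chunks + 1)]
    best[0][0] = 0.0
    for k in range(1, num_chunks + 1):
        for t in range(k, T + 1):
            for s in range(k - 1, t):
                if best[k - 1][s] == NEG:
                    continue
                score = best[k - 1][s] + chunk_score(frames, audio, s, t)
                if score > best[k][t]:
                    best[k][t] = score
                    back[k][t] = s
    # backtrack to recover the chosen chunk end indices
    bounds, t = [], T
    for k in range(num_chunks, 0, -1):
        bounds.append(t)
        t = back[k][t]
    return sorted(bounds)
```

In practice the paper refines *native* chunk boundaries rather than partitioning from scratch, which would restrict the DP's candidate split points `s` to a neighborhood of those boundaries.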
Key facts
- OmniRefine is a training-free two-stage framework.
- It compresses audio-visual tokens in Omni-LLMs.
- First stage: Correspondence-Preserving Chunk Refinement.
- Second stage: Modality-Aware Cooperative Compression.
- Uses frame-audio similarity and dynamic programming.
- Aims to reduce inference cost for long video and dense audio.
- Preserves cross-modal correspondence.
- Published on arXiv with ID 2605.12056.
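The second stage's joint compression could, under one plausible reading, score each token by its strongest cross-modal match and keep the top fraction across both modalities under a shared budget. The scoring rule, the shared budget, and the function name `cooperative_compress` are all assumptions for illustration, not the paper's method.

```python
import numpy as np

def cooperative_compress(video_tokens, audio_tokens, keep_ratio):
    """Jointly prune video and audio tokens: score each token by its best
    cosine similarity to the other modality, then keep the top keep_ratio
    fraction across BOTH modalities under one shared budget."""
    v = video_tokens / np.linalg.norm(video_tokens, axis=1, keepdims=True)
    a = audio_tokens / np.linalg.norm(audio_tokens, axis=1, keepdims=True)
    sim = v @ a.T                  # cross-modal similarity matrix
    v_scores = sim.max(axis=1)     # each video token's best audio match
    a_scores = sim.max(axis=0)     # each audio token's best video match
    scores = np.concatenate([v_scores, a_scores])
    budget = max(1, int(keep_ratio * len(scores)))
    keep = np.argsort(scores)[-budget:]          # indices of kept tokens
    n_v = len(v_scores)
    v_keep = sorted(i for i in keep if i < n_v)
    a_keep = sorted(i - n_v for i in keep if i >= n_v)
    return video_tokens[v_keep], audio_tokens[a_keep]
```

Because the budget is shared, the split between modalities adapts per chunk: a chunk dominated by speech can retain more audio tokens, one dominated by visual action more video tokens.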
Entities
Institutions
- arXiv