ST-GridPool: Training-Free Visual Token Enhancement for Video LLMs
ST-GridPool is a training-free method that improves video understanding in Video Large Language Models (Video LLMs). It combines Pyramid Temporal Gridding (PTG), which models spatiotemporal interactions at multiple granularities, with Norm-based Spatial Pooling (NSP), which preserves high-information visual regions. The authors report consistent performance gains across benchmarks.
Key facts
- ST-GridPool is a training-free visual token enhancement method for Video LLMs.
- It integrates Pyramid Temporal Gridding (PTG) and Norm-based Spatial Pooling (NSP).
- PTG captures multi-grained spatiotemporal interactions through hierarchical temporal gridding.
- NSP exploits the correlation between a visual token's norm and its semantic richness to preserve high-information regions during pooling.
- The method addresses limitations of existing pooling and interpolation techniques.
- Experiments on various benchmarks show consistent performance improvements.
- The paper is available on arXiv with ID 2605.22078.
- The approach is designed specifically for Video Large Language Models.
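
Illustrative sketch: the summary above does not give the exact formulation of PTG or NSP, so the code below is only one possible reading, not the authors' implementation. The function names (norm_weighted_pool, temporal_pyramid), the k x k window, the norm-proportional weighting, and the group sizes (1, 2, 4) are all assumptions made for illustration. The first function shows how pooling weights can follow token norms instead of being uniform; the second shows what grouping frames at several temporal granularities could look like.

```python
import numpy as np


def norm_weighted_pool(tokens: np.ndarray, k: int = 2) -> np.ndarray:
    """Pool an (H, W, D) token grid over non-overlapping k x k windows.

    Assumption for illustration: each token is weighted by its L2 norm,
    so high-norm tokens dominate the pooled output. Plain average
    pooling would use equal weights instead.
    """
    H, W, D = tokens.shape
    if H % k or W % k:
        raise ValueError("H and W must be divisible by k")
    # Regroup into windows: (H/k, W/k, k*k, D)
    windows = (
        tokens.reshape(H // k, k, W // k, k, D)
        .transpose(0, 2, 1, 3, 4)
        .reshape(H // k, W // k, k * k, D)
    )
    norms = np.linalg.norm(windows, axis=-1)                      # (H/k, W/k, k*k)
    weights = norms / (norms.sum(axis=-1, keepdims=True) + 1e-8)  # sums to ~1 per window
    return (weights[..., None] * windows).sum(axis=2)             # (H/k, W/k, D)


def temporal_pyramid(num_frames: int, group_sizes=(1, 2, 4)) -> dict:
    """Return, for each hypothetical pyramid level, the frame-index groups
    obtained by splitting the clip into consecutive chunks of that size."""
    return {
        g: [list(range(s, min(s + g, num_frames))) for s in range(0, num_frames, g)]
        for g in group_sizes
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_tokens = rng.normal(size=(4, 4, 8))       # toy 4 x 4 grid of 8-dim tokens
    print(norm_weighted_pool(frame_tokens).shape)   # pooled grid shape
    print(temporal_pyramid(4))                      # frame groups per level
```

With a window that contains one high-norm token and three zero tokens, this weighting returns (almost exactly) the high-norm token, whereas average pooling would shrink it to a quarter of its magnitude. That is the intuition behind preferring norm-aware pooling when norms track semantic richness.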