ST-GridPool: Training-Free Visual Token Enhancement for Video LLMs

ai-technology · 2026-05-23

ST-GridPool is a new method that improves video understanding in Video Large Language Models (Video LLMs) without additional training. It combines Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions, with Norm-based Spatial Pooling (NSP), which preserves high-information visual regions. Reported experiments show consistent performance gains across benchmarks.

Key facts

ST-GridPool is a training-free visual token enhancement method for Video LLMs.
It integrates Pyramid Temporal Gridding (PTG) and Norm-based Spatial Pooling (NSP).
PTG captures multi-grained spatiotemporal interactions through hierarchical temporal gridding.
NSP leverages the correlation between token norms and semantic richness to preserve high-information visual regions.
The method addresses limitations of existing pooling and interpolation techniques.
Experiments on various benchmarks show consistent performance improvements.
The paper is available on arXiv with ID 2605.22078.
The approach is designed specifically for Video Large Language Models.
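
To make the NSP idea concrete, the sketch below pools a grid of visual tokens so that tokens with larger L2 norms contribute more to each pooled output. It is an illustration only, not the paper's implementation: the function name norm_weighted_pool, the non-overlapping window layout, and the norm-proportional weighting rule are all assumptions, since the summary states only that NSP relies on the correlation between token norms and semantic richness.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a token vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def norm_weighted_pool(grid, window=2):
    """Pool an H x W grid of token vectors over non-overlapping windows.

    Hypothetical sketch: each token in a window is weighted by its share
    of the window's total L2 norm, so high-norm tokens dominate the
    pooled result. Plain average pooling would weight them equally.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    dim = len(grid[0][0])

    pooled = []
    for top in range(0, height, window):
        row = []
        for left in range(0, width, window):
            # Collect the tokens that fall inside this window.
            cell = [
                grid[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            ]
            norms = [l2_norm(tok) for tok in cell]
            total = sum(norms)
            if total == 0.0:
                # All-zero window: fall back to uniform weights.
                weights = [1.0 / len(cell)] * len(cell)
            else:
                weights = [n / total for n in norms]
            row.append(
                [sum(w * tok[d] for w, tok in zip(weights, cell)) for d in range(dim)]
            )
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    # One 2 x 2 window holding a single non-zero token.
    grid = [
        [[3.0, 4.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 0.0]],
    ]
    print(norm_weighted_pool(grid, window=2))
```

In this toy input the single non-zero token receives all of the weight, so the pooled token equals that token, whereas plain average pooling would shrink it to a quarter of its magnitude. That difference is the kind of information loss the summary says the method aims to avoid.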

Sources

arXiv cs.AI — 2026-05-23