ARTFEED — Contemporary Art Intelligence

ST-GridPool: Training-Free Visual Token Enhancement for Video LLMs

ai-technology · 2026-05-23

ST-GridPool is a new training-free method that improves video understanding in Video Large Language Models (Video LLMs). It combines Pyramid Temporal Gridding (PTG), which models spatiotemporal interactions at multiple granularities, with Norm-based Spatial Pooling (NSP), which preserves high-information visual regions. The authors report consistent performance gains across the benchmarks they tested.

Key facts

  • ST-GridPool is a training-free visual token enhancement method for Video LLMs.
  • It integrates Pyramid Temporal Gridding (PTG) and Norm-based Spatial Pooling (NSP).
  • PTG captures multi-grained spatiotemporal interactions through hierarchical temporal gridding.
  • NSP exploits the correlation between a token's norm and its semantic richness to preserve high-information visual regions.
  • The method addresses limitations of existing pooling and interpolation techniques.
  • Experiments on various benchmarks show consistent performance improvements.
  • The paper is available on arXiv with ID 2605.22078.
  • The approach is designed specifically for Video Large Language Models.
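
To make the NSP idea concrete, here is a minimal sketch of norm-weighted spatial pooling in Python with NumPy. It is an illustration only, not the paper's implementation: the function name norm_weighted_pool, the 2x2 window, and the choice to weight each token by its L2 norm (normalized within the window) are all assumptions made for this example. The point it shows is that tokens with larger norms dominate the pooled result, whereas plain average pooling would dilute them.

```python
import numpy as np


def norm_weighted_pool(tokens: np.ndarray, window: int = 2) -> np.ndarray:
    """Pool an (H, W, D) grid of visual tokens down to (H/window, W/window, D).

    Hypothetical helper for illustration: each token in a window is weighted
    by its L2 norm, so high-norm tokens contribute more than low-norm ones.
    """
    h, w, d = tokens.shape
    if h % window or w % window:
        raise ValueError("grid height and width must be divisible by window")

    # Split the grid into non-overlapping windows: (H/window, W/window, window*window, D).
    blocks = tokens.reshape(h // window, window, w // window, window, d)
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    blocks = blocks.reshape(h // window, w // window, window * window, d)

    # L2 norm of every token, normalized so the weights in each window sum to ~1.
    norms = np.linalg.norm(blocks, axis=-1)
    weights = norms / (norms.sum(axis=-1, keepdims=True) + 1e-8)

    # Weighted sum over the tokens of each window.
    return (weights[..., None] * blocks).sum(axis=2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_tokens = rng.normal(size=(4, 6, 8))  # one frame: 4x6 grid, 8-dim tokens
    pooled = norm_weighted_pool(frame_tokens)
    print(pooled.shape)
```

Because the pooling uses only statistics of the tokens themselves, it adds no learned parameters, which is consistent with the training-free framing above. How the actual method turns norms into a pooling rule may differ from this sketch.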
