DynFrame: Adaptive Reasoning for Video Understanding
DynFrame is a framework that addresses two structural gaps in video multimodal large language models (MLLMs). First, existing methods sample each retrieved window at a fixed frame rate, so the model must issue repeated retrieval calls whenever it needs fine-grained evidence. DynFrame makes sampling density a learnable decision: the model emits both the temporal window and the sampling density as native tokens. Second, retrieval and answer generation are typically optimized with a single trajectory-level advantage, which conflates credit between steps that helped and steps that did not. DynFrame decouples the optimization of the two, so that credit can be assigned to each more precisely. The framework is described in arXiv paper 2605.26680.
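To make the idea of emitting a window and a sampling density concrete, here is a minimal Python sketch. It is not DynFrame's implementation: the `<retrieve start=... end=... fps=...>` syntax, the names `FrameRequest`, `parse_request`, and `sample_timestamps`, and the `max_frames` cap are all assumptions chosen for illustration, since the summary does not specify the actual token format.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical request syntax; the real DynFrame token format is not given here.
REQUEST_RE = re.compile(
    r"<retrieve start=(\d+(?:\.\d+)?) end=(\d+(?:\.\d+)?) fps=(\d+(?:\.\d+)?)>"
)


@dataclass
class FrameRequest:
    start: float  # window start, in seconds
    end: float    # window end, in seconds
    fps: float    # sampling density chosen by the model


def parse_request(text: str) -> Optional[FrameRequest]:
    """Extract the first retrieval request from generated text, if any."""
    match = REQUEST_RE.search(text)
    if match is None:
        return None
    start, end, fps = (float(g) for g in match.groups())
    if end <= start or fps <= 0:
        return None  # reject empty windows and non-positive densities
    return FrameRequest(start, end, fps)


def sample_timestamps(req: FrameRequest, max_frames: int = 32) -> List[float]:
    """Evenly spaced timestamps in the window at the requested density.

    max_frames is an assumed budget that bounds the visual context.
    """
    count = int((req.end - req.start) * req.fps)
    count = max(1, min(count, max_frames))
    step = (req.end - req.start) / count
    return [round(req.start + i * step, 3) for i in range(count)]


if __name__ == "__main__":
    generated = "I need a closer look. <retrieve start=10 end=12 fps=2>"
    request = parse_request(generated)
    if request is not None:
        print(sample_timestamps(request))
```

Under these assumptions, a single request can ask for a dense look at a short span or a sparse look at a long one, instead of issuing several fixed-rate retrievals over the same region.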
Key facts
- DynFrame is a framework for complex video understanding.
- It addresses fixed sampling density in existing video MLLMs.
- It makes the temporal window and the sampling density learnable decisions.
- It decouples the optimization of retrieval and answer generation.
- The paper is on arXiv with ID 2605.26680.
- The arXiv announcement type is "cross" (a cross-listing).
- The framework aims to reduce inference context length.
- It targets step-by-step reasoning with on-demand visual evidence.
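The difference between a single trajectory-level advantage and decoupled credit for retrieval and answer generation can be illustrated with a toy Python sketch. This is an assumption-laden illustration, not the paper's algorithm: the group-normalized baseline (subtract the group mean, divide by the standard deviation), the separate per-rollout retrieval and answer rewards, and the function names are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def normalize(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: center on the group mean, scale by std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def single_advantage(
    retrieval_rewards: List[float], answer_rewards: List[float]
) -> List[Tuple[float, float]]:
    """One advantage per rollout, applied to retrieval and answer steps alike."""
    totals = [r + a for r, a in zip(retrieval_rewards, answer_rewards)]
    return [(adv, adv) for adv in normalize(totals)]


def decoupled_advantages(
    retrieval_rewards: List[float], answer_rewards: List[float]
) -> List[Tuple[float, float]]:
    """Separate advantages: (retrieval steps, answer steps) per rollout."""
    return list(zip(normalize(retrieval_rewards), normalize(answer_rewards)))


if __name__ == "__main__":
    # Rollout 0: good retrieval, wrong answer. Rollout 1: the reverse.
    retrieval = [1.0, 0.0]
    answer = [0.0, 1.0]
    print(single_advantage(retrieval, answer))
    print(decoupled_advantages(retrieval, answer))
```

In this toy group the summed rewards are identical, so the single advantage is zero for both rollouts and carries no learning signal. The decoupled version rewards rollout 0 for its retrieval and penalizes it for its answer, and does the opposite for rollout 1.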
Entities
Institutions
- arXiv