DynFrame: Adaptive Reasoning for Video Understanding

ai-technology · 2026-05-27

A new framework called DynFrame addresses two structural gaps in video multimodal large language models (MLLMs). First, existing methods sample every retrieved window at a fixed frame rate, so gathering fine-grained evidence takes repeated retrieval calls. DynFrame makes sampling density a learnable decision: the model emits both the temporal window and the sampling density as native tokens. Second, retrieval and answer generation are typically optimized with a single trajectory-level advantage, which conflates credit for correct and incorrect steps. DynFrame decouples the optimization of retrieval from that of answer generation, so credit can be assigned more precisely. The framework is detailed in arXiv paper 2605.26680.
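To make the first idea concrete, here is a minimal sketch of how a retrieval call that carries both a window and a sampling density could be turned into frame timestamps. The `RetrievalCall` fields, the `frame_timestamps` helper, and the `max_frames` cap are all hypothetical names chosen for illustration; the summary above does not describe DynFrame's actual token format or sampler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalCall:
    """Hypothetical decoded form of one model-emitted retrieval request."""
    start_s: float  # window start, in seconds
    end_s: float    # window end, in seconds
    fps: float      # sampling density chosen by the model, frames per second


def frame_timestamps(call: RetrievalCall, max_frames: int = 64) -> list[float]:
    """Return evenly spaced timestamps inside the requested window.

    The frame count follows the requested density, capped by max_frames
    (an assumed budget that keeps the visual context bounded).
    """
    if call.end_s <= call.start_s or call.fps <= 0:
        raise ValueError("window must be non-empty and fps must be positive")
    duration = call.end_s - call.start_s
    n = max(1, min(max_frames, int(round(duration * call.fps))))
    step = duration / n
    # Sample at the centre of each of the n equal sub-intervals.
    return [call.start_s + (i + 0.5) * step for i in range(n)]


# A coarse scan of a full minute, then one dense look at a 4-second moment.
coarse = frame_timestamps(RetrievalCall(0.0, 60.0, 0.5))
fine = frame_timestamps(RetrievalCall(12.0, 16.0, 4.0))
```

With a fixed per-window frame rate, the dense look at seconds 12 to 16 would need several separate calls; here a single call with a higher `fps` covers it.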

Key facts

DynFrame is a framework for complex video understanding.
It addresses fixed sampling density in existing video MLLMs.
It makes temporal window and sampling density learnable decisions.
It decouples retrieval and answer generation optimization.
The paper is on arXiv with ID 2605.26680.
On arXiv, the paper's announcement type is cross, meaning a cross-listing.
The framework aims to reduce inference context length.
It targets step-by-step reasoning with on-demand visual evidence.
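The decoupling of retrieval and answer generation can be illustrated with a small sketch. It assumes a group of sampled rollouts, each scored with a separate retrieval reward and answer reward, and normalizes each reward within the group before assigning it only to tokens of the matching role. The group normalization, the reward names, and the `roles` labels are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_normalize(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Centre and scale rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_token_advantages(rollouts: list[dict]) -> list[list[float]]:
    """Assign per-token advantages from role-specific rewards.

    Each rollout is a dict with (hypothetical) keys:
      'roles'            - one label per token, 'retrieval' or 'answer'
      'retrieval_reward' - score for the evidence the rollout fetched
      'answer_reward'    - score for the final answer
    """
    adv_retrieval = group_normalize([r["retrieval_reward"] for r in rollouts])
    adv_answer = group_normalize([r["answer_reward"] for r in rollouts])
    per_token = []
    for rollout, a_ret, a_ans in zip(rollouts, adv_retrieval, adv_answer):
        row = []
        for role in rollout["roles"]:
            if role == "retrieval":
                row.append(a_ret)
            elif role == "answer":
                row.append(a_ans)
            else:
                raise ValueError(f"unknown token role: {role!r}")
        per_token.append(row)
    return per_token


# Rollout 0 fetched the right evidence but answered wrongly; rollout 1 did the opposite.
group = [
    {"roles": ["retrieval", "answer"], "retrieval_reward": 1.0, "answer_reward": 0.0},
    {"roles": ["retrieval", "answer"], "retrieval_reward": 0.0, "answer_reward": 1.0},
]
advantages = decoupled_token_advantages(group)
```

Under a single trajectory-level advantage, every token in a rollout shares one score. In this sketch the retrieval tokens of rollout 0 get a positive advantage while its answer tokens get a negative one, which is the kind of credit separation the summary describes.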
