UpstreamQA: Modular Framework for Explicit Reasoning in VideoQA
UpstreamQA is a modular framework designed to improve Video Question Answering (VideoQA) by making reasoning explicit. Current large multimodal models (LMMs) perform multi-step reasoning implicitly, which makes their answers opaque. Large reasoning models (LRMs) generate intermediate logical steps that aid interpretability, but they lack native video understanding and rely on static frames instead. UpstreamQA uses multimodal LRMs upstream for object identification and scene context generation, then passes the enriched reasoning traces to downstream LMMs for answering. Evaluated on the OpenEQA benchmark, the framework disentangles core video reasoning components, improving both transparency and accuracy. The paper is available on arXiv under ID 2604.23145.
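The two-stage pipeline described above can be sketched as follows. This is a minimal illustration of the upstream-to-downstream hand-off, not the paper's implementation: the function names, the `ReasoningTrace` structure, and the placeholder model calls are all hypothetical stand-ins for the multimodal LRM and the downstream LMM.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    """Enriched trace produced upstream and consumed downstream (hypothetical schema)."""
    question: str
    objects: list = field(default_factory=list)  # objects identified by the upstream LRM
    scene_context: str = ""                      # scene description generated upstream

def upstream_reason(frames, question):
    # Placeholder: a real multimodal LRM would identify objects in the sampled
    # frames and generate a scene description with explicit intermediate steps.
    objects = ["chair", "desk"]
    scene = "an office with a desk near the window"
    return ReasoningTrace(question=question, objects=objects, scene_context=scene)

def build_downstream_prompt(trace):
    # The downstream LMM receives the enriched trace as structured context,
    # so the reasoning that led to the answer stays explicit and inspectable.
    return (
        f"Objects: {', '.join(trace.objects)}\n"
        f"Scene: {trace.scene_context}\n"
        f"Question: {trace.question}\n"
        f"Answer:"
    )

trace = upstream_reason(frames=[], question="Is there a chair in the room?")
prompt = build_downstream_prompt(trace)
print(prompt)
```

Because the trace is an explicit intermediate artifact, each module can be inspected or swapped independently, which is the modularity the framework aims for.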
Key facts
- UpstreamQA is a modular framework for VideoQA.
- It uses explicit upstream reasoning modules.
- Current LMMs perform reasoning implicitly.
- LRMs generate intermediate logical steps.
- LRMs are not designed for native video understanding.
- UpstreamQA employs multimodal LRMs for object identification.
- It also generates scene context.
- Enriched reasoning traces are passed to downstream LMMs.
- Evaluated on OpenEQA.
- Paper ID: arXiv:2604.23145.