VideoSeeker: Visual Prompting for Instance-Level Video Understanding
A research paper introduces VideoSeeker, a new paradigm for instance-level video understanding that uses visual prompts instead of text prompts. The approach integrates agentic reasoning with video tasks, enabling models to proactively perceive and retrieve relevant video segments. The paper addresses a limitation of current large vision-language models (LVLMs) in precise spatiotemporal localization: text prompts fail to provide accurate spatial and temporal references. VideoSeeker aims to improve user experience by centering reasoning around visual content rather than language. The work is published on arXiv under ID 2605.16079.
Key facts
- VideoSeeker is a novel paradigm for instance-level video understanding via visual prompts.
- It integrates agentic reasoning with instance-level video understanding tasks.
- The approach enables models to proactively perceive and retrieve relevant video segments.
- It addresses challenges in precise spatiotemporal localization at the instance level.
- Existing methods rely on text prompts, which struggle to convey accurate spatial and temporal references.
- Current approaches decouple visual perception from language reasoning.
- The paper is published on arXiv with ID 2605.16079.
Entities
Institutions
- arXiv