CRAFT: New AI Pipeline Achieves Best Performance on Multimodal Video QA Benchmark
Researchers have introduced CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a pipeline for grounded multi-video question answering over real-world news events. The system combines dynamic keyframe selection, per-video automatic speech recognition (ASR) with multilingual fallback, and a hybrid critic loop that checks and corrects claims before they are finalized. It uses UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator. A final citation-merging stage emits each fact once, together with all of its supporting source identifiers. On the MAGMaR 2026 benchmark, CRAFT achieved the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). It was also evaluated on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries. The work targets the problem of retrieving relevant evidence from diverse video archives while attributing each claim to its source.
Key facts
- CRAFT stands for Critic-Refined Adaptive Key-Frame Targeting.
- The pipeline combines dynamic keyframe selection, ASR with multilingual fallback, and a hybrid critic loop.
- It uses UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator.
- The final stage merges citations so each fact is emitted once with all supporting source identifiers.
- On MAGMaR 2026, CRAFT achieved the best overall average (0.739), reference recall (0.810), and citation F1 (0.635).
- Evaluation also included a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries.
- The system is designed for grounded multi-video question answering over real-world news events.
- The paper is published on arXiv with ID 2605.19075.
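The citation-merging stage listed above (each fact emitted once with all supporting source identifiers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_citations`, the `(fact, source_id)` input shape, and the whitespace-and-case normalization used to decide when two facts are the same are all assumptions made for illustration. The summary does not say how CRAFT matches duplicate facts.

```python
def merge_citations(claims):
    """Collapse (fact, source_id) pairs so each fact appears once.

    Hypothetical helper: two facts count as duplicates when they are
    equal after lowercasing and collapsing whitespace. A real system
    would likely need a stronger notion of equivalence.
    """
    merged = {}  # dicts preserve insertion order (Python 3.7+)
    for fact, source_id in claims:
        key = " ".join(fact.lower().split())  # assumed normalization
        entry = merged.setdefault(key, {"fact": fact, "sources": []})
        if source_id not in entry["sources"]:
            entry["sources"].append(source_id)
    return list(merged.values())


# Illustrative input; the facts and source identifiers are made up.
claims = [
    ("The bridge reopened on Monday.", "video_03"),
    ("the bridge reopened  on monday.", "video_07"),
    ("Two lanes remain closed.", "video_03"),
    ("The bridge reopened on Monday.", "video_03"),
]
merged_facts = merge_citations(claims)
for item in merged_facts:
    print(item["fact"], item["sources"])
```

In this sketch the first wording of a fact is kept as its display text, and sources are listed in the order they were first seen.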
Entities
Institutions
- arXiv