ARTFEED — Contemporary Art Intelligence

DualFact+ Framework for Multimodal Fact Verification in Video Captioning

ai-technology · 2026-04-30

Researchers have introduced DualFact+, an evaluation framework that assesses factuality in procedural video captioning through a dual-layer approach. The framework distinguishes conceptual facts, abstract semantic roles such as Action, Ingredient, Tool, and Location, from contextual facts, which are grounded predicate-argument realizations in the video. It employs implicit argument augmentation (VIA) and contrastive fact sets for thorough evaluation, and operates in two modes: DualFact-T, which relies on textual evidence, and DualFact-V, which uses video-grounded visual evidence. Experiments on the YouCook3-Fact and CraftBench-Fact datasets show that leading multimodal language models generate fluent yet factually deficient captions, with systematic role-level omissions and inconsistencies. Notably, DualFact+ correlates more strongly with human factuality judgments than traditional metrics.
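The dual-layer split can be sketched in code. The data structures and scoring function below are purely illustrative assumptions, not the paper's actual representation or API; they only show how conceptual facts (role-value pairs) differ from contextual facts (predicate-argument tuples tied to a video segment), and how role-level recall could expose the omissions the study reports.

```python
from dataclasses import dataclass

# Hypothetical representations for illustration only; DualFact+'s
# internal data structures are not described here.

@dataclass(frozen=True)
class ConceptualFact:
    """Abstract semantic role filled by a concept, e.g. Action='slice'."""
    role: str    # one of: Action, Ingredient, Tool, Location
    value: str

@dataclass(frozen=True)
class ContextualFact:
    """Grounded predicate-argument realization tied to a video segment."""
    predicate: str
    args: tuple      # (role, value) pairs as realized on screen
    segment: tuple   # (start_sec, end_sec) of supporting visual evidence

def conceptual_recall(caption_facts, reference_facts):
    """Fraction of reference role-level facts the caption recovers."""
    if not reference_facts:
        return 1.0
    hits = sum(1 for f in reference_facts if f in caption_facts)
    return hits / len(reference_facts)

# A caption that omits the Tool role scores 2/3 on this toy example.
reference = {
    ConceptualFact("Action", "slice"),
    ConceptualFact("Ingredient", "onion"),
    ConceptualFact("Tool", "knife"),
}
caption = {
    ConceptualFact("Action", "slice"),
    ConceptualFact("Ingredient", "onion"),
}
print(conceptual_recall(caption, reference))  # → 0.666...
```

A symmetric precision over the caption's own facts would flag hallucinated roles rather than omitted ones; the contextual layer would additionally require each fact's video segment to support it.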

Key facts

  • DualFact+ is a dual-layer multimodal factuality evaluation framework for procedural video captioning.
  • It separates factual correctness into conceptual facts and contextual facts.
  • Conceptual facts capture abstract semantic roles: Action, Ingredient, Tool, Location.
  • Contextual facts capture grounded predicate-argument realizations in video.
  • The framework includes implicit argument augmentation (VIA) and contrastive fact sets.
  • DualFact+ has two modes: DualFact-T (textual evidence) and DualFact-V (video-grounded visual evidence).
  • Experiments used YouCook3-Fact and CraftBench-Fact datasets.
  • State-of-the-art multimodal language models produce fluent but factually incomplete captions.
  • DualFact+ correlates more strongly with human factuality judgments than standard metrics.
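The contrastive-fact-set idea listed above can also be sketched: a scorer passes a contrastive check only if it ranks the true fact above every perturbed variant. The scorer and evidence string below are hypothetical stand-ins, assumed for illustration, not the framework's actual scoring function.

```python
# Illustrative sketch of a contrastive fact-set check: the true fact
# must outscore all perturbations built by swapping one argument.

def contrastive_accuracy(score, true_fact, perturbed_facts):
    """1.0 if the scorer ranks the true fact above every perturbation."""
    s_true = score(true_fact)
    return float(all(s_true > score(p) for p in perturbed_facts))

# Toy scorer: word overlap with an (assumed) evidence transcript.
evidence_words = set("slice the onion with a knife".split())
score = lambda fact: sum(w in evidence_words for w in fact.split())

result = contrastive_accuracy(
    score,
    "slice onion knife",                       # true fact
    ["slice onion fork", "dice onion knife"],  # role-swapped variants
)
print(result)  # → 1.0
```

A real instantiation would replace the word-overlap scorer with model-based evidence matching; DualFact-T would draw evidence from text, DualFact-V from the video itself.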

Entities

Institutions

  • arXiv

Sources