SAVER: Selective Vision-as-Needed Framework for Multimodal IE

ai-technology · 2026-05-22

Researchers have introduced SAVER, a selective vision-as-needed framework for multimodal named entity recognition (MNER) and multimodal relation extraction (MRE) on social media. It addresses posts that carry several images, some of which may be weakly related, redundant, or misleading, so that always-on multimodal fusion wastes computation and can amplify spurious cues. SAVER uses a Conformal Groundability Gate (CGG) that estimates span-level visual groundability for MNER and derives pair-level activation from the two marked entities for MRE. The activation threshold is calibrated on a held-out split with a conformal procedure based on Clopper-Pearson upper bounds. When the gate activates, a submodular relevance-diversity selector picks a small subset of images that provide reliable evidence.
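
The threshold calibration can be illustrated with a minimal sketch. It is not the authors' procedure: the summary does not say which risk is bounded, so the code assumes each held-out item carries a gate score and a binary `harmful` label (1 if using vision on that item would hurt). The names `clopper_pearson_upper`, `calibrate_threshold`, `target_risk`, and `alpha` are all hypothetical.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k: int, n: int, alpha: float) -> float:
    """One-sided (1 - alpha) Clopper-Pearson upper bound on a binomial rate.

    The bound is the p that solves P(X <= k) = alpha, found by bisection
    because the CDF decreases monotonically in p.
    """
    if k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid  # p too small: seeing <= k errors is still too likely
        else:
            hi = mid
    return hi


def calibrate_threshold(scores, harmful, target_risk, alpha):
    """Lowest gate threshold whose activated held-out items pass the bound.

    Assumption: an item is "activated" when its score >= threshold, and the
    bounded quantity is the rate of harmful activations. A rigorous procedure
    would also correct for testing several thresholds; this sketch does not.
    Returns None if no threshold passes.
    """
    for t in sorted(set(scores)):
        active = [h for s, h in zip(scores, harmful) if s >= t]
        if clopper_pearson_upper(sum(active), len(active), alpha) <= target_risk:
            return t
    return None
```

A lower threshold activates vision more often, so returning the lowest passing value keeps as much visual evidence as the bound allows.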

Key facts

SAVER is a selective vision-as-needed framework for multimodal IE.
It targets MNER and MRE in social media.
Multiple images per post can be weakly related, redundant, or misleading.
Always-on multimodal fusion wastes computation and amplifies spurious cues.
CGG estimates span-level visual groundability in MNER.
CGG derives pair-level activation in MRE from two marked entities.
The activation threshold is calibrated on a held-out split via a conformal procedure with Clopper-Pearson upper bounds.
When the gate activates, a submodular relevance-diversity selector chooses a small subset of images.
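
The image selector in the last item can be sketched as a greedy maximization of a monotone submodular objective. The summary does not give SAVER's objective, so this example assumes per-image relevance plus a facility-location coverage term over pairwise image similarity. The function `greedy_select` and its parameters `lam` and `min_gain` are hypothetical, and all relevance and similarity values are assumed to be non-negative.

```python
def greedy_select(relevance, similarity, budget, lam=1.0, min_gain=1e-9):
    """Greedily pick up to `budget` image indices.

    Assumed objective (not taken from the paper):
        F(S) = sum_{i in S} relevance[i]
               + lam * sum_j max_{i in S} similarity[i][j]
    The second term rewards covering images not yet represented, so a
    near-duplicate of an already selected image adds little.
    """
    n = len(relevance)
    selected = []
    coverage = [0.0] * n  # best similarity of each image to the selected set
    while len(selected) < budget:
        best_i, best_gain = None, min_gain
        for i in range(n):
            if i in selected:
                continue
            cover_gain = sum(
                max(similarity[i][j] - coverage[j], 0.0) for j in range(n)
            )
            gain = relevance[i] + lam * cover_gain
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # no remaining image adds meaningful value
            break
        selected.append(best_i)
        coverage = [max(coverage[j], similarity[best_i][j]) for j in range(n)]
    return selected


# Images 0 and 1 are near-duplicates; image 2 is less relevant but distinct.
relevance = [0.9, 0.85, 0.5]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
picked = greedy_select(relevance, similarity, budget=2)
```

In this toy input, ranking by relevance alone would keep both near-duplicates, whereas the coverage term favors the distinct image on the second pick.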

Sources

arXiv cs.AI — 2026-05-21