SIEVES: Visual Evidence Scoring Boosts MLLM Selective Prediction
SIEVES (Selective Prediction through Visual Evidence Scoring) is a recently introduced technique that improves the reliability of multimodal large language models (MLLMs) in out-of-distribution (OOD) settings. The method requires reasoner models to produce localized visual evidence alongside their answers, while a selector is trained to estimate the quality of that localization. By scoring confidence and abstaining on low-confidence queries, SIEVES improves coverage by up to three times on challenging OOD benchmarks while respecting user-defined risk constraints. The paper is available on arXiv under ID 2604.25855.
Key facts
- SIEVES stands for Selective Prediction through Visual Evidence Scoring
- Method improves coverage by up to three times on OOD benchmarks
- Requires reasoner models to produce localized visual evidence
- Selector learns to estimate quality of localization
- Targets reliable deployment in real-world out-of-distribution scenarios
- Paper available on arXiv with ID 2604.25855
- Addresses selective prediction for MLLMs
- Uses confidence scoring and abstention mechanism
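The confidence-scoring-plus-abstention mechanism listed above can be sketched as a risk-controlled threshold search: given selector scores on a labeled calibration set, pick the lowest threshold whose accepted answers stay within the user's risk (error-rate) budget, then abstain below that threshold at deployment. This is a minimal illustrative sketch, assuming a held-out calibration set and a greedy thresholding rule; it is not the paper's actual algorithm, and all names and data here are hypothetical.

```python
import numpy as np

def calibrate_threshold(scores, correct, risk_target):
    """Find the lowest selector-score threshold whose accepted set keeps
    empirical risk (error rate among answered queries) <= risk_target.

    scores: selector confidence per calibration example (higher = more trusted)
    correct: whether the reasoner's answer was correct on each example
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    order = np.argsort(-scores)            # accept highest-confidence first
    errors = np.cumsum(~correct[order])    # errors accumulated as we accept more
    n_accepted = np.arange(1, len(scores) + 1)
    risk = errors / n_accepted             # empirical risk of each prefix
    feasible = np.where(risk <= risk_target)[0]
    if len(feasible) == 0:
        return np.inf                      # no threshold meets the budget: abstain on all
    k = feasible[-1]                       # largest accepted set within budget
    return scores[order][k]

def should_answer(score, threshold):
    """Answer if the selector score clears the calibrated threshold, else abstain."""
    return score >= threshold

# Toy calibration data (hypothetical selector scores and correctness labels).
threshold = calibrate_threshold(
    scores=[0.9, 0.8, 0.7, 0.6, 0.5],
    correct=[True, True, False, True, False],
    risk_target=0.25,
)
```

With these toy numbers the search accepts the four highest-scoring examples (1 error in 4 answers, risk 0.25), so queries scoring at or above the calibrated threshold are answered and lower-scoring ones are abstained on; coverage here is 4/5.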
Entities
Institutions
- arXiv