SIEVES: Visual Evidence Scoring Boosts MLLM Selective Prediction
SIEVES (Selective Prediction through Visual Evidence Scoring) is a recently introduced technique that improves the reliability of multimodal large language models (MLLMs) in out-of-distribution (OOD) settings. The method requires reasoner models to produce localized visual evidence alongside their answers, while a selector is trained to estimate the quality of that localization. By scoring confidence and abstaining on low-confidence queries, SIEVES improves coverage by up to three times on challenging OOD benchmarks while respecting user-defined risk constraints. The paper is available on arXiv under ID 2604.25855.
Key facts
- SIEVES stands for Selective Prediction through Visual Evidence Scoring
- Method improves coverage by up to three times on OOD benchmarks
- Requires reasoner models to produce localized visual evidence
- Selector learns to estimate quality of localization
- Targets reliable deployment in real-world out-of-distribution scenarios
- Paper available on arXiv with ID 2604.25855
- Addresses selective prediction for MLLMs
- Uses confidence scoring and abstention mechanism
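The confidence-scoring-plus-abstention mechanism listed above can be sketched as a risk-controlled threshold search: given selector scores on a labeled calibration set, pick the lowest threshold whose accepted answers stay within the user's risk (error-rate) budget, then abstain below that threshold at deployment. This is a minimal illustrative sketch, assuming a held-out calibration set and a greedy thresholding rule; it is not the paper's actual algorithm, and all names and data here are hypothetical.

```python
import numpy as np

def calibrate_threshold(scores, correct, risk_target):
    """Find the lowest selector-score threshold whose accepted set keeps
    empirical risk (error rate among answered queries) <= risk_target.

    scores: selector confidence per calibration example (higher = more trusted)
    correct: whether the reasoner's answer was correct on each example
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    order = np.argsort(-scores)            # accept highest-confidence first
    errors = np.cumsum(~correct[order])    # errors accumulated as we accept more
    n_accepted = np.arange(1, len(scores) + 1)
    risk = errors / n_accepted             # empirical risk of each prefix
    feasible = np.where(risk <= risk_target)[0]
    if len(feasible) == 0:
        return np.inf                      # no threshold meets the budget: abstain on all
    k = feasible[-1]                       # largest accepted set within budget
    return scores[order][k]

def should_answer(score, threshold):
    """Answer if the selector score clears the calibrated threshold, else abstain."""
    return score >= threshold

# Toy calibration data (hypothetical selector scores and correctness labels).
threshold = calibrate_threshold(
    scores=[0.9, 0.8, 0.7, 0.6, 0.5],
    correct=[True, True, False, True, False],
    risk_target=0.25,
)
```

With these toy numbers the search accepts the four highest-scoring examples (1 error in 4 answers, risk 0.25), so queries scoring at or above the calibrated threshold are answered and lower-scoring ones are abstained on; coverage here is 4/5.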
Entities
Institutions
- arXiv