SIRA: Training-Free Method to Reduce LVLM Hallucinations
Researchers have introduced SIRA (Shared-Prefix Internal Reconstruction of Attribution), a training-free internal contrastive decoding framework for reducing hallucinations in large vision-language models (LVLMs). Existing contrastive decoding techniques compare predictions on the original image with predictions on externally perturbed visual inputs, which can introduce off-manifold artifacts and requires expensive extra forward passes. SIRA instead constructs a counterfactual reference inside the same LVLM: it exploits the staged information flow of multimodal transformers, letting image and text tokens interact through a shared prefix to form an aligned multimodal state before forking a counterfactual branch in the later layers. The method is detailed in arXiv paper 2605.14621.
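To make the contrastive decoding step concrete, the sketch below combines next-token logits from the original and counterfactual branches, penalizing tokens the counterfactual branch favors. This is a minimal sketch following standard contrastive decoding conventions; the combination rule, the adaptive plausibility cutoff, and the `alpha`/`beta` hyperparameters are assumptions for illustration, not values or formulas taken from the SIRA paper.

```python
import torch
import torch.nn.functional as F

def contrastive_decode_step(logits_orig: torch.Tensor,
                            logits_cf: torch.Tensor,
                            alpha: float = 1.0,
                            beta: float = 0.1) -> torch.Tensor:
    """Combine original and counterfactual next-token logits.

    Tokens that the counterfactual branch also scores highly are likely
    ungrounded, so their scores are pushed down. `alpha` (contrast
    strength) and `beta` (plausibility cutoff) are illustrative.
    """
    log_p = F.log_softmax(logits_orig, dim=-1)
    log_q = F.log_softmax(logits_cf, dim=-1)
    # Keep only tokens whose original probability is within a factor
    # `beta` of the best token (standard adaptive plausibility mask,
    # assumed here rather than confirmed for SIRA).
    cutoff = log_p.max(dim=-1, keepdim=True).values + torch.log(torch.tensor(beta))
    scores = (1 + alpha) * log_p - alpha * log_q
    return scores.masked_fill(log_p < cutoff, float("-inf"))

# Usage: pick the next token from the contrasted scores.
vocab_size = 32000
logits_orig = torch.randn(1, vocab_size)
logits_cf = torch.randn(1, vocab_size)
next_token = contrastive_decode_step(logits_orig, logits_cf).argmax(dim=-1)
```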
Key facts
- SIRA is a training-free internal contrastive decoding framework.
- It mitigates hallucinations in LVLMs without external perturbations.
- It uses a shared prefix to form an aligned multimodal state.
- It forks a counterfactual branch in the later transformer layers (see the sketch after this list).
- The method avoids off-manifold artifacts and extra forward passes.
- The paper is available on arXiv with ID 2605.14621.
- The approach exploits staged information flow in multimodal transformers.
- SIRA preserves prompt interpretation, decoding history, positional structure, and early visual grounding.
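The shared-prefix fork can be sketched as follows: the early layers run once over the aligned multimodal state (preserving prompt interpretation, decoding history, positional structure, and early visual grounding), and only the later layers run twice, once on the untouched hidden states and once on a counterfactual copy. The intervention shown here (damping image-token hidden states at the fork point) is a hypothetical placeholder for SIRA's actual counterfactual construction, and all names (`SharedPrefixFork`, `fork_at`, `visual_mask`) are illustrative.

```python
import torch
import torch.nn as nn

class SharedPrefixFork(nn.Module):
    """Run a shared early stack once, then fork two branches through
    the later layers. The counterfactual intervention is a placeholder."""

    def __init__(self, layers: nn.ModuleList, fork_at: int):
        super().__init__()
        self.early = layers[:fork_at]   # shared prefix: run once
        self.late = layers[fork_at:]    # forked: run twice

    def forward(self, hidden: torch.Tensor, visual_mask: torch.Tensor):
        for layer in self.early:
            hidden = layer(hidden)
        h_orig = hidden
        # Hypothetical counterfactual: zero out hidden states at
        # image-token positions before the late layers.
        h_cf = hidden * (~visual_mask).unsqueeze(-1)
        for layer in self.late:
            h_orig = layer(h_orig)
            h_cf = layer(h_cf)
        return h_orig, h_cf

# Usage with a toy 8-layer stack: 4 image tokens followed by 6 text tokens.
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(64, 4, batch_first=True) for _ in range(8)
)
model = SharedPrefixFork(layers, fork_at=5)
x = torch.randn(1, 10, 64)
mask = torch.tensor([[True] * 4 + [False] * 6])
h_orig, h_cf = model(x, visual_mask=mask)
```

Because the early layers are shared, the second branch costs only the later portion of the stack, which is how an internal fork avoids the full extra forward pass that external-perturbation methods require.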
Entities
Institutions
- arXiv