Chain of Evidence: Visual Attribution for Iterative RAG
Researchers propose Chain of Evidence (CoE), a visual attribution framework for Iterative Retrieval-Augmented Generation (iRAG) that uses Vision-Language Models to reason directly over screenshots of retrieved documents. CoE addresses two shortcomings of existing pipelines: coarse-grained text-level citations and the loss of visual semantics when parsing visually rich documents such as slides and PDFs. Instead of format-specific parsing, it outputs precise bounding boxes that localize supporting evidence on the page. The system is retriever-agnostic and aims to improve multi-hop question answering by preserving spatial logic and layout cues.
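The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `retrieve_screenshot` and `vlm_attribute` are hypothetical stand-ins for a retriever and a Vision-Language Model, and the `Evidence` record with its bounding box is an assumed shape for CoE-style output.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    page_id: str
    bbox: tuple   # (x0, y0, x1, y1) region localized on the screenshot
    snippet: str  # text read from that region

def retrieve_screenshot(query: str, hop: int) -> str:
    # Stand-in for any retriever returning a page screenshot
    # (retriever-agnostic: dense, sparse, or hybrid would all fit here).
    return f"page-{hop}"

def vlm_attribute(query: str, page_id: str) -> Evidence:
    # Stand-in for a VLM that localizes the supporting region as a
    # bounding box rather than a coarse text-level citation.
    return Evidence(page_id, (40, 120, 560, 180), f"evidence for '{query}'")

def chain_of_evidence(question: str, max_hops: int = 3) -> list[Evidence]:
    """Iterate retrieval + visual attribution, collecting one piece of
    box-grounded evidence per hop for a multi-hop question."""
    chain, query = [], question
    for hop in range(max_hops):
        page = retrieve_screenshot(query, hop)
        ev = vlm_attribute(query, page)
        chain.append(ev)
        query = ev.snippet  # refine the query for the next hop
    return chain
```

With real components swapped in, each hop would ground its answer fragment in a screenshot region, so the final answer carries a chain of visually verifiable citations.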
Key facts
- Chain of Evidence (CoE) is a visual attribution framework for iRAG.
- CoE uses Vision-Language Models to reason over document screenshots.
- It addresses the problems of coarse-grained text citations and visual semantic loss.
- CoE outputs precise bounding boxes for evidence.
- It is retriever-agnostic and eliminates format-specific parsing.
- The framework targets multi-hop question answering.
- CoE preserves spatial logic and layout cues from visually rich documents.
- The research is published on arXiv with ID 2605.01284.
Entities
Institutions
- arXiv