Chain of Evidence: Visual Attribution for Iterative RAG
Researchers propose Chain of Evidence (CoE), a visual attribution framework for Iterative Retrieval-Augmented Generation (iRAG) that uses Vision-Language Models to reason directly over screenshots of retrieved documents. CoE addresses two shortcomings of existing pipelines: coarse-grained text-level citations and the loss of visual semantics when parsing visually rich documents such as slides and PDFs. Instead of format-specific parsing, it outputs precise bounding boxes that localize supporting evidence on the page. The system is retriever-agnostic and aims to improve multi-hop question answering by preserving spatial logic and layout cues.
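The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `retrieve_screenshot` and `vlm_attribute` are hypothetical stand-ins for a retriever and a Vision-Language Model, and the `Evidence` record with its bounding box is an assumed shape for CoE-style output.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    page_id: str
    bbox: tuple   # (x0, y0, x1, y1) region localized on the screenshot
    snippet: str  # text read from that region

def retrieve_screenshot(query: str, hop: int) -> str:
    # Stand-in for any retriever returning a page screenshot
    # (retriever-agnostic: dense, sparse, or hybrid would all fit here).
    return f"page-{hop}"

def vlm_attribute(query: str, page_id: str) -> Evidence:
    # Stand-in for a VLM that localizes the supporting region as a
    # bounding box rather than a coarse text-level citation.
    return Evidence(page_id, (40, 120, 560, 180), f"evidence for '{query}'")

def chain_of_evidence(question: str, max_hops: int = 3) -> list[Evidence]:
    """Iterate retrieval + visual attribution, collecting one piece of
    box-grounded evidence per hop for a multi-hop question."""
    chain, query = [], question
    for hop in range(max_hops):
        page = retrieve_screenshot(query, hop)
        ev = vlm_attribute(query, page)
        chain.append(ev)
        query = ev.snippet  # refine the query for the next hop
    return chain
```

With real components swapped in, each hop would ground its answer fragment in a screenshot region, so the final answer carries a chain of visually verifiable citations.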
Key facts
- Chain of Evidence (CoE) is a visual attribution framework for iRAG.
- CoE uses Vision-Language Models to reason over document screenshots.
- It addresses the problems of coarse-grained text citations and visual semantic loss.
- CoE outputs precise bounding boxes for evidence.
- It is retriever-agnostic and eliminates format-specific parsing.
- The framework targets multi-hop question answering.
- CoE preserves spatial logic and layout cues from visually rich documents.
- The research is published on arXiv with ID 2605.01284.
Entities
Institutions
- arXiv