Visual RAG Aggregation Loses Key Financial Document Details
A study on arXiv (2605.14581) investigates whether aggregating vision patch tokens into a single vector causes information loss when Visual RAG is applied to financial documents. The researchers built a diagnostic benchmark in which minor digit changes produce semantic shifts. Their experiments show that single-vector aggregation collapses distinct documents into nearly identical vectors, while patch-level detection preserves the changes. The study identifies global texture dominance as the root cause. The findings are consistent across model scales and retrieval-optimized embeddings.
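The collapse described above can be illustrated with a toy sketch. It does not reproduce the study's benchmark, models, or numbers. It assumes, purely for illustration, that every patch vector of a page is a shared "texture" vector plus small noise, that a digit edit replaces exactly one patch, and that aggregation is plain mean pooling. The sizes (256 patches, 32 dimensions) and all function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pool(patches):
    """Aggregate patch vectors into one vector by averaging each dimension."""
    n = len(patches)
    return [sum(column) / n for column in zip(*patches)]


def min_patch_similarity(patches_a, patches_b):
    """Lowest cosine similarity over position-aligned patch pairs."""
    return min(cosine(a, b) for a, b in zip(patches_a, patches_b))


def make_document(rng, num_patches=256, dim=32):
    """Toy page: every patch is one shared texture vector plus small noise.

    This is an assumption made for the sketch, not a model of a real encoder.
    """
    texture = [rng.gauss(0, 1) for _ in range(dim)]
    return [[t + 0.1 * rng.gauss(0, 1) for t in texture]
            for _ in range(num_patches)]


rng = random.Random(0)
doc_a = make_document(rng)

# Simulate a digit edit: copy the page and replace a single patch vector.
doc_b = [list(patch) for patch in doc_a]
doc_b[17] = [rng.gauss(0, 1) for _ in range(32)]

pooled_sim = cosine(mean_pool(doc_a), mean_pool(doc_b))
patch_sim = min_patch_similarity(doc_a, doc_b)

print(f"pooled cosine similarity:        {pooled_sim:.6f}")
print(f"lowest per-patch cosine similarity: {patch_sim:.6f}")
```

Under these assumptions the pooled vectors stay almost identical because one changed patch contributes only 1/256 of the average, while the aligned patch-level comparison exposes the edited patch directly.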
Key facts
- Study on arXiv:2605.14581 examines aggregation strategies for Visual RAG in financial documents.
- Visual RAG treats documents as images and uses vision encoders to obtain patch tokens.
- Hundreds of patch tokens per document create retrieval and storage challenges.
- Single-vector aggregation collapses different documents into almost identical vectors.
- Patch-level detection preserves semantic changes from minor digit alterations.
- Global texture dominance is the root cause of information loss.
- Findings are consistent across model scales and retrieval-optimized embeddings.
- The study proposes a diagnostic benchmark for financial document retrieval.
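The storage side of the trade-off listed above is simple arithmetic, sketched below. The corpus size, patch count, embedding dimension, and float width are illustrative assumptions, not figures from the study, and `index_bytes` is a hypothetical helper.

```python
def index_bytes(num_docs, vectors_per_doc, dim, bytes_per_value=4):
    """Raw size of an embedding index, ignoring metadata and compression."""
    return num_docs * vectors_per_doc * dim * bytes_per_value


# Assumed values for illustration only.
NUM_DOCS = 1_000
PATCHES_PER_DOC = 500   # "hundreds of patch tokens per document"
DIM = 128

single_vector = index_bytes(NUM_DOCS, 1, DIM)
patch_level = index_bytes(NUM_DOCS, PATCHES_PER_DOC, DIM)

print(f"single-vector index: {single_vector / 1e6:.3f} MB")
print(f"patch-level index:   {patch_level / 1e6:.3f} MB")
print(f"ratio:               {patch_level // single_vector}x")
```

The ratio equals the number of vectors kept per document, which is why single-vector aggregation is attractive despite the information loss the study reports.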
Entities
Institutions
- arXiv