MED-VRAG: Iterative Multimodal RAG for Medical QA
MED-VRAG is an iterative multimodal RAG framework for medical QA that retrieves and reasons directly over images of PMC document pages rather than OCR-extracted text. Retrieval combines ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to roughly 350,000 pages while a coarse-to-fine index keeps stage-1 retrieval under 30 ms. A vision-language model then iteratively refines the query and accumulates evidence over up to three reasoning rounds (about 15.9 s per round, 47.8 s total on 4xA100). The system is evaluated on four medical QA benchmarks: MedQA, MedMCQA, PubMedQA, and MMLU-M.
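The retrieve-filter-reason loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`retrieve_pages`, `filter_pages`, `vlm_answer`) and the stopping heuristic are hypothetical stand-ins for the coarse-to-fine retriever, the MapReduce LLM filter, and the VLM reasoning step.

```python
# Hypothetical sketch of an iterative multimodal RAG loop in the style of
# MED-VRAG. All functions below are illustrative stubs, not the real system.

def retrieve_pages(query, k=20):
    # Stage 1: fast coarse-to-fine retrieval over page-image embeddings.
    return [f"page_{i}" for i in range(k)]

def filter_pages(query, pages):
    # Stage 2: sharded MapReduce-style LLM filter keeps only relevant pages.
    return pages[: len(pages) // 2]

def vlm_answer(query, evidence):
    # VLM reasons over the retrieved page images; returns an answer once it
    # has enough evidence, plus a refined query for the next round.
    refined_query = query + " (refined)"
    answer = "B" if len(evidence) >= 25 else None  # toy stopping rule
    return answer, refined_query

def med_vrag(question, max_rounds=3):
    query, evidence, answer = question, [], None
    for _ in range(max_rounds):
        pages = filter_pages(query, retrieve_pages(query))
        evidence.extend(pages)
        answer, query = vlm_answer(query, evidence)
        if answer is not None:  # stop early once the VLM commits to an answer
            break
    return answer
```

Each round adds filtered evidence and rewrites the query, which is what lets the system recover from a poor initial retrieval within its three-round budget.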
Key facts
- MED-VRAG is an iterative multimodal RAG framework.
- It retrieves and reasons over PMC document page images.
- It uses ColQwen2.5 patch-level page embeddings.
- It employs a sharded MapReduce LLM filter.
- Scales to ~350K pages.
- Stage-1 retrieval under 30 ms via coarse-to-fine index.
- VLM iteratively refines query across up to 3 rounds.
- Evaluated on MedQA, MedMCQA, PubMedQA, MMLU-M.
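Patch-level page embeddings such as ColQwen2.5's are typically scored with late interaction (ColBERT/ColPali-style MaxSim): each query-token embedding is matched to its most similar page patch, and the per-token maxima are summed. The sketch below shows that scoring rule under assumed shapes; the dimensions and the NumPy implementation are illustrative, not taken from the paper.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction MaxSim score.

    query_emb: (Q, d) array of query-token embeddings.
    page_emb:  (P, d) array of page-patch embeddings.
    Each query token takes its best-matching patch; scores are summed.
    """
    sims = query_emb @ page_emb.T          # (Q, P) dot-product similarities
    return float(sims.max(axis=1).sum())   # best patch per token, then sum

# Toy ranking example with random embeddings (dimensions are made up).
rng = np.random.default_rng(0)
query = rng.standard_normal((8, 128))                       # 8 query tokens
pages = [rng.standard_normal((64, 128)) for _ in range(3)]  # 64 patches/page
best_page = max(range(len(pages)), key=lambda i: maxsim_score(query, pages[i]))
```

A coarse-to-fine index would first shortlist candidate pages cheaply (e.g. with pooled single-vector embeddings) and apply this exact MaxSim scoring only to the shortlist, which is one plausible way to keep stage-1 latency under 30 ms at ~350K pages.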
Entities
Institutions
- arXiv