UniDoc-RL Framework Advances Visual RAG with Hierarchical Actions and Dense Rewards
UniDoc-RL is a unified reinforcement learning framework that addresses the shortcomings of current visual Retrieval-Augmented Generation (RAG) systems. It formulates visual information acquisition as a sequential decision-making problem over a hierarchical action space, enabling Large Vision-Language Models (LVLMs) to jointly perform retrieval, reranking, active visual perception, and reasoning. The framework progressively refines visual evidence, from coarse document retrieval to precise image selection and active region cropping, so the model can filter out irrelevant content while attending to information-dense regions. To make end-to-end training effective, a dense multi-reward scheme provides task-specific feedback at each stage.
The method is detailed in arXiv preprint 2604.14967v2, announced as a replace-cross submission. The work targets a key limitation of existing systems: generic retrieval signals overlook the fine-grained visual semantics required for complex reasoning. By integrating external visual knowledge into LVLMs through this unified framework, UniDoc-RL aims to improve performance on tasks that demand nuanced visual understanding.
Key facts
- UniDoc-RL is a unified reinforcement learning framework for visual RAG
- It formulates visual information acquisition as sequential decision-making
- Uses hierarchical action space for progressive evidence refinement
- Enables joint retrieval, reranking, active perception, and reasoning
- Progresses from document retrieval to image selection to region cropping
- Introduces dense multi-reward scheme for end-to-end training
- Addresses limitations of generic retrieval signals in existing systems
- Detailed in arXiv preprint 2604.14967v2, announced as a replace-cross submission
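The dense multi-reward scheme noted above can be sketched as a weighted sum of per-stage rewards, so the policy receives feedback at every step rather than only at the end. The reward terms and weights below are illustrative assumptions, not values from the paper.

```python
# Hypothetical dense multi-reward: each stage of the pipeline
# (document retrieval, image selection, region cropping, answering)
# contributes its own task-specific reward term in [0, 1].
def dense_reward(doc_hit: float, image_hit: float,
                 crop_iou: float, answer_correct: float,
                 weights=(0.2, 0.2, 0.2, 0.4)) -> float:
    """Weighted sum of per-stage rewards."""
    terms = (doc_hit, image_hit, crop_iou, answer_correct)
    return sum(w * t for w, t in zip(weights, terms))

# Example: correct document and image, partial crop overlap, correct answer.
r = dense_reward(doc_hit=1.0, image_hit=1.0, crop_iou=0.6, answer_correct=1.0)
# r = 0.2 + 0.2 + 0.12 + 0.4 = 0.92
```

Stage-wise terms like these give denser training signal than a single end-of-episode reward, which is the motivation the summary attributes to the scheme.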