UniDoc-RL Framework Advances Visual RAG with Hierarchical Actions and Dense Rewards
UniDoc-RL is a unified reinforcement learning framework that addresses the shortcomings of current visual Retrieval-Augmented Generation (RAG) systems. It formulates visual information acquisition as a sequential decision-making problem over a hierarchical action space, enabling Large Vision-Language Models (LVLMs) to jointly perform retrieval, reranking, active visual perception, and reasoning. The framework progressively refines visual evidence, from coarse document retrieval to precise image selection and active region cropping, so the model can filter out irrelevant content while attending to information-dense regions. To make end-to-end training effective, a dense multi-reward scheme provides task-specific feedback at each stage.
The method is detailed in arXiv preprint 2604.14967v2, announced as a replace-cross submission. The work targets a key limitation of existing systems: generic retrieval signals overlook the fine-grained visual semantics required for complex reasoning. By integrating external visual knowledge into LVLMs through this unified framework, UniDoc-RL aims to improve performance on tasks that demand nuanced visual understanding.
Key facts
- UniDoc-RL is a unified reinforcement learning framework for visual RAG
- It formulates visual information acquisition as sequential decision-making
- Uses hierarchical action space for progressive evidence refinement
- Enables joint retrieval, reranking, active perception, and reasoning
- Progresses from document retrieval to image selection to region cropping
- Introduces dense multi-reward scheme for end-to-end training
- Addresses limitations of generic retrieval signals in existing systems
- Detailed in arXiv preprint 2604.14967v2, announced as a replace-cross submission
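The dense multi-reward scheme noted above can be sketched as a weighted sum of per-stage rewards, so the policy receives feedback at every step rather than only at the end. The reward terms and weights below are illustrative assumptions, not values from the paper.

```python
# Hypothetical dense multi-reward: each stage of the pipeline
# (document retrieval, image selection, region cropping, answering)
# contributes its own task-specific reward term in [0, 1].
def dense_reward(doc_hit: float, image_hit: float,
                 crop_iou: float, answer_correct: float,
                 weights=(0.2, 0.2, 0.2, 0.4)) -> float:
    """Weighted sum of per-stage rewards."""
    terms = (doc_hit, image_hit, crop_iou, answer_correct)
    return sum(w * t for w, t in zip(weights, terms))

# Example: correct document and image, partial crop overlap, correct answer.
r = dense_reward(doc_hit=1.0, image_hit=1.0, crop_iou=0.6, answer_correct=1.0)
# r = 0.2 + 0.2 + 0.12 + 0.4 = 0.92
```

Stage-wise terms like these give denser training signal than a single end-of-episode reward, which is the motivation the summary attributes to the scheme.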