DeepVerifier: Self-Evolving Deep Research Agents via Rubric-Guided Verification
This work introduces an approach for Deep Research Agents (DRAs) that emphasizes self-evolution through iterative verification rather than further post-training of the policy. The method, inference-time scaling of verification, lets an agent refine its performance by scoring its own outputs against carefully designed rubrics. These rubrics are derived from an automatically constructed DRA Failure Taxonomy, which organizes agent failures into five major categories and thirteen sub-categories. The resulting system, DeepVerifier, is a rubrics-based outcome reward verifier that exploits the asymmetry of verification: checking an answer against explicit criteria is easier and more reliable than generating that answer in the first place. In meta-evaluation, DeepVerifier outperforms agent-as-judge and LLM-judge baselines by margins of 12% to 48%.
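A minimal sketch of the verify-then-regenerate loop this implies, assuming a hypothetical interface in which the policy and the rubric verifier are plain callables; none of the function, class, or field names below come from the paper:

```python
# Hypothetical sketch of inference-time scaling of verification: the agent
# regenerates its answer until it passes a rubric-based verifier or the
# verification budget is exhausted. All names are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    passed: bool
    failed_rubrics: list[str]  # rubric items the answer violated

def self_evolve(
    task: str,
    generate: Callable[[str, list[str]], str],  # policy: task + feedback -> answer
    verify: Callable[[str, str], Verdict],      # verifier: task + answer -> verdict
    max_rounds: int = 4,                        # verification budget
) -> str:
    feedback: list[str] = []
    answer = generate(task, feedback)
    for _ in range(max_rounds):
        verdict = verify(task, answer)
        if verdict.passed:
            break
        # Feed the violated rubric items back so the next attempt can fix them.
        feedback = verdict.failed_rubrics
        answer = generate(task, feedback)
    return answer

if __name__ == "__main__":
    # Mock policy and verifier just to show the control flow.
    attempts = iter(["draft without citations", "final answer with citations"])

    def mock_generate(task: str, feedback: list[str]) -> str:
        return next(attempts)

    def mock_verify(task: str, answer: str) -> Verdict:
        ok = "citations" in answer and "without" not in answer
        return Verdict(passed=ok, failed_rubrics=[] if ok else ["missing citations"])

    print(self_evolve("survey recent DRA papers", mock_generate, mock_verify))
```

Spending more rounds in this loop is what "inference-time scaling of verification" refers to: extra compute goes into checking and revising rather than into retraining the policy.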
Key facts
- DeepVerifier uses rubrics-based outcome reward verification.
- Rubrics are derived from a DRA Failure Taxonomy with 5 major and 13 sub-categories (see the data-shape sketch after this list).
- Inference-time scaling of verification allows self-evolution.
- DeepVerifier outperforms baselines by 12%-48% in meta-evaluation.
- The approach is an alternative to post-training policy enhancement.
- The agent self-improves by evaluating its generated answers.
- The taxonomy is automatically constructed.
- The system leverages asymmetry of verification.
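The summary does not name the taxonomy's actual categories, so the sketch below only illustrates a plausible data shape for deriving a verifier checklist from a failure taxonomy; every category and rubric string is a hypothetical placeholder, not content from the paper:

```python
# Illustrative shape for a failure taxonomy that induces verification rubrics.
# The real taxonomy has 5 major and 13 sub-categories; the names here are
# hypothetical stand-ins.
from dataclasses import dataclass, field

@dataclass
class FailureMode:
    name: str
    rubric: str  # yes/no check the verifier asks about the final answer

@dataclass
class FailureCategory:
    name: str
    sub_modes: list[FailureMode] = field(default_factory=list)

TAXONOMY = [
    FailureCategory(
        name="evidence handling (hypothetical)",
        sub_modes=[
            FailureMode("unsupported claim", "Is every claim backed by a cited source?"),
            FailureMode("stale source", "Are the cited sources recent enough for the question?"),
        ],
    ),
    FailureCategory(
        name="task adherence (hypothetical)",
        sub_modes=[
            FailureMode("scope drift", "Does the answer address the exact question asked?"),
        ],
    ),
]

def rubrics(taxonomy: list[FailureCategory]) -> list[str]:
    """Flatten the taxonomy into the checklist the verifier scores against."""
    return [mode.rubric for category in taxonomy for mode in category.sub_modes]

if __name__ == "__main__":
    print(rubrics(TAXONOMY))
```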