EnterpriseDocBench: A New Benchmark for Multimodal Document AI Pipelines

ai-technology · 2026-04-30

Researchers have developed EnterpriseDocBench, a unified evaluation framework for enterprise document AI pipelines that assesses parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness on a single shared corpus. The corpus comprises public, permissively licensed documents from six enterprise domains, five of which are covered in the current pilot. Three pipelines were tested: BM25, dense embedding, and a hybrid, all using the same GPT-5 generator. Hybrid retrieval narrowly outperforms BM25 (nDCG@5 of 0.92 vs. 0.91), while dense embedding lags at 0.83. Hallucination rates do not vary monotonically with document length: they are higher for short and very long documents (28.1% and 23.8%, respectively) than for medium-length ones (9.2%). Cross-stage correlations are very weak, indicating that optimizing an individual stage does not guarantee better overall pipeline performance. The work is described in arXiv:2604.26382.
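
For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of how nDCG@k is commonly computed from graded relevance labels. It is not the benchmark's own scoring code. The function names and the choice of a linear gain with a log2 rank discount are assumptions made for illustration, and the ideal ranking here is derived from the list passed in, which should therefore contain the relevance labels of all judged documents in ranked order.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    # Rank 1 has discount log2(2) = 1, rank 2 has log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking.

    `relevances` holds the relevance label of each result, in the order the
    retriever returned them (hypothetical input format).
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all; define the score as 0
    return dcg_at_k(relevances, k) / ideal_dcg

# A ranking that places the only relevant document second instead of first.
score = ndcg_at_k([0, 1, 0, 0, 0], k=5)
```

A score of 1.0 means the top k results are already in the best possible order, so values such as 0.92 and 0.91 indicate rankings that are close to ideal under this kind of definition.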

Key facts

EnterpriseDocBench evaluates parsing, indexing, retrieval, and generation on the same corpus.
Corpus includes six enterprise domains, five in the current pilot.
Three pipelines tested: BM25, dense embedding, and hybrid.
All pipelines use GPT-5 as the generator.
Hybrid retrieval achieves nDCG@5 of 0.92, BM25 0.91, dense embedding 0.83.
Hallucination rates: short docs 28.1%, very long docs 23.8%, medium docs 9.2%.
Cross-stage correlations are very weak.
Paper available on arXiv with ID 2604.26382.
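
The summary does not say how the hybrid pipeline combines its lexical and dense signals. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), one common way to merge a BM25 ranking with a dense-embedding ranking. The choice of RRF, the function name, the constant k=60, and the document IDs are all assumptions, not details taken from the paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over the lists it
    appears in; k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical rankings from a lexical and a dense retriever.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses only rank positions, it needs no calibration between BM25 scores and embedding similarities, which is one reason it is often used as a simple hybrid baseline.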
