ARTFEED — Contemporary Art Intelligence

Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems

ai-technology · 2026-05-09

Partial Evidence Bench is a deterministic benchmark for a specific failure mode: enterprise agents that produce seemingly complete answers even though access controls have withheld crucial evidence. The benchmark comprises 72 tasks across three scenario families: due diligence, compliance audit, and security incident response. It ships with ACL-partitioned corpora, oracle answers for both the complete view and the authorized view, oracle completeness judgments, and structured gap-report oracles. Systems are evaluated on four surfaces: answer correctness, completeness awareness, gap-report quality, and unsafe completeness behavior. The checked-in baselines show that silent filtering is catastrophically unsafe.

Key facts

  • Partial Evidence Bench is a deterministic benchmark for authorization-limited evidence in agentic systems.
  • It addresses the failure mode where access control is enforced correctly but answers appear complete despite missing evidence.
  • The benchmark includes 72 tasks across three scenario families: due diligence, compliance audit, and security incident response.
  • It uses ACL-partitioned corpora and provides oracle answers for complete and authorized views.
  • Evaluation covers four surfaces: answer correctness, completeness awareness, gap-report quality, and unsafe completeness behavior.
  • Checked-in baselines show silent filtering is catastrophically unsafe.
  • The benchmark ships with oracle completeness judgments and structured gap-report oracles.
  • The paper is available on arXiv under reference 2605.05379.
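
To make the failure mode in the list above concrete, here is a minimal Python sketch of an ACL-partitioned corpus, the authorized view a caller receives, and an oracle-side gap report. It is not the benchmark's code or API: the names Doc, authorized_view, and gap_report, the "relevant" label, and the sample documents are all assumptions made for illustration. The gap report is computed with full corpus visibility, as an oracle would, not as an agent could.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    acl: frozenset   # groups allowed to read this document (hypothetical field)
    relevant: bool   # hypothetical oracle label: is it needed for the task?


def authorized_view(corpus, groups):
    """Return only the documents the caller may read (ACL enforced correctly)."""
    return [d for d in corpus if d.acl & groups]


def gap_report(corpus, groups):
    """Oracle-side summary of what the authorized view lacks, without content."""
    visible = authorized_view(corpus, groups)
    needed = [d for d in corpus if d.relevant]
    missing = [d for d in needed if d not in visible]
    return {
        "complete": not missing,
        "relevant_total": len(needed),
        "relevant_withheld": len(missing),
    }


# Illustrative corpus; the document IDs and groups are made up.
corpus = [
    Doc("contract-1", frozenset({"legal", "finance"}), True),
    Doc("audit-7", frozenset({"compliance"}), True),
    Doc("memo-3", frozenset({"finance"}), False),
]

caller_groups = frozenset({"finance"})
visible_ids = [d.doc_id for d in authorized_view(corpus, caller_groups)]
report = gap_report(corpus, caller_groups)
```

In this sketch a caller in the finance group sees contract-1 and memo-3 but not audit-7. An agent that answers from the visible documents alone and says nothing further is doing silent filtering; an agent that signals one relevant item was withheld is showing the completeness awareness the benchmark is described as measuring.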

Entities

Institutions

  • arXiv
