Trajectory Proper Score for Agentic Uncertainty Quantification
A new family of scoring rules, the Trajectory Proper Score (TPS), has been introduced for evaluating uncertainty quantification in language-model agents. The authors argue that existing evaluation measures such as AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized trajectory scores conflate ranking usefulness with probabilistic truthfulness. TPS is a predictor-agnostic family of strictly proper trajectory-level scoring rules that elicit the full prefix-conditioned success-probability trace. Under complete observation, TPS is proven to strictly elicit the success-probability process, and the construction extends to administratively censored trajectories. The work builds on prequential proper scoring and is detailed in arXiv:2605.24756.
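The distinction between ranking usefulness and probabilistic truthfulness can be illustrated with a small sketch. It is not taken from the paper: the labels, the forecast values, and the helper names `auroc` and `brier` are all hypothetical, and the Brier score stands in here as a generic proper scoring rule. The point is only that a monotone distortion of the reported probabilities leaves a ranking metric unchanged while a proper score does register the change.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(scores, labels):
    """Mean squared error between reported probabilities and binary outcomes."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


# Hypothetical outcomes and success-probability reports (illustration only).
labels = [1, 0, 1, 1, 0, 0]
forecast = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2]

# A strictly increasing distortion: the ordering is preserved, the values are not.
distorted = [s ** 3 for s in forecast]

print("AUROC:", auroc(forecast, labels), auroc(distorted, labels))
print("Brier:", brier(forecast, labels), brier(distorted, labels))
```

Because AUROC depends only on the ordering of the scores, it cannot tell the two reports apart, whereas the Brier score assigns them different values.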
Key facts
- TPS is a family of strictly proper trajectory-level scoring rules.
- Existing measures such as AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized scores are criticized for conflating ranking usefulness with probabilistic truthfulness.
- TPS elicits the prefix-conditioned success-probability trace q_t = P^π(Y=1 | H_t).
- TPS is predictor-agnostic.
- TPS is proven strictly proper under complete observation.
- Extension to administratively censored trajectories is provided.
- The work is based on prequential proper scoring.
- The paper is available on arXiv with ID 2605.24756.
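As a rough illustration of what eliciting the trace q_t means, the sketch below scores a reported trace with a weighted sum of per-step Brier losses against the final outcome Y. This is an assumption made for illustration, in the spirit of prequential scoring, and not the paper's definition of TPS; the toy process, its probabilities, the loss orientation (lower is better), and the names `trajectory_brier` and `expected_score` are all hypothetical. Censored trajectories are not handled.

```python
from itertools import product

# Hypothetical toy process: a binary signal S is revealed between step 0 and step 1.
P_SIGNAL = {0: 0.5, 1: 0.5}    # P(S = s)
P_SUCCESS = {0: 0.3, 1: 0.9}   # P(Y = 1 | S = s), i.e. q_1 for each prefix


def trajectory_brier(trace, y, weights=None):
    """Weighted sum of per-step squared errors against the final outcome y."""
    weights = weights or [1.0] * len(trace)
    return sum(w * (p - y) ** 2 for w, p in zip(weights, trace))


def expected_score(p0, p1_by_signal):
    """Exact expected loss of a reporting rule, enumerating all (S, Y) pairs."""
    total = 0.0
    for s, y in product((0, 1), (0, 1)):
        p_y = P_SUCCESS[s] if y == 1 else 1.0 - P_SUCCESS[s]
        trace = [p0, p1_by_signal[s]]  # report before and after seeing S
        total += P_SIGNAL[s] * p_y * trajectory_brier(trace, y)
    return total


# True trace: q_0 is the marginal success probability, q_1 conditions on S.
true_q0 = sum(P_SIGNAL[s] * P_SUCCESS[s] for s in (0, 1))
truthful = expected_score(true_q0, P_SUCCESS)
overconfident = expected_score(0.9, {0: 0.1, 1: 0.99})

print("truthful:", truthful)
print("overconfident:", overconfident)
```

In this toy setting the expected loss is smallest when every reported value equals the conditional success probability given the prefix observed so far, which is the behaviour a strictly proper trajectory-level rule is meant to guarantee.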
Entities
Institutions
- arXiv