Trajectory Proper Score for Agentic Uncertainty Quantification

other · 2026-05-26

The Trajectory Proper Score (TPS) is a new scoring rule for evaluating uncertainty quantification in language-model agents. The authors argue that existing evaluation tools (AUROC, AUPRC, risk-coverage curves, Trajectory ECE, and scalarized trajectory scores) conflate ranking usefulness with probabilistic truthfulness. TPS is a predictor-agnostic family of strictly proper trajectory-level scoring rules that elicit the full prefix-conditioned success-probability trace. The authors prove that TPS strictly elicits the success-probability process under complete observation, and they extend the construction to administratively censored trajectories. The work builds on prequential proper scoring and is detailed in arXiv:2605.24756.
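The distinction between ranking usefulness and probabilistic truthfulness can be illustrated with a minimal sketch. It is not taken from the paper: the labels and confidences are made-up toy values, and `auroc` and `brier` are hypothetical helper names. A strictly increasing distortion of the reported confidences leaves AUROC unchanged, while a proper score such as the Brier score registers the change.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(probs, labels):
    """Mean squared error between reported probabilities and binary outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


# Toy data, invented for illustration only.
labels = [1, 1, 0, 1, 0, 0]
reported = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

# p -> p**4 is strictly increasing on [0, 1], so the ranking is preserved
# while the probability values themselves change.
distorted = [p ** 4 for p in reported]

print("AUROC:", auroc(reported, labels), auroc(distorted, labels))
print("Brier:", brier(reported, labels), brier(distorted, labels))
```

Because AUROC depends only on the ordering of the scores, it cannot distinguish the two reports, whereas the Brier score does.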

Key facts

TPS is a family of strictly proper trajectory-level scoring rules.
Existing evaluation tools (AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized trajectory scores) are criticized for conflating ranking usefulness with probabilistic truthfulness.
TPS elicits the success-probability trace q_t = P^π(Y=1 | H_t), the probability of success (Y=1) conditioned on the trajectory prefix H_t.
TPS is predictor-agnostic.
TPS is proven strictly proper under complete observation.
Extension to administratively censored trajectories is provided.
The work is based on prequential proper scoring.
The paper is available on arXiv with ID 2605.24756.
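As a rough illustration of what eliciting a prefix-conditioned trace means, the sketch below averages a per-step Brier score over a two-step trajectory and checks, by exact enumeration, that reporting the true conditional probabilities gives the lowest expected score among three candidate forecasters. This is an assumption-laden toy and not the paper's TPS definition, which this summary does not state: the process (`P_SIGNAL`, `P_SUCCESS_GIVEN_SIGNAL`), the averaged Brier form, and all function names are hypothetical.

```python
from itertools import product

# Hypothetical toy process (assumed for illustration, not from the paper):
# after step 0 the agent observes a binary signal s, which shifts the
# probability that the episode ends in success (Y = 1).
P_SIGNAL = {0: 0.5, 1: 0.5}
P_SUCCESS_GIVEN_SIGNAL = {0: 0.2, 1: 0.8}


def trajectory_brier(trace, outcome):
    """Average per-step Brier loss of a forecast trace against the final outcome."""
    return sum((p - outcome) ** 2 for p in trace) / len(trace)


def expected_score(forecaster):
    """Exact expected trajectory loss under the toy process (lower is better)."""
    total = 0.0
    for s, y in product((0, 1), (0, 1)):
        p_success = P_SUCCESS_GIVEN_SIGNAL[s]
        p_y = p_success if y == 1 else 1.0 - p_success
        total += P_SIGNAL[s] * p_y * trajectory_brier(forecaster(s), y)
    return total


def truthful(s):
    # q_0 marginalizes over the unseen signal; q_1 conditions on it.
    q0 = sum(P_SIGNAL[k] * P_SUCCESS_GIVEN_SIGNAL[k] for k in P_SIGNAL)
    return [q0, P_SUCCESS_GIVEN_SIGNAL[s]]


def overconfident(s):
    return [0.5, 0.99 if s == 1 else 0.01]


def uninformative(s):
    return [0.5, 0.5]


for name, f in [("truthful", truthful),
                ("overconfident", overconfident),
                ("uninformative", uninformative)]:
    print(name, expected_score(f))
```

In this toy, each step's Brier term is strictly proper for the conditional probability at that step, so their average is minimized in expectation only by the true trace. The paper's construction, including its handling of censored trajectories, may differ.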
