Model-Level Alignment Evaluation Insufficient for Deployment Claims
A recent arXiv preprint argues that deployment-relevant alignment cannot be established through model-level evaluation alone. The authors propose indexing alignment claims to the level of evidence that supports them: model, response, interaction, or deployment. A structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus and dual-coded against an eight-dimension rubric (Cohen's kappa = 0.87), found that no benchmark examined supports user-facing verification and that process steerability is nearly absent. The few interaction-level benchmarks identified include τ-bench. The paper thus challenges the common practice of citing model-level scores to substantiate alignment claims about deployed systems.
Key facts
- Paper argues deployment-relevant alignment cannot be inferred from model-level evaluation alone.
- Alignment claims should be indexed to model-level, response-level, interaction-level, or deployment-level evidence.
- Structured audit of 11 alignment benchmarks extended to 16-benchmark corpus.
- Dual-coded against eight-dimension rubric with Cohen's kappa = 0.87.
- User-facing verification support absent across every benchmark examined.
- Process steerability nearly absent in benchmarks.
- Few interaction-level benchmarks identified, τ-bench among them.
- Paper challenges use of model-level scores for deployment claims.
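The inter-rater agreement figure cited above (Cohen's kappa = 0.87) measures how much two coders' rubric labels agree beyond what chance alone would produce. As a minimal sketch (not code from the paper), kappa for two raters can be computed like this:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labeled independently,
    # each according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0.87, by the conventional Landis–Koch interpretation, indicates almost perfect agreement between the two coders, lending credibility to the rubric-based audit.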
Entities
Institutions
- arXiv