Model-Level Alignment Evaluation Insufficient for Deployment Claims
A recent arXiv preprint argues that deployment-relevant alignment cannot be established through model-level evaluation alone. The authors propose indexing alignment claims to the level of evidence that supports them: model, response, interaction, or deployment. A structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus and dual-coded against an eight-dimension rubric (Cohen's kappa = 0.87), found that no benchmark examined supports user-facing verification and that process steerability is nearly absent. The few interaction-level benchmarks identified include τ-bench. The paper thus challenges the common practice of citing model-level scores to substantiate alignment claims about deployed systems.
Key facts
- Paper argues deployment-relevant alignment cannot be inferred from model-level evaluation alone.
- Alignment claims should be indexed to model-level, response-level, interaction-level, or deployment-level evidence.
- Structured audit of 11 alignment benchmarks extended to 16-benchmark corpus.
- Dual-coded against eight-dimension rubric with Cohen's kappa = 0.87.
- User-facing verification support absent across every benchmark examined.
- Process steerability nearly absent in benchmarks.
- Few interaction-level benchmarks identified, τ-bench among them.
- Paper challenges use of model-level scores for deployment claims.
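The inter-rater agreement figure cited above (Cohen's kappa = 0.87) measures how much two coders' rubric labels agree beyond what chance alone would produce. As a minimal sketch (not code from the paper), kappa for two raters can be computed like this:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labeled independently,
    # each according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0.87, by the conventional Landis–Koch interpretation, indicates almost perfect agreement between the two coders, lending credibility to the rubric-based audit.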
Entities
Institutions
- arXiv