PSA-Eval: A Failure-Centered Framework for Evaluating Trilingual Public-Space Agents
The article presents PSA-Eval, a framework for runtime evaluation of deployed trilingual public-space agents that shifts the focus from traditional input-output scoring to failure analysis in operational systems. It extends the typical evaluation chain with failure repair and regression testing. A pilot study was carried out on a real trilingual digital front-desk system deployed in the lobby of an international financial institution, using a simplified single-foundation-model setting. The study observed group-level cross-language policy drift; because only one foundation model was used, this drift should not be interpreted as a difference between foundation models.
Key facts
- PSA-Eval is a failure-centered runtime evaluation framework for trilingual public-space agents.
- The basic unit of analysis shifts from score to failure.
- The framework extends the usual Question -> Answer -> Score -> End chain with failure-case repair and a regression batch.
- Trilingual equivalent inputs are used as controlled probes for cross-language policy drift.
- A pilot study was conducted on a real trilingual digital front-desk system.
- The system is deployed in the lobby of an international financial institution.
- The pilot used a simplified single-foundation-model setting (MA = MB, i.e. the A and B model slots hold the same foundation model).
- Observed drift should not be interpreted as an A/B foundation-model difference.
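
The extended chain and the trilingual probes described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the framework's actual implementation: the names `Probe`, `FailureCase`, `evaluate`, and `regression_batch`, the placeholder language codes, the toy agents, and the "answer"/"refuse" decision labels are all hypothetical. It also flags drift per probe, which is a simplification of the group-level drift reported in the pilot.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Placeholder language codes; the summary does not name the three languages.
LANGS = ("lang_a", "lang_b", "lang_c")

# An agent maps (language, input text) to a policy decision label.
Agent = Callable[[str, str], str]


@dataclass
class Probe:
    """One trilingual probe: the same request expressed in each language."""
    probe_id: str
    prompts: Dict[str, str]


@dataclass
class FailureCase:
    """The unit of analysis: a probe whose decisions differ across languages."""
    probe_id: str
    decisions: Dict[str, str]
    repaired: bool = False


def run_probe(agent: Agent, probe: Probe) -> Dict[str, str]:
    return {lang: agent(lang, text) for lang, text in probe.prompts.items()}


def detect_drift(probe: Probe, decisions: Dict[str, str]) -> Optional[FailureCase]:
    # Equivalent inputs should yield one decision; more than one is drift.
    if len(set(decisions.values())) > 1:
        return FailureCase(probe.probe_id, decisions)
    return None


def evaluate(agent: Agent, probes: List[Probe]) -> List[FailureCase]:
    """Question -> Answer -> (check) -> collect failure cases instead of a score."""
    failures = []
    for probe in probes:
        case = detect_drift(probe, run_probe(agent, probe))
        if case is not None:
            failures.append(case)
    return failures


def regression_batch(agent: Agent, probes: List[Probe],
                     failures: List[FailureCase]) -> List[str]:
    """Re-run previously failing probes; return the ids that still drift."""
    by_id = {p.probe_id: p for p in probes}
    still_failing = []
    for case in failures:
        probe = by_id[case.probe_id]
        if detect_drift(probe, run_probe(agent, probe)) is not None:
            still_failing.append(case.probe_id)
        else:
            case.repaired = True
    return still_failing


# Toy agents standing in for a deployed system before and after a repair.
def drifting_agent(lang: str, text: str) -> str:
    if "visitor badge" in text and lang == "lang_c":
        return "refuse"
    return "answer"


def repaired_agent(lang: str, text: str) -> str:
    return "answer"


PROBES = [
    Probe("badge", {l: f"[{l}] where do I get a visitor badge" for l in LANGS}),
    Probe("hours", {l: f"[{l}] what are the opening hours" for l in LANGS}),
]

if __name__ == "__main__":
    found = evaluate(drifting_agent, PROBES)
    print([c.probe_id for c in found])
    print(regression_batch(repaired_agent, PROBES, found))
```

Because the same agent callable serves every language, any disagreement the sketch flags comes from the language of the input rather than from a model swap, which mirrors the single-foundation-model caveat above.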
Entities
Institutions
- arXiv