ARTFEED — Contemporary Art Intelligence

PSA-Eval: A Failure-Centered Framework for Evaluating Trilingual Public-Space Agents

other · 2026-04-29

The article presents PSA-Eval, a failure-centered framework for runtime evaluation of deployed trilingual public-space agents. It shifts the unit of analysis from input-output scores to failures observed in operational systems, and extends the usual evaluation chain with failure repair and regression testing. A pilot study was carried out on a real trilingual digital front-desk system in the lobby of an international financial institution, using a simplified single-foundation-model setting. The pilot found group-level cross-language policy drift; because only one foundation model was used, this drift should not be read as a difference between foundation models.

Key facts

  • PSA-Eval is a failure-centered runtime evaluation framework for trilingual public-space agents.
  • The basic unit of analysis shifts from score to failure.
  • The framework extends the Question -> Answer -> Score -> End chain with two further steps: failure-case repair and a regression batch.
  • Trilingual equivalent inputs are used as controlled probes for cross-language policy drift.
  • A pilot study was conducted on a real trilingual digital front-desk system.
  • The system is deployed in the lobby of an international financial institution.
  • The pilot used a simplified single-foundation-model setting (MA = MB).
  • Observed drift should not be interpreted as an A/B foundation-model difference.
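
The probe-and-regression idea in the key facts can be illustrated with a minimal Python sketch. It is not the framework's actual implementation: the names `Probe`, `Failure`, `find_drift`, `regression_batch`, and `stub_agent` are hypothetical, the language labels are placeholders (the summary does not name the three languages), and the policy decisions "answer" and "refuse" are assumed for illustration. The sketch sends equivalent inputs in three languages to one agent, records a failure whenever the policy decisions disagree, and collects the failing probes into a batch to re-run after repair.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Placeholder labels; the source does not say which three languages are used.
LANGS = ("lang_a", "lang_b", "lang_c")


@dataclass
class Probe:
    probe_id: str
    prompts: Dict[str, str]  # language label -> semantically equivalent input


@dataclass
class Failure:
    probe_id: str
    decisions: Dict[str, str]  # language label -> observed policy decision


def find_drift(probes: List[Probe],
               agent: Callable[[str, str], str]) -> List[Failure]:
    """Record a failure for every probe whose decisions differ by language."""
    failures = []
    for probe in probes:
        decisions = {lang: agent(lang, probe.prompts[lang]) for lang in LANGS}
        if len(set(decisions.values())) > 1:
            failures.append(Failure(probe.probe_id, decisions))
    return failures


def regression_batch(failures: List[Failure],
                     probes: List[Probe]) -> List[Probe]:
    """Select the probes that failed, to be re-run after a repair."""
    failed_ids = {f.probe_id for f in failures}
    return [p for p in probes if p.probe_id in failed_ids]


def stub_agent(lang: str, prompt: str) -> str:
    """Toy stand-in for a deployed agent: refuses one request in one language."""
    if lang == "lang_c" and "visitor badge" in prompt:
        return "refuse"
    return "answer"


probes = [
    Probe("p1", {lang: f"[{lang}] where is the restroom" for lang in LANGS}),
    Probe("p2", {lang: f"[{lang}] how do I get a visitor badge" for lang in LANGS}),
]

failures = find_drift(probes, stub_agent)
batch = regression_batch(failures, probes)
```

Because the same `stub_agent` handles every language, any disagreement it produces is drift across languages rather than a difference between models, which mirrors the MA = MB setting of the pilot.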

Entities

Institutions

  • arXiv

Sources