EVA-Bench: New Framework for Evaluating Voice Agents
EVA-Bench is an end-to-end evaluation framework for voice agents, AI systems that carry out spoken conversations to complete tasks. The framework addresses two core problems: generating realistic simulated conversations and measuring quality across voice-specific failure modes. On the simulation side, EVA-Bench runs bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic validation that catches user-simulator mistakes and regenerates conversations before evaluation. On the measurement side, it introduces two composite metrics: EVA-A (Accuracy), covering task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), covering conversation progression, spoken conciseness, and turn-taking timing. The framework targets enterprise applications, where voice agents are increasingly deployed. The work is available on arXiv as preprint 2605.13841.
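The summary above does not include any code from the paper. As a rough illustration of the simulation-and-validation loop it describes (bot-to-bot audio turns, automatic validation of the user simulator, regeneration before evaluation), here is a minimal Python sketch. Every class, method, and function name in it (`simulate_dialogue`, `simulate_until_valid`, `user_sim`, `validator`, etc.) is a hypothetical placeholder, not EVA-Bench's actual API.

```python
# Illustrative sketch of a bot-to-bot simulation loop with automatic
# validation and regeneration. All names here are hypothetical placeholders,
# not EVA-Bench's real interfaces.
from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str      # "user_sim" or "agent"
    audio: bytes      # raw audio for the turn
    transcript: str   # transcript of the turn


@dataclass
class Dialogue:
    scenario_id: str
    turns: list[Turn] = field(default_factory=list)


def simulate_dialogue(user_sim, agent, scenario, max_turns: int = 20) -> Dialogue:
    """Run a dynamic multi-turn audio conversation between two bots."""
    dialogue = Dialogue(scenario_id=scenario["id"])
    user_audio = user_sim.open_conversation(scenario)   # first user utterance
    for _ in range(max_turns):
        dialogue.turns.append(Turn("user_sim", user_audio, user_sim.last_transcript))
        agent_audio = agent.respond(user_audio)          # agent replies in audio
        dialogue.turns.append(Turn("agent", agent_audio, agent.last_transcript))
        if user_sim.is_done():                           # task completed or abandoned
            break
        user_audio = user_sim.respond(agent_audio)
    return dialogue


def simulate_until_valid(user_sim, agent, scenario, validator, max_attempts: int = 3) -> Dialogue:
    """Regenerate conversations whose user-simulator side fails validation."""
    for _ in range(max_attempts):
        dialogue = simulate_dialogue(user_sim, agent, scenario)
        if validator.is_valid(dialogue):   # e.g. flags off-script or contradictory user turns
            return dialogue
    raise RuntimeError("could not produce a valid simulated conversation")
```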
Key facts
- EVA-Bench is an end-to-end evaluation framework for voice agents.
- It addresses generating realistic simulated conversations and measuring quality across voice-specific failure modes.
- Simulation side: bot-to-bot audio conversations over dynamic multi-turn dialogues with automatic validation.
- Measurement side: two composite metrics, EVA-A (Accuracy) and EVA-X (Experience); a scoring sketch follows this list.
- EVA-A captures task completion, faithfulness, and audio-level speech fidelity.
- EVA-X captures conversation progression, spoken conciseness, and turn-taking timing.
- Voice agents are AI systems that conduct spoken conversations to complete tasks.
- The framework targets enterprise applications.
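The summary lists the sub-dimensions of EVA-A and EVA-X but not how they are aggregated. The sketch below assumes, purely for illustration, equal-weight averaging of per-dimension scores in [0, 1]; this is not the paper's actual aggregation scheme.

```python
# Illustrative composite scoring for EVA-A and EVA-X. Sub-dimension names
# follow the summary above; equal-weight averaging is an assumption.
from statistics import mean


def eva_a(task_completion: float, faithfulness: float, speech_fidelity: float) -> float:
    """EVA-A (Accuracy): task completion, faithfulness, audio-level speech fidelity."""
    return mean([task_completion, faithfulness, speech_fidelity])


def eva_x(progression: float, conciseness: float, turn_taking: float) -> float:
    """EVA-X (Experience): conversation progression, spoken conciseness, turn-taking timing."""
    return mean([progression, conciseness, turn_taking])


# Example: per-dimension scores in [0, 1] produced by upstream judges or metrics.
print(eva_a(task_completion=1.0, faithfulness=0.8, speech_fidelity=0.9))  # 0.9
print(eva_x(progression=0.7, conciseness=0.9, turn_taking=0.8))           # 0.8
```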
Entities
Institutions
- arXiv