EVA-Bench: New Framework for Evaluating Voice Agents
EVA-Bench is an end-to-end evaluation framework for voice agents, AI systems that carry out spoken conversations to complete tasks. The framework addresses two core problems: generating realistic simulated conversations and measuring quality across voice-specific failure modes. On the simulation side, EVA-Bench runs bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic validation that catches user-simulator mistakes and regenerates conversations before evaluation. On the measurement side, it introduces two composite metrics: EVA-A (Accuracy), covering task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), covering conversation progression, spoken conciseness, and turn-taking timing. The framework targets enterprise applications, where voice agents are increasingly deployed. The work is available on arXiv as preprint 2605.13841.
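The summary above does not include any code from the paper. As a rough illustration of the simulation-and-validation loop it describes (bot-to-bot audio turns, automatic validation of the user simulator, regeneration before evaluation), here is a minimal Python sketch. Every class, method, and function name in it (`simulate_dialogue`, `simulate_until_valid`, `user_sim`, `validator`, etc.) is a hypothetical placeholder, not EVA-Bench's actual API.

```python
# Illustrative sketch of a bot-to-bot simulation loop with automatic
# validation and regeneration. All names here are hypothetical placeholders,
# not EVA-Bench's real interfaces.
from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str      # "user_sim" or "agent"
    audio: bytes      # raw audio for the turn
    transcript: str   # transcript of the turn


@dataclass
class Dialogue:
    scenario_id: str
    turns: list[Turn] = field(default_factory=list)


def simulate_dialogue(user_sim, agent, scenario, max_turns: int = 20) -> Dialogue:
    """Run a dynamic multi-turn audio conversation between two bots."""
    dialogue = Dialogue(scenario_id=scenario["id"])
    user_audio = user_sim.open_conversation(scenario)   # first user utterance
    for _ in range(max_turns):
        dialogue.turns.append(Turn("user_sim", user_audio, user_sim.last_transcript))
        agent_audio = agent.respond(user_audio)          # agent replies in audio
        dialogue.turns.append(Turn("agent", agent_audio, agent.last_transcript))
        if user_sim.is_done():                           # task completed or abandoned
            break
        user_audio = user_sim.respond(agent_audio)
    return dialogue


def simulate_until_valid(user_sim, agent, scenario, validator, max_attempts: int = 3) -> Dialogue:
    """Regenerate conversations whose user-simulator side fails validation."""
    for _ in range(max_attempts):
        dialogue = simulate_dialogue(user_sim, agent, scenario)
        if validator.is_valid(dialogue):   # e.g. flags off-script or contradictory user turns
            return dialogue
    raise RuntimeError("could not produce a valid simulated conversation")
```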
Key facts
- EVA-Bench is an end-to-end evaluation framework for voice agents.
- It addresses generating realistic simulated conversations and measuring quality across voice-specific failure modes.
- Simulation side: bot-to-bot audio conversations over dynamic multi-turn dialogues with automatic validation.
- Measurement side: two composite metrics, EVA-A (Accuracy) and EVA-X (Experience); a scoring sketch follows this list.
- EVA-A captures task completion, faithfulness, and audio-level speech fidelity.
- EVA-X captures conversation progression, spoken conciseness, and turn-taking timing.
- Voice agents are AI systems that conduct spoken conversations to complete tasks.
- The framework targets enterprise applications.
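The summary lists the sub-dimensions of EVA-A and EVA-X but not how they are aggregated. The sketch below assumes, purely for illustration, equal-weight averaging of per-dimension scores in [0, 1]; this is not the paper's actual aggregation scheme.

```python
# Illustrative composite scoring for EVA-A and EVA-X. Sub-dimension names
# follow the summary above; equal-weight averaging is an assumption.
from statistics import mean


def eva_a(task_completion: float, faithfulness: float, speech_fidelity: float) -> float:
    """EVA-A (Accuracy): task completion, faithfulness, audio-level speech fidelity."""
    return mean([task_completion, faithfulness, speech_fidelity])


def eva_x(progression: float, conciseness: float, turn_taking: float) -> float:
    """EVA-X (Experience): conversation progression, spoken conciseness, turn-taking timing."""
    return mean([progression, conciseness, turn_taking])


# Example: per-dimension scores in [0, 1] produced by upstream judges or metrics.
print(eva_a(task_completion=1.0, faithfulness=0.8, speech_fidelity=0.9))  # 0.9
print(eva_x(progression=0.7, conciseness=0.9, turn_taking=0.8))           # 0.8
```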
Entities
Institutions
- arXiv