EVA-Bench: A New End-to-End Evaluation Framework for Voice Agents
Abstract
Voice agents, AI systems that conduct spoken conversations to accomplish tasks, are increasingly deployed in enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full set of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user-simulator errors and regenerates the affected conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), which captures task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), which captures conversation progression, spoken conciseness, and turn-taking. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench comprises 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, and pass^k measures that distinguish peak performance from reliable performance.
Across 12 systems spanning the three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median gap between pass@k and pass^k of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects that vary across architectures, systems, and metrics (mean Δ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
One-sentence Summary
ServiceNow's EVA-Bench framework evaluates voice agents through validated bot-to-bot audio conversations and the composite metrics EVA-A pass@1 (Accuracy) and EVA-X pass@1 (Experience), applied to 213 enterprise scenarios and 12 systems, finding no system exceeds 0.5 on both and revealing robustness gaps under accent/noise perturbations; the framework is open-source.
Key Contributions
- EVA-Bench provides a validation-gated bot-to-bot simulation framework that automatically detects user simulator errors and regenerates conversations, ensuring reliable multi-turn audio interactions for voice agent evaluation.
- It defines two composite metrics: EVA-A (Accuracy) for task completion, faithfulness, and audio-level speech fidelity, and EVA-X (Experience) for conversation progression, spoken conciseness, and turn-taking timing, enabling direct comparison across all major agent architectures.
- The benchmark includes 213 enterprise scenarios across three domains, a controlled perturbation suite for accent and noise robustness, and multi-trial consistency metrics (pass@1, pass@k, pass^k) to distinguish peak from reliable performance.
Introduction
Voice agents must handle spoken conversations under constraints that have no direct text analog, such as ephemeral linear speech, real-time turn-taking, and variable acoustic conditions. Existing text-based benchmarks cannot capture critical voice-specific failure modes like policy violations or spoken entity errors, and prior voice benchmarks rely on scripted or single-turn interactions without validated simulation consistency or comprehensive quality measurement. The authors introduce EVA-Bench, an end-to-end evaluation framework that uses validation-gated bot-to-bot simulation with controlled acoustic perturbations and defines joint accuracy (EVA-A) and experience (EVA-X) metrics, enabling rigorous comparison of cascade and audio-native voice agents under identical conditions.
Dataset
The EVA-Bench dataset is a purpose-built benchmark for evaluating voice agents in task-oriented enterprise scenarios. It is designed to surface voice-specific failures (e.g., misheard codes or names) under controlled acoustic and behavioral conditions.
- Composition and sources:
  - Three domains: Airline Customer Service Management (CSM), Healthcare Human Resources Service Delivery (HRSD), and Enterprise IT Service Management (ITSM).
  - Each domain contains multiple scenarios, each focused on a high-contact task such as flight rebooking.
  - A scenario consists of a user goal with explicit constraints and a decision tree, a user persona (speaking style, patience), a scenario database that the agent’s tools can query and modify, and ground truth indicating the expected final database state.
- Key details and filtering rules:
  - Scenarios are handcrafted by the authors to reflect realistic enterprise use cases; no external sources are mentioned.
  - Selection emphasizes tasks where users are most likely to call an agent, and includes entities (confirmation codes, IDs, names) that are commonly misheard in spoken interactions.
  - The decision tree eliminates ambiguity, enabling repeatable evaluation.
  - The text does not provide exact scenario counts; full construction details appear in Appendix C.
- How the dataset is used:
  - EVA-Bench serves as an evaluation-only benchmark, not a training set.
  - The authors conduct fully automated bot-to-bot conversations: a user simulator takes the scenario’s goal, decision tree, and persona, and communicates with the agent over a live audio WebSocket.
  - Both sides interact solely through audio, making the benchmark compatible with cascade and audio-native architectures.
  - A perturbation suite independently varies acoustic factors (accent, background noise, connection quality) and behavioral factors (personality, speaking style) to isolate each factor’s impact.
- Processing and validation:
  - Each simulated conversation is automatically validated before metrics are computed.
  - “User Behavioral Fidelity” uses an LLM-as-Judge to verify the simulator followed the intended user goal.
  - “User Speech Fidelity” uses a LALM-as-Judge to check that the spoken audio matched the intended content.
  - Conversations failing either check are regenerated; about 12% of trials required reruns, almost entirely due to user behavioral drift.
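The scenario components listed under "Composition and sources" can be pictured as a small record type. The following is only an illustrative sketch: the class and field names (`Scenario`, `Persona`, `expected_final_state`, and so on) are hypothetical and are not taken from the EVA-Bench codebase.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Persona:
    # Hypothetical persona fields, mirroring "speaking style, patience".
    speaking_style: str
    patience: str


@dataclass
class Scenario:
    # All field names are assumptions made for illustration.
    domain: str                           # e.g. "CSM", "HRSD", "ITSM"
    user_goal: str                        # what the simulated caller wants
    constraints: List[str]                # explicit constraints on the goal
    decision_tree: Dict[str, Any]         # resolves ambiguous branches
    persona: Persona
    database: Dict[str, Any]              # state the agent's tools query and modify
    expected_final_state: Dict[str, Any]  # ground truth for task completion


example = Scenario(
    domain="CSM",
    user_goal="Rebook a cancelled flight",
    constraints=["same day", "no extra fee"],
    decision_tree={"no_same_day_flight": "ask for a refund"},
    persona=Persona(speaking_style="terse", patience="low"),
    database={"bookings": {"ABC123": {"flight": "XY100"}}},
    expected_final_state={"bookings": {"ABC123": {"flight": "XY200"}}},
)
```

Keeping the expected final database state next to the initial database is what would allow task completion to be checked without a human in the loop.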
Method
The authors propose EVA-Bench, a comprehensive framework for evaluating voice agents through fully automated, multi-turn bot-to-bot conversations. The overall architecture orchestrates parallel audio sessions over a live WebSocket, enabling the evaluation of both cascade and audio-native architectures under identical conditions.
The overall framework is illustrated below:
The simulation process begins with specific inputs and conditions. The system utilizes a dataset of enterprise scenarios spanning domains like Airline Customer Service Management, Healthcare HR Service Delivery, and Enterprise IT Service Management. Each scenario provides a user goal with explicit constraints, a decision tree to eliminate ambiguity, and a scenario database. The framework also applies controlled perturbations, independently varying acoustic conditions such as accents and background noise, and behavioral conditions like personality and speaking style to disentangle their effects on agent performance.
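One way to read "independently varying" is a one-factor-at-a-time design, in which each condition differs from a clean baseline in exactly one factor. The sketch below assumes that reading; the factor levels (`"accented"`, `"impatient"`, and so on) are hypothetical placeholders, and the actual suite may also include combined conditions.

```python
from typing import Dict, List

# Hypothetical baseline and factor levels, for illustration only.
BASELINE: Dict[str, str] = {
    "accent": "none",
    "noise": "none",
    "personality": "neutral",
    "speaking_style": "neutral",
}

FACTORS: Dict[str, List[str]] = {
    "accent": ["none", "accented"],
    "noise": ["none", "background_noise"],
    "personality": ["neutral", "impatient"],
    "speaking_style": ["neutral", "verbose"],
}


def one_factor_conditions(
    baseline: Dict[str, str], factors: Dict[str, List[str]]
) -> List[Dict[str, str]]:
    """Return the baseline plus one condition per non-baseline factor level."""
    conditions = [dict(baseline)]
    for name, levels in factors.items():
        for level in levels:
            if level != baseline[name]:
                # Change a single factor and keep the rest at baseline.
                conditions.append({**baseline, name: level})
    return conditions


conditions = one_factor_conditions(BASELINE, FACTORS)
```

Because every perturbed condition shares all other settings with the baseline, a score difference can be attributed to the single factor that changed.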
During the conversation simulation, a User Simulator configured with a scenario-specific goal, persona, and conversational text-to-speech voice interacts with the Voice Agent under test. The Voice Agent supports multiple architectures, including cascade and audio-native pipelines. A deterministic Tool Executor handles all agent tool calls, modifying the trial-specific environment.
Before any evaluation metrics are computed, the completed conversations pass through an automated Simulation Validation phase. This module checks for per-trial conversation artifacts using two primary judges. User Behavioral Fidelity employs an LLM-as-Judge to verify that the simulator faithfully executed its assigned goal without deviations. User Speech Fidelity uses a LALM-as-Judge to ensure the simulator's spoken audio accurately conveyed its intended content. Conversations failing these checks are automatically regenerated, ensuring that evaluation scores reflect agent behavior rather than simulator artifacts.
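The validation gate described above amounts to a retry loop: simulate, run every fidelity check, and regenerate on failure. The sketch below is a minimal illustration, not the authors' implementation; `run_validated_trial`, the `max_attempts` cap, and the stand-in validators are all assumptions, and real validators would wrap the LLM and LALM judges.

```python
from typing import Any, Callable, List, Optional, Tuple

Conversation = Any  # placeholder type for a recorded conversation


def run_validated_trial(
    simulate: Callable[[], Conversation],
    validators: List[Callable[[Conversation], bool]],
    max_attempts: int = 3,  # hypothetical retry cap
) -> Tuple[Optional[Conversation], int]:
    """Regenerate a conversation until every validator accepts it.

    Returns (conversation, attempts_used); conversation is None if no
    attempt passed validation within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        conversation = simulate()
        if all(check(conversation) for check in validators):
            return conversation, attempt
    return None, max_attempts


# Usage with stand-in data: the first simulated conversation drifts
# off-goal, the second is clean.
_outcomes = iter([
    {"on_goal": False, "audio_ok": True},
    {"on_goal": True, "audio_ok": True},
])
conversation, attempts = run_validated_trial(
    simulate=lambda: next(_outcomes),
    validators=[lambda c: c["on_goal"], lambda c: c["audio_ok"]],
)
```

Only conversations returned by such a gate would reach scoring, so agent metrics are not penalized for simulator mistakes.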
Valid conversations then enter the Voice Agent Quality Measurements phase, which evaluates performance across three layered metric categories. EVA-A measures Accuracy through Task Completion via deterministic database state hashing, Faithfulness using an LLM-as-Judge to ensure actions remain grounded in instructions and tool results, and Speech Fidelity to verify the accurate spoken reproduction of high-stakes named entities. EVA-X assesses Experience by evaluating Conversation Progression, Conciseness, and Turn-Taking, which uses timestamp-based scoring to measure interruption and latency. Additionally, Diagnostic Metrics provide granular failure analysis, such as verifying the transcription accuracy of key entities.
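Task completion by "deterministic database state hashing" can be illustrated by hashing a canonical serialization of the final database and comparing it with the hash of the expected state. This is a sketch under assumptions: the source does not specify the serialization or the hash function, so canonical JSON and SHA-256 are choices made here, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def state_hash(db: Dict[str, Any]) -> str:
    """Hash a JSON-serializable database state independently of key order."""
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def task_completion(final_db: Dict[str, Any], expected_db: Dict[str, Any]) -> float:
    """Return 1.0 if the final state matches the expected state exactly, else 0.0."""
    return 1.0 if state_hash(final_db) == state_hash(expected_db) else 0.0


expected = {"bookings": {"ABC123": {"flight": "XY200", "seat": "12A"}}}
same_state_other_order = {"bookings": {"ABC123": {"seat": "12A", "flight": "XY200"}}}
wrong_flight = {"bookings": {"ABC123": {"flight": "XY100", "seat": "12A"}}}
```

Sorting keys before hashing makes the comparison insensitive to insertion order, while any difference in stored values changes the hash and yields a score of 0.0.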
To capture both average and consistent performance, the authors aggregate these metrics into pass@1, pass@k, and pass^k scores. A conversation passes a dimension only if every metric m meets its specific threshold τ_m. For example, passing accuracy requires task completion equal to 1.0, faithfulness ≥0.5, and speech fidelity ≥0.95. The pass@1 metric measures average performance across trials, pass@k measures ceiling performance by checking if at least one of k trials passes, and pass^k measures reliability by calculating the probability that the system passes all k independent trials.
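The thresholding and aggregation can be sketched as follows. The threshold values are the ones quoted above; everything else is an assumption made for illustration: each scenario is taken to be run exactly k times, pass@k is computed as an "any trial passed" indicator, and pass^k is estimated as the per-scenario pass rate raised to the power k. The authors' exact estimators may differ (for example, a strict "all k trials passed" indicator).

```python
from typing import Dict, List

# Accuracy thresholds as quoted in the text.
EVA_A_THRESHOLDS: Dict[str, float] = {
    "task_completion": 1.0,
    "faithfulness": 0.5,
    "speech_fidelity": 0.95,
}


def passes_dimension(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """A trial passes a dimension only if every metric meets its threshold."""
    return all(scores[name] >= tau for name, tau in thresholds.items())


def aggregate(trial_passes: List[List[bool]]) -> Dict[str, float]:
    """Aggregate per-scenario pass flags; each inner list holds k trials."""
    n = len(trial_passes)
    rates = [sum(t) / len(t) for t in trial_passes]  # per-scenario pass rate
    return {
        "pass@1": sum(rates) / n,
        # Ceiling: at least one of the k trials passed.
        "pass@k": sum(1.0 for t in trial_passes if any(t)) / n,
        # Reliability (assumed estimator): pass rate to the power k.
        "pass^k": sum(r ** len(t) for r, t in zip(rates, trial_passes)) / n,
    }


# Three scenarios with k = 4 trials each.
results = aggregate([
    [True, True, False, False],
    [False, False, False, False],
    [True, True, True, True],
])
```

In this toy example a scenario that passes half of its trials counts fully toward pass@k but contributes very little to pass^k, which is the kind of gap between peak and reliable performance the benchmark is designed to expose.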
Experiment
The evaluation compared 12 voice agent systems (cascade, hybrid, and speech-to-speech) under clean and perturbed conditions (accented speech, background noise, and both) using EVA-Bench, with reliability checks confirming that metric scores reflect genuine behavioral differences rather than judge noise. No system jointly excels in accuracy and experience: cascade architectures show a clear accuracy-experience trade-off, while speech-to-speech models lead in turn-taking but suffer experience degradation under noise. Robustness analysis shows divergent failure modes: cascade systems are vulnerable to accuracy drops from accented speech and noise, while speech-to-speech systems fail primarily under noise. Failure analysis identifies transcription accuracy as a key bottleneck for cascade task completion and shows that faithfulness issues are decoupled from task success, underscoring the need for independent evaluation dimensions.
EVA-Bench is the only evaluation framework that combines live multi-turn simulation, realistic audio, and comprehensive metrics across both speech-to-speech and cascade architectures, enabling detailed failure analysis. Experiments show that cascade systems exhibit highly variable robustness, while speech-to-speech systems lead in turn-taking but lag in policy adherence. Turn-taking is the most perturbation-sensitive metric, with the vast majority of measurements showing significant degradation, underscoring the fragility of conversational timing. Faithfulness failures are pervasive: even when tasks are completed, over two-thirds of conversations contain policy deviations or hallucinations, motivating independent faithfulness evaluation. Transcription accuracy on key entities is a strong predictor of task completion for cascade systems, with below-threshold accuracy leading to substantially lower success rates.
No system clears 0.5 on both accuracy and experience pass@1 simultaneously, and only GPT-Realtime-1.5 clears 0.4 on both, with EVA-A and EVA-X pass@1 scores of 0.47 and 0.57. Turn-taking scores separate cleanly by architecture: speech-to-speech systems achieve 0.82–0.83, while cascades span 0.28–0.58. Cascade systems display a clear accuracy-experience trade-off: the three most accurate cascades have tool-call turn latencies exceeding 5 s, whereas the two faster cascades stay below 2.7 s but with lower accuracy. No cascade system exceeds 0.25 on both accuracy and experience, and confidence intervals do not overlap. Peak pass rates (pass@k) substantially overstate reliability: the median gap between pass@k and pass^k, which requires five successive correct trials, is 0.44 on accuracy across all systems, so peak scores overestimate deployment readiness. Speech fidelity remains high (mean ≥0.954), but failures predominantly involve alphanumeric entity mispronunciations.
EVA-Bench is a live multi-turn evaluation framework with realistic audio that compares speech-to-speech and cascade architectures. Experiments reveal that cascade systems have variable robustness, with transcription accuracy on key entities predicting task completion, while speech-to-speech models excel at turn-taking but often deviate from policy. Faithfulness failures are common even in successful conversations, turn-taking is highly sensitive to perturbations, and no system simultaneously achieves high accuracy and experience; cascade systems face a trade-off between accuracy and latency, and single-trial scores overstate deployment readiness.