EVA-Bench 2.0 Released
ServiceNow AI has released EVA-Bench Data 2.0, a major expansion of its voice agent evaluation framework. The updated benchmark covers three enterprise domains: Airline Customer Service Management, Enterprise IT Service Management, and Healthcare Human Resources Service Delivery. Together they span 213 distinct scenarios across 121 integrated tools, a fourfold increase in scenario coverage over the initial release. Each scenario was validated against frontier language models, including OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6, so that it is both challenging and fair as a performance assessment.
The benchmark architecture addresses a critical industry need: voice agents frequently fail because of domain-specific complexities rather than general capability deficits. To keep each scenario internally consistent, the framework uses the SyGraph generation pipeline to jointly construct three interdependent components per scenario: a structured user goal defined as a decision tree, an initial backend database state, and a ground-truth expected final state. Generating the three together prevents the silent inconsistencies between them that would otherwise corrupt evaluation signals. Every scenario is designed with a single correct resolution path, and a multi-stage validation loop verifies policy alignment, authentication flows, and tool executor functionality before release.
Five core principles guided the dataset construction. The focus is strictly voice-first, keeping only workflows that reflect actual telephone interactions. Realism comes from modeling tool schemas after production APIs and grounding policies in authentic enterprise constraints. Variety is enforced through single-intent, multi-intent, and adversarial call types, including unsatisfiable goals that test model resilience. Authentication mechanisms are calibrated to each task's requirements rather than applied uniformly. Finally, reproducibility is guaranteed through deterministic user simulation and explicit edge-case handling.
Alongside the new enterprise domains, the release introduces a preview of multilingual support. Because transcription accuracy and conversational fluency degrade unpredictably across languages, the upcoming update will adapt the entire evaluation pipeline, including localized names, addresses, and cultural context, and will update the judging metrics to match.
EVA-Bench Data 2.0 is fully open-source under the MIT license. Researchers and developers can access the complete dataset, evaluation framework, and leaderboard via standard repository platforms, with direct loading commands provided for immediate use. The framework enables comprehensive bot-to-bot testing and offers a standardized methodology for auditing voice agent reliability across complex enterprise environments.
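To make the three jointly generated scenario components more concrete, here is a minimal Python sketch. It is not the EVA-Bench schema or the SyGraph output format: the class names (`GoalNode`, `Scenario`), the `state_diff` helper, the `rebook` stand-in tool, and the sample booking data are all hypothetical, chosen only to illustrate how a decision-tree user goal, an initial database state, and an expected final state could fit together, and how a run might be scored by comparing the resulting state with the expected one.

```python
import copy
from dataclasses import dataclass, field
from typing import Any

# Hypothetical layout: table name -> record id -> record fields.
State = dict[str, dict[str, Any]]


class _Missing:
    """Sentinel for a record that exists on one side only."""

    def __repr__(self) -> str:
        return "<missing>"


MISSING = _Missing()


@dataclass
class GoalNode:
    """One node of the user's goal decision tree (assumed shape)."""
    intent: str
    # Maps an outcome label reported by the agent to the user's next step.
    branches: dict[str, "GoalNode"] = field(default_factory=dict)


@dataclass
class Scenario:
    """The three interdependent components of one scenario."""
    scenario_id: str
    user_goal: GoalNode
    initial_state: State
    expected_final_state: State


def state_diff(actual: State, expected: State) -> list[str]:
    """List every record that differs between the actual and expected state."""
    problems = []
    for table in sorted(set(actual) | set(expected)):
        got, want = actual.get(table, {}), expected.get(table, {})
        for record_id in sorted(set(got) | set(want)):
            g = got.get(record_id, MISSING)
            w = want.get(record_id, MISSING)
            if g != w:
                problems.append(f"{table}/{record_id}: expected {w!r}, got {g!r}")
    return problems


def rebook(state: State, booking_id: str, new_flight: str) -> None:
    """Stand-in for a backend tool call that mutates the database."""
    state["bookings"][booking_id]["flight"] = new_flight


def demo_scenario() -> Scenario:
    """Build one made-up airline scenario with consistent components."""
    goal = GoalNode(
        intent="rebook onto a later flight",
        branches={"no_seats": GoalNode(intent="ask for a refund")},
    )
    initial = {"bookings": {"B1": {"flight": "XX100", "status": "confirmed"}}}
    expected = {"bookings": {"B1": {"flight": "XX200", "status": "confirmed"}}}
    return Scenario("demo-001", goal, initial, expected)


if __name__ == "__main__":
    scenario = demo_scenario()
    # Each run starts from a fresh copy so the scenario itself stays unchanged.
    state = copy.deepcopy(scenario.initial_state)
    rebook(state, "B1", "XX200")
    print(state_diff(state, scenario.expected_final_state))
```

In this sketch an empty list from `state_diff` means the run reached the expected final state. Keeping the goal, the initial state, and the expected state in one object mirrors the idea of generating them jointly: if any one were edited in isolation, the comparison would start reporting mismatches.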
