Code-Switch ASR Benchmark
Enterprise voice agents increasingly serve bilingual populations, yet automatic speech recognition remains vulnerable to code-switching, where speakers alternate languages mid-sentence. To address this operational gap, developers of the AU-Harness evaluation framework engineered a specialized benchmark and dataset. The study targets four high-demand enterprise language configurations: Spanish-English, French-English, Canadian French-English, and German-English. The corpus aggregates thousands of synthesized utterances derived from human resources and IT service management workflows, carefully filtered to preserve natural speech patterns and control for entity dominance. The evaluation tracks three primary metrics: Word Error Rate for baseline transcription accuracy, Semantic Word Error Rate for meaning preservation, and Answer Error Rate to measure downstream task reliability. Seven leading ASR systems were assessed, spanning frontier large audio language models, proprietary cloud APIs, and open-source architectures. Results indicate that transcription performance varies significantly by language pair and model architecture. ElevenLabs Scribe V2, Google Gemini 3 Flash, and AssemblyAI Universal 3-Pro emerged as the market leaders, maintaining the lowest error rates across all measured dimensions. The analysis quantified the operational cost of code-switching by comparing ASR outputs against monolingual baselines. Top-tier systems incurred minimal degradation, with Scribe V2 even surpassing its native-language baseline, demonstrating strong architectural robustness. Conversely, legacy or narrowly trained models experienced substantial performance drops. Statistical error analysis reveals that language-switch frequency primarily determines the probability of transcription failure, while lexical mixing density dictates error severity. Counterintuitively, mistakes concentrate on English segments embedded within non-English matrix utterances, likely due to abrupt phonological shifts, embedded technical terminology, or mid-utterance context adaptation challenges. Downstream impact assessments confirm a tight correlation between surface accuracy and functional reliability, proving that precise transcription directly preserves task-critical details like ticket identifiers and policy inquiries. However, certain mid-tier systems exhibited a dangerous divergence between raw accuracy and functional comprehension, where minor semantic errors severely compromised downstream question-answering tasks. Organizations deploying voice infrastructure must treat code-switching as a core reliability requirement rather than a peripheral edge case. The benchmark demonstrates that frontier ASR capabilities have matured sufficiently to support natural bilingual communication with negligible friction. Nevertheless, performance remains highly context-dependent. Enterprise architects should rigorously benchmark specific language combinations relevant to their customer demographics before production deployment, as optimal routing for Spanish-English workflows may underperform on German-English call volumes. Strategic model selection and continuous multilingual validation will remain critical for maintaining transcription integrity in global contact centers.
