Abstract

Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.
