WenetSpeech Yue Cantonese Corpus Dataset

Date

11 days ago

Organization

AISHELL
China Telecom
Northwestern Polytechnical University

Publish URL

huggingface.co

Paper URL

2509.03959

License

Non-commercial use

WenetSpeech Yue is a large-scale Cantonese speech corpus with multi-dimensional annotation, intended for automatic speech recognition (ASR) and text-to-speech synthesis (TTS). It was released in 2025 by Northwestern Polytechnical University, the China Telecom Artificial Intelligence Research Institute, AISHELL, and other institutions. The accompanying paper is "WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation". The corpus aims to address the scarcity of Cantonese speech resources and to support the training and evaluation of high-quality Cantonese models.

The dataset contains approximately 21,800 hours of Cantonese recordings covering 10 domains: storytelling, entertainment, drama, culture, Vlog, commentary, education, podcasts, news, and others. It is suitable for training and evaluating Cantonese ASR and TTS models on the diverse domains and speaking styles found in real-world speech, and it also supports the evaluation of cross-domain generalization.

Data composition:

  • Transcribed text: automatic speech recognition results;
  • Confidence scores: such as text confidence and Cantonese pinyin confidence;
  • Speaker attributes: gender, age, speaker ID;
  • Audio quality metrics: such as SNR and DNSMOS;
  • Timing annotations: duration, character-level timestamps;
  • Extended metadata: program name, region, link, and register information.
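
As an illustration of how these annotations might be used together, the sketch below keeps only the utterances whose transcript confidence and audio quality exceed chosen thresholds, for example to build a cleaner subset for TTS training. The field names (`text_confidence`, `snr_db`, `dnsmos`, `duration_s`), the threshold values, and the sample records are all assumptions made for this example; they are not taken from the released metadata, so check the actual schema before adapting it.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Utterance:
    # All field names are hypothetical; the released metadata may differ.
    text: str
    text_confidence: float  # confidence of the ASR transcript, assumed in [0, 1]
    snr_db: float           # signal-to-noise ratio, assumed in dB
    dnsmos: float           # DNSMOS quality score
    duration_s: float       # utterance duration in seconds


def select_clean(utts: Iterable[Utterance],
                 min_conf: float = 0.9,
                 min_snr_db: float = 20.0,
                 min_dnsmos: float = 3.0) -> List[Utterance]:
    """Keep utterances that meet every (illustrative) quality threshold."""
    return [u for u in utts
            if u.text_confidence >= min_conf
            and u.snr_db >= min_snr_db
            and u.dnsmos >= min_dnsmos]


def total_hours(utts: Iterable[Utterance]) -> float:
    """Sum durations and convert seconds to hours."""
    return sum(u.duration_s for u in utts) / 3600.0


# Made-up records, used only to exercise the filter.
sample = [
    Utterance("你好", 0.97, 28.0, 3.4, 1800.0),
    Utterance("早晨", 0.62, 30.0, 3.6, 1800.0),  # low transcript confidence
    Utterance("多謝", 0.95, 8.0, 2.1, 3600.0),   # noisy audio
]
clean = select_clean(sample)
```

Stricter thresholds trade hours for quality, so a high-quality TTS subset would typically be much smaller than the data usable for ASR training.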