AISHELL-4 Multi-channel Chinese Conference Speech Database
AISHELL-4 is a large-scale, real-recorded Mandarin speech dataset collected with an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded conference sessions, each containing 4 to 8 speakers, with a total duration of 120 hours. It aims to bridge advanced research on multi-speaker processing and practical application scenarios in three respects. First, because the meetings are real recordings, AISHELL-4 provides realistic acoustics and rich natural conversational speech characteristics, such as short pauses, speech overlap, rapid speaker turns, and noise. Second, AISHELL-4 provides accurate transcriptions and speaker voice activity for each meeting, which enables researchers to explore different aspects of conference processing, from individual tasks such as speech front-end processing, speech recognition, and speaker diarization to multimodal modeling and joint optimization of related tasks. Third, the research team released a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field.
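As an illustration of how per-meeting speaker voice activity can be used, the following minimal sketch measures how much of a session contains overlapped speech. It is not part of the released baseline. The `Segment` tuple layout, the `overlap_stats` function, and the toy intervals are all hypothetical, and the sketch assumes the annotations have already been parsed into (speaker, start, end) triples in seconds.

```python
from typing import List, Tuple

# Hypothetical in-memory layout: (speaker_id, start_sec, end_sec).
Segment = Tuple[str, float, float]


def overlap_stats(segments: List[Segment]) -> Tuple[float, float]:
    """Return (speech_seconds, overlapped_seconds) for one session.

    speech_seconds: time during which at least one speaker is active.
    overlapped_seconds: time during which two or more segments are active.
    Assumes a single speaker's own segments do not overlap each other.
    """
    # Sweep line: +1 at each segment start, -1 at each segment end.
    events = []
    for _speaker, start, end in segments:
        if end <= start:
            continue  # skip empty or malformed segments
        events.append((start, 1))
        events.append((end, -1))
    # Sort by time; at equal times, ends (-1) are processed before starts (+1),
    # so segments that merely touch are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    speech = 0.0
    overlapped = 0.0
    active = 0
    prev_time = 0.0
    for time, delta in events:
        span = time - prev_time
        if active >= 1:
            speech += span
        if active >= 2:
            overlapped += span
        active += delta
        prev_time = time
    return speech, overlapped


# Toy intervals for illustration only; these are not taken from AISHELL-4.
toy_session = [("spk1", 0.0, 5.0), ("spk2", 4.0, 8.0), ("spk3", 10.0, 12.0)]
speech_sec, overlap_sec = overlap_stats(toy_session)
print(f"speech: {speech_sec:.1f} s, overlapped: {overlap_sec:.1f} s")
```

Dividing the overlapped time by the speech time gives a per-session overlap ratio, which is one simple way to compare how conversational different sessions are.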