
Vietnamese end-to-end speech recognition using wav2vec 2.0

Thai Binh Nguyen
Abstract

Our models are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled data from the VLSP ASR dataset, using speech audio sampled at 16 kHz. We use the wav2vec 2.0 architecture for the pre-trained model. In the fine-tuning phase, wav2vec 2.0 is trained with Connectionist Temporal Classification (CTC), an algorithm for training neural networks on sequence-to-sequence problems, used mainly in automatic speech recognition and handwriting recognition. On the VIVOS dataset, we achieved a WER of 6.15.
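To make the two technical terms in the abstract concrete, the following is a minimal sketch, not the authors' code. It shows the standard CTC greedy decoding rule (merge consecutive repeated labels, then drop the blank symbol) and a word error rate (WER) computed as word-level edit distance divided by the number of reference words. The function names, the choice of `0` as the blank index, and the example inputs are assumptions made for illustration only.

```python
from typing import List, Sequence


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int = 0) -> List[int]:
    """Collapse a per-frame label path into an output sequence (CTC rule).

    blank_id=0 is an assumption for this sketch; real vocabularies differ.
    """
    output: List[int] = []
    previous = None
    for token_id in frame_ids:
        # Keep a label only if it differs from the previous frame
        # (merging repeats) and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            output.append(token_id)
        previous = token_id
    return output


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row of the table at a time.
    prev_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev_row[j - 1] + (ref_word != hyp_word)
            deletion = prev_row[j] + 1
            insertion = row[j - 1] + 1
            row.append(min(substitution, deletion, insertion))
        prev_row = row
    return prev_row[-1] / len(ref)


if __name__ == "__main__":
    # A blank between two identical labels keeps both of them.
    print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0]))
    # One substituted word out of four reference words.
    print(word_error_rate("xin chào việt nam", "xin chào viet nam"))
```

`word_error_rate` returns a fraction, whereas WER is commonly reported as a percentage; multiply by 100 when comparing against scores reported that way.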
