Automatic Speech Recognition
Automatic speech recognition (ASR) is a technology that converts human speech into text. Because speech signals are diverse and complex, current speech recognition systems achieve satisfactory performance only under certain restrictions, that is, only in specific application scenarios.
Automatic speech recognition definition
The goal of automatic speech recognition technology is to enable computers to take dictation of continuous speech spoken by different people, which is commonly known as a "speech dictation machine". It is a technology that converts "sound" into "text".
Performance Influencing Factors
The performance of a speech recognition system generally depends on the following four factors:
- The size of the recognition vocabulary and the complexity of the speech;
- The quality of the speech signal;
- Whether there is a single speaker or multiple speakers;
- Hardware.
Automatic speech recognition classification
Automatic speech recognition is usually classified in the following ways:
- According to the intended user, it can be divided into speaker-dependent and speaker-independent recognition systems;
- According to the vocabulary size, it can be divided into small-vocabulary, medium-vocabulary, and large-vocabulary systems;
- According to the speech input mode, it can be divided into isolated-word, connected-word, and continuous-speech systems, etc.;
- According to the speaking style of the input speech, it can be divided into read-speech and spontaneous (naturally pronounced) speech systems;
- According to the dialect background of the input speech, it can be divided into standard Mandarin, Mandarin with a dialect accent, and dialect speech recognition systems;
- According to the emotional state of the input speech, it can be divided into neutral-speech and emotional-speech recognition systems.
Automatic speech recognition model
Most mainstream large-vocabulary speech recognition systems use statistical pattern recognition techniques. A typical speech recognition system based on statistical pattern recognition consists of the following basic modules:
- Signal processing and feature extraction module: extracts features from the input signal for the acoustic model to process. It generally also applies signal processing techniques to minimize the impact of environmental noise, channel distortion, speaker variation, and other factors on the features.
- Acoustic model: most typical systems model acoustics with first-order hidden Markov models (HMMs).
- Pronunciation dictionary: contains the vocabulary the system can handle together with its pronunciations. In effect, the pronunciation dictionary provides the mapping between acoustic-model modeling units and language-model modeling units.
- Language model: models the language the system targets. In theory, any language model, including regular languages and context-free grammars, could be used, but current systems generally use statistical N-gram models and their variants.
- Decoder: one of the core components of a speech recognition system. Given the acoustic model, language model, and pronunciation dictionary, its task is to find the word string that produces the input signal with the highest probability.
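To make the feature extraction module concrete, here is a minimal NumPy sketch of a common pipeline (framing, windowing, power spectrum, log-mel filterbank). The specific parameters (25 ms frames, 10 ms hop, 26 filters, 512-point FFT) are typical defaults assumed for illustration, not taken from the text above.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(signal, sr=16000, frame_ms=25, hop_ms=10,
                     n_filters=26, n_fft=512):
    # Split the waveform into overlapping frames and apply a Hamming window.
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log compression stabilises the dynamic range of the filter outputs.
    return np.log(power @ fbank.T + 1e-10)

# Example: features for one second of a 440 Hz tone.
t = np.arange(16000) / 16000.0
feats = log_mel_features(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (n_frames, n_filters)
```

Real systems would typically add a DCT step (yielding MFCCs) and noise-robustness techniques such as mean normalization, per the module description above.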
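The statistical N-gram language model mentioned above can be illustrated with a tiny bigram model. This is a hedged sketch: the toy corpus, the `<s>`/`</s>` sentence markers, and add-k smoothing are illustrative assumptions, not details from the source.

```python
from collections import Counter

def train_bigram(corpus):
    # Count unigrams and bigrams over sentences padded with <s> / </s>.
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def bigram_prob(uni, bi, w1, w2, vocab_size, k=1.0):
    # Add-k smoothing gives unseen bigrams a small non-zero probability.
    return (bi[(w1, w2)] + k) / (uni[w1] + k * vocab_size)

corpus = ["turn the light on", "turn the light off", "turn on the radio"]
uni, bi = train_bigram(corpus)
V = len(uni)  # vocabulary size, including the sentence markers
p = bigram_prob(uni, bi, "turn", "the", V)
print(round(p, 3))  # P(the | turn) under add-one smoothing
```

A decoder would combine such probabilities with acoustic scores to prefer word strings that are both acoustically likely and linguistically plausible.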
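The decoder's search for the most probable state sequence through a first-order HMM is classically done with the Viterbi algorithm. The sketch below runs it on a hypothetical two-state model (the states, transition, and emission probabilities are made-up toy values, not from the text); real decoders search over word-level graphs built from the dictionary and language model.

```python
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    # obs: sequence of observation indices; all model scores in log space.
    T, N = len(obs), log_init.shape[0]
    delta = np.full((T, N), -np.inf)   # best log-score ending in each state
    back = np.zeros((T, N), dtype=int) # backpointers for path recovery
    delta[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state model ("silence" = 0, "speech" = 1) with 2 observation symbols.
log = np.log
init = log(np.array([0.6, 0.4]))
trans = log(np.array([[0.7, 0.3],
                      [0.4, 0.6]]))
emit = log(np.array([[0.9, 0.1],    # silence mostly emits symbol 0
                     [0.2, 0.8]]))  # speech mostly emits symbol 1
print(viterbi([0, 0, 1, 1], init, trans, emit))  # [0, 0, 1, 1]
```

Working in log space, as here, avoids the numerical underflow that multiplying many small probabilities would cause over long utterances.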