HyperAI超神经

Lipreading on LRS3-TED

Evaluation Metrics

Word Error Rate (WER)
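For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model's hypothesis (substitutions + deletions + insertions), divided by the number of reference words. Lower is better. The sketch below is a minimal illustration, not the scoring script used by any of the listed papers. The function name `word_error_rate` is hypothetical, it assumes whitespace tokenization with no text normalization, and it returns a percentage on the assumption that the leaderboard values are percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Assumption: words are split on whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.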

Evaluation Results

Performance of each model on this benchmark

| Model Name | Word Error Rate (WER) | Paper Title | Repository |
|---|---|---|---|
| SyncVSR | 31.2 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | |
| SyncVSR | 21.5 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | |
| VTP (more data) | 30.7 | Sub-word Level Lip Reading With Visual Attention | - |
| Auto-AVSR | 19.1 | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | |
| VSP-LLM | 25.4 | Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | |
| CTC/Attention (LRW+LRS2/3+AVSpeech) | 31.5 | Visual Speech Recognition for Multiple Languages in the Wild | |
| AV-HuBERT Large | 26.9 | Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction | |
| Hyb + Conformer | 43.3 | End-to-end Audio-visual Speech Recognition with Conformers | |
| DistillAV | 26.2 | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | |
| CTC-V2P | 55.1 | Large-Scale Visual Speech Recognition | - |
| EG-seq2seq | 57.8 | Discriminative Multi-modality Speech Recognition | |
| LP + Conformer | 12.8 | Conformers are All You Need for Visual Speech Recognition | - |
| Conv-seq2seq | 60.1 | Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading | - |
| CTC + KD | 59.8 | ASR is all you need: cross-modal distillation for lip reading | - |
| ES³ Large | 37.1 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| RAVEn Large | 23.4 | Jointly Learning Visual and Auditory Speech Representations from Raw Data | |
| AV-HuBERT Large + Relaxed Attention + LM | 25.51 | Relaxed Attention for Transformer Models | |
| USR (self + semi-supervised) | 21.5 | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | |
| RNN-T | 33.6 | Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | |
| ES³ Base | 40.3 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |