Lipreading on LRS3-TED
Metrics
Word Error Rate (WER), reported in percent. Lower is better.
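WER is the word-level edit distance between a model's hypothesis and the reference transcript, normalized by the number of reference words: WER = (S + D + I) / N, where S, D, and I count substituted, deleted, and inserted words. Below is a minimal sketch of the computation in Python; the function name and the example sentences are illustrative, not taken from the benchmark.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between hypothesis
    and reference, divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # match or substitution
            dele = dp[i - 1][j] + 1  # deletion
            ins = dp[i][j - 1] + 1   # insertion
            dp[i][j] = min(sub, dele, ins)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one deletion ("the") and one insertion ("please") over 5 reference words
assert wer("set the lights to blue", "set lights to blue please") == 0.4
```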
Results
Performance results of various models on this benchmark, sorted by WER (lower is better). Repository links on the source page did not survive extraction and are omitted.

| Model Name | WER (%) | Paper Title |
|---|---|---|
| LP + Conformer | 12.8 | Conformers are All You Need for Visual Speech Recognition |
| Auto-AVSR | 19.1 | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels |
| SyncVSR | 21.5 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization |
| USR (self + semi-supervised) | 21.5 | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs |
| RAVEn Large | 23.4 | Jointly Learning Visual and Auditory Speech Representations from Raw Data |
| VSP-LLM | 25.4 | Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing |
| AV-HuBERT Large + Relaxed Attention + LM | 25.51 | Relaxed Attention for Transformer Models |
| DistillAV | 26.2 | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models |
| AV-HuBERT Large | 26.9 | Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction |
| VTP (more data) | 30.7 | Sub-word Level Lip Reading With Visual Attention |
| SyncVSR | 31.2 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization |
| CTC/Attention (LRW+LRS2/3+AVSpeech) | 31.5 | Visual Speech Recognition for Multiple Languages in the Wild |
| RNN-T | 33.6 | Recurrent Neural Network Transducer for Audio-Visual Speech Recognition |
| ES³ Large | 37.1 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations |
| ES³ Base | 40.3 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations |
| Hyb + Conformer | 43.3 | End-to-end Audio-visual Speech Recognition with Conformers |
| CTC-V2P | 55.1 | Large-Scale Visual Speech Recognition |
| EG-seq2seq | 57.8 | Discriminative Multi-modality Speech Recognition |
| CTC + KD | 59.8 | ASR is all you need: cross-modal distillation for lip reading |
| Conv-seq2seq | 60.1 | Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading |