HyperAI
Lipreading on LRS3-TED
Metrics

Word Error Rate (WER), in percent (lower is better)
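WER is the word-level edit distance between the model's transcript and the reference, divided by the number of reference words. A minimal illustrative implementation (not the scoring code used by this leaderboard) might look like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

A leaderboard entry of 12.8 therefore means that, on average, 12.8 errors (substitutions, deletions, or insertions) occur per 100 reference words.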
Results

Performance results of various models on this benchmark:

| Model Name | Word Error Rate (WER) | Paper Title | Repository |
| --- | --- | --- | --- |
| SyncVSR | 31.2 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | - |
| SyncVSR | 21.5 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | - |
| VTP (more data) | 30.7 | Sub-word Level Lip Reading With Visual Attention | - |
| Auto-AVSR | 19.1 | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | - |
| VSP-LLM | 25.4 | Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | - |
| CTC/Attention (LRW+LRS2/3+AVSpeech) | 31.5 | Visual Speech Recognition for Multiple Languages in the Wild | - |
| AV-HuBERT Large | 26.9 | Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction | - |
| Hyb + Conformer | 43.3 | End-to-end Audio-visual Speech Recognition with Conformers | - |
| DistillAV | 26.2 | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | - |
| CTC-V2P | 55.1 | Large-Scale Visual Speech Recognition | - |
| EG-seq2seq | 57.8 | Discriminative Multi-modality Speech Recognition | - |
| LP + Conformer | 12.8 | Conformers are All You Need for Visual Speech Recognition | - |
| Conv-seq2seq | 60.1 | Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading | - |
| CTC + KD | 59.8 | ASR is all you need: cross-modal distillation for lip reading | - |
| ES³ Large | 37.1 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| RAVEn Large | 23.4 | Jointly Learning Visual and Auditory Speech Representations from Raw Data | - |
| AV-HuBERT Large + Relaxed Attention + LM | 25.51 | Relaxed Attention for Transformer Models | - |
| USR (self + semi-supervised) | 21.5 | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | - |
| RNN-T | 33.6 | Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | - |
| ES³ Base | 40.3 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |