HyperAI
Lipreading on LRS2
Metrics
Word Error Rate (WER)
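For reference, WER is conventionally defined as the word-level edit distance (substitutions + deletions + insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The scores on this page appear to be percentages, so lower is better. The sketch below is a minimal illustration of that standard definition, not the evaluation script used by any of the listed papers; the function name `word_error_rate` and the whitespace tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length.

    Illustrative only: tokenization is plain whitespace splitting, which
    is an assumption; real evaluations usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.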
Results
Performance results of various models on this benchmark
| Model Name | Word Error Rate (WER) | Paper Title | Repository |
|---|---|---|---|
| ES³ Large | 26.7 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| ES³ Base* | 31.4 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| CTC + KD ASR | 53.2 | ASR is all you need: cross-modal distillation for lip reading | - |
| USR | 15.4 | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | - |
| VTP (more data) | 22.6 | Sub-word Level Lip Reading With Visual Attention | - |
| LF-MMI TDNN | 48.86 | Audio-visual Recognition of Overlapped speech for the LRS2 dataset | - |
| CTC/Attention (LRW+LRS2/3+AVSpeech) | 25.5 | Visual Speech Recognition for Multiple Languages in the Wild | - |
| Multi-head Visual-Audio Memory | 44.5 | Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading | - |
| ES³ Large + extLM | 24.6 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| SyncVSR | 28.9 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | - |
| MoCo + wav2vec (w/o extLM) | 43.2 | Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | - |
| ES³ Base + extLM | 28.7 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| TM-seq2seq + extLM | 48.3 | Deep Audio-Visual Speech Recognition | - |
| Hybrid CTC / Attention | 39.1 | End-to-end Audio-visual Speech Recognition with Conformers | - |
| Conv-seq2seq | 51.7 | Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading | - |
| Auto-AVSR | 14.6 | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | - |
| TM-CTC + extLM | 54.7 | Deep Audio-Visual Speech Recognition | - |
| SyncVSR | 16.5 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | - |
| RAVEn Large | 18.6 | Jointly Learning Visual and Auditory Speech Representations from Raw Data | - |
| ES³ Base* + extLM | 29.3 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |