HyperAI초신경

Lipreading on LRS2

Evaluation Metric

Word Error Rate (WER)
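
WER is conventionally defined as (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference transcript into the hypothesis, and N is the number of reference words. Lower is better. The following is a minimal sketch of that computation, not the scoring script used by any of the listed papers. It assumes whitespace tokenization with no text normalization, and reports the result as a percentage; the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: 100 * (S + D + I) / N.

    Assumption: words are separated by whitespace and compared exactly
    (no lowercasing or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.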

Evaluation Results

Performance results of each model on this benchmark

| Model | Word Error Rate (WER) | Paper Title | Repository |
|---|---|---|---|
| ES³ Large | 26.7 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| ES³ Base* | 31.4 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| CTC + KD ASR | 53.2 | ASR is all you need: cross-modal distillation for lip reading | - |
| USR | 15.4 | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | |
| VTP (more data) | 22.6 | Sub-word Level Lip Reading With Visual Attention | - |
| LF-MMI TDNN | 48.86 | Audio-visual Recognition of Overlapped speech for the LRS2 dataset | - |
| CTC/Attention (LRW+LRS2/3+AVSpeech) | 25.5 | Visual Speech Recognition for Multiple Languages in the Wild | |
| Multi-head Visual-Audio Memory | 44.5 | Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading | |
| ES³ Large + extLM | 24.6 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| SyncVSR | 28.9 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | |
| MoCo + wav2vec (w/o extLM) | 43.2 | Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | |
| ES³ Base + extLM | 28.7 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| TM-seq2seq + extLM | 48.3 | Deep Audio-Visual Speech Recognition | |
| Hybrid CTC / Attention | 39.1 | End-to-end Audio-visual Speech Recognition with Conformers | |
| Conv-seq2seq | 51.7 | Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading | - |
| Auto-AVSR | 14.6 | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | |
| TM-CTC + extLM | 54.7 | Deep Audio-Visual Speech Recognition | |
| SyncVSR | 16.5 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | |
| RAVEn Large | 18.6 | Jointly Learning Visual and Auditory Speech Representations from Raw Data | |
| ES³ Base* + extLM | 29.3 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |