Lipreading on LRS3-TED

Evaluation Metric

Word Error Rate (WER)
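
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions + deletions + insertions) between a hypothesis and its reference transcript, divided by the number of reference words. The following is a minimal sketch under stated assumptions, not the scoring script used by any of the listed papers: the function name `word_error_rate` is hypothetical, tokenization is plain whitespace splitting, no text normalization (casing, punctuation) is applied, and the result is returned as a percentage to match how scores are commonly reported.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one reference/hypothesis pair.

    Assumption: words are whitespace-separated and compared verbatim.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Corpus-level WER is usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance scores, so this per-pair function would need that aggregation to produce a test-set figure.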

Evaluation Results

Performance of each model on this benchmark

Model Name	Word Error Rate (WER, %)	Paper Title	Repository
SyncVSR	31.2	SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization
SyncVSR	21.5	SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization
VTP (more data)	30.7	Sub-word Level Lip Reading With Visual Attention	-
Auto-AVSR	19.1	Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
VSP-LLM	25.4	Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
CTC/Attention (LRW+LRS2/3+AVSpeech)	31.5	Visual Speech Recognition for Multiple Languages in the Wild
AV-HuBERT Large	26.9	Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
Hyb + Conformer	43.3	End-to-end Audio-visual Speech Recognition with Conformers
DistillAV	26.2	Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models
CTC-V2P	55.1	Large-Scale Visual Speech Recognition	-
EG-seq2seq	57.8	Discriminative Multi-modality Speech Recognition
LP + Conformer	12.8	Conformers are All You Need for Visual Speech Recognition	-
Conv-seq2seq	60.1	Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading	-
CTC + KD	59.8	ASR is all you need: cross-modal distillation for lip reading	-
ES³ Large	37.1	ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations	-
RAVEn Large	23.4	Jointly Learning Visual and Auditory Speech Representations from Raw Data
AV-HuBERT Large + Relaxed Attention + LM	25.51	Relaxed Attention for Transformer Models
USR (self + semi-supervised)	21.5	Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
RNN-T	33.6	Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
ES³ Base	40.3	ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations	-
