HyperAI
Lipreading on LRS3-TED
Metrics
Word Error Rate (WER, %; lower is better)
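WER counts the word substitutions (S), deletions (D) and insertions (I) needed to turn a reference transcript into the model's hypothesis, divided by the number of reference words N: WER = (S + D + I) / N × 100. The sketch below computes it with a word-level Levenshtein distance. It is a minimal illustration, not the scoring script behind the numbers in this table: it assumes plain whitespace tokenisation with no text normalisation (casing, punctuation), which published LRS3 results may handle differently. The function names `word_edit_distance` and `corpus_wer` are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, deletions and insertions (S + D + I)
    needed to turn the reference word list into the hypothesis word list."""
    # prev[j] holds the edit distance between the previous ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of ref[i-1]
                curr[j - 1] + 1,     # insertion of hyp[j-1]
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER in percent: total edits divided by total reference words."""
    total_edits = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()  # assumption: plain whitespace tokenisation
        hyp_words = hypothesis.split()
        total_edits += word_edit_distance(ref_words, hyp_words)
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_edits / total_ref_words


if __name__ == "__main__":
    # "sat" -> "sit" is one substitution, the second "the" is one deletion:
    # 2 edits over 6 reference words, roughly 33.3 %.
    print(corpus_wer([("the cat sat on the mat", "the cat sit on mat")]))
```

Two design points are worth noting. Errors are pooled over the whole test set before dividing, so `[("a b", "a"), ("c d e f", "c d e f")]` scores 1/6 (about 16.7 %) rather than the 25 % a per-utterance average would give. And because insertions are counted, WER is not capped at 100 %.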
Results
Performance of different models on this benchmark
| Model name | WER (%) | Paper title | Repository |
|---|---|---|---|
| SyncVSR | 31.2 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | - |
| SyncVSR | 21.5 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | - |
| VTP (more data) | 30.7 | Sub-word Level Lip Reading With Visual Attention | - |
| Auto-AVSR | 19.1 | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | - |
| VSP-LLM | 25.4 | Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | - |
| CTC/Attention (LRW+LRS2/3+AVSpeech) | 31.5 | Visual Speech Recognition for Multiple Languages in the Wild | - |
| AV-HuBERT Large | 26.9 | Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction | - |
| Hyb + Conformer | 43.3 | End-to-end Audio-visual Speech Recognition with Conformers | - |
| DistillAV | 26.2 | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | - |
| CTC-V2P | 55.1 | Large-Scale Visual Speech Recognition | - |
| EG-seq2seq | 57.8 | Discriminative Multi-modality Speech Recognition | - |
| LP + Conformer | 12.8 | Conformers are All You Need for Visual Speech Recognition | - |
| Conv-seq2seq | 60.1 | Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading | - |
| CTC + KD | 59.8 | ASR is all you need: cross-modal distillation for lip reading | - |
| ES³ Large | 37.1 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| RAVEn Large | 23.4 | Jointly Learning Visual and Auditory Speech Representations from Raw Data | - |
| AV-HuBERT Large + Relaxed Attention + LM | 25.51 | Relaxed Attention for Transformer Models | - |
| USR (self + semi-supervised) | 21.5 | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | - |
| RNN-T | 33.6 | Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | - |
| ES³ Base | 40.3 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
Showing 20 of 23 entries.