Lipreading on LRS3-TED
Evaluation metric
Word Error Rate (WER)
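For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the scoring script used by any paper on this leaderboard. The function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and the result is returned as a percentage on the assumption that the leaderboard values are percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Lower is better: an exact transcript scores 0, and because insertions count as errors, the value can exceed 100.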
Evaluation results
Performance of each model on this benchmark
| Model name | Word Error Rate (WER) | Paper Title | Repository |
| --- | --- | --- | --- |
| SyncVSR | 31.2 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | |
| SyncVSR | 21.5 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | |
| VTP (more data) | 30.7 | Sub-word Level Lip Reading With Visual Attention | - |
| Auto-AVSR | 19.1 | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | |
| VSP-LLM | 25.4 | Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | |
| CTC/Attention (LRW+LRS2/3+AVSpeech) | 31.5 | Visual Speech Recognition for Multiple Languages in the Wild | |
| AV-HuBERT Large | 26.9 | Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction | |
| Hyb + Conformer | 43.3 | End-to-end Audio-visual Speech Recognition with Conformers | |
| DistillAV | 26.2 | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | |
| CTC-V2P | 55.1 | Large-Scale Visual Speech Recognition | - |
| EG-seq2seq | 57.8 | Discriminative Multi-modality Speech Recognition | |
| LP + Conformer | 12.8 | Conformers are All You Need for Visual Speech Recognition | - |
| Conv-seq2seq | 60.1 | Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading | - |
| CTC + KD | 59.8 | ASR is all you need: cross-modal distillation for lip reading | - |
| ES³ Large | 37.1 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| RAVEn Large | 23.4 | Jointly Learning Visual and Auditory Speech Representations from Raw Data | |
| AV-HuBERT Large + Relaxed Attention + LM | 25.51 | Relaxed Attention for Transformer Models | |
| USR (self + semi-supervised) | 21.5 | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | |
| RNN-T | 33.6 | Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | |
| ES³ Base | 40.3 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |