Audio-Visual Speech Recognition on LRS3-TED
Metrics
Word Error Rate (WER)
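WER is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the reference transcript into the hypothesis, divided by the number of words in the reference. The sketch below illustrates that standard definition with a word-level edit distance. It is not the scoring script used by any of the listed models: the function name `word_error_rate`, the whitespace tokenization, and the absence of text normalization (casing, punctuation) are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no case folding or punctuation stripping.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (row-by-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    # One deleted word out of six reference words.
    print(f"WER = {100 * word_error_rate(ref, hyp):.2f}%")
```

The function returns a fraction; multiply by 100 to express it as a percentage. Because insertions count as errors, the value can exceed 1 when the hypothesis is much longer than the reference.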
Results
Performance results of various models on this benchmark
Comparison table
Model name | Word Error Rate (WER) |
---|---|
discriminative-multi-modality-speech | 6.8 |
audio-visual-representation-learning-via | 1.3 |
deep-audio-visual-speech-recognition | 7.2 |
recurrent-neural-network-transducer-for-audio | 4.5 |
end-to-end-audio-visual-speech-recognition | 2.3 |
robust-self-supervised-audio-visual-speech | 1.4 |
large-language-models-are-strong-audio-visual | 0.77 |
auto-avsr-audio-visual-speech-recognition | 0.9 |
whisper-flamingo-integrating-visual-features | 0.76 |
jointly-learning-visual-and-auditory-speech | 1.4 |
zero-avsr-zero-shot-audio-visual-speech | 1.5 |
mms-llama-efficient-llm-based-audio-visual-1 | 0.74 |