3D Conv + ResNet-18 + MS-TCN + Multi-Head Visual-Audio Memory | 88.5 | Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading | |
3D Conv + ResNet-18 + DC-TCN + KD (Ensemble & Word Boundary) | 94.1 | Training Strategies for Improved Lip-reading | |
Multi-grained + Bi-ConvLSTM | 83.34 | Multi-Grained Spatio-temporal Modeling for Lip-reading | - |
SpotFast + Transformer + Product-Key memory | 84.4 | SpotFast Networks with Memory Augmented Lateral Transformers for Lipreading | |
3D Conv + ResNet-18 + MS-TCN | 85.30 | Lipreading using Temporal Convolutional Networks | |
Vosk + MediaPipe + LS + MixUp + SA + 3DResNet-18 + BiLSTM + Cosine WR | 88.7 | Visual Speech Recognition in a Driver Assistance System | - |
3D Conv + EfficientNetV2 + Transformer + TCN | 89.52 | Accurate and Resource-Efficient Lipreading with Efficientnetv2 and Transformers | - |
3D Conv + ResNet-34 + Bi-LSTM | 83.00 | Combining Residual Networks with LSTMs for Lipreading | |
3D-ResNet + Bi-GRU + MixUp + Label Smoothing + Cosine LR | 85.5 | Learn an Effective Lip Reading Model without Pains | |
3D Conv + ResNet-18 + Bi-GRU + Visual-Audio Memory | 85.4 | Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video | |
3D Conv + ResNet-18 + MS-TCN + KD (Ensemble) | 88.5 | Towards Practical Lipreading with Distilled and Efficient Models | |
3D Conv + ResNet-18 + Bi-GRU | 84.41 | Mutual Information Maximization for Effective Lip Reading | |
3D Conv + ResNet-34 + Bi-GRU | 83.39 | End-to-end Audiovisual Speech Recognition | |
3D-ResNet + Bi-GRU + MixUp + Label Smoothing + Cosine LR (Word Boundary) | 88.4 | Learn an Effective Lip Reading Model without Pains | |