3D Conv + ResNet-18 + Bi-GRU + Visual-Audio Memory | 50.82% | Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video | |
3D-ResNet + Bi-GRU + MixUp + Label Smoothing + Cosine LR (Word Boundary) | 55.7% | Learn an Effective Lip Reading Model without Pains | |
3D Conv + ResNet-18 + MS-TCN + Multi-Head Visual-Audio Memory | 53.8% | Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading | |
3D Conv + ResNet-18 + Bi-GRU (Face Cutout) | 45.24% | Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition | |
3D-ResNet + Bi-GRU + MixUp + Label Smoothing + Cosine LR | 48.3% | Learn an Effective Lip Reading Model without Pains | |