3D RotNet (3D ResNet-18) | 62.9 | false | Kinetics400 | Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction | - |
DPC (3D ResNet-18, Split 1) | 60.6 | false | UCF101 | Video Representation Learning by Dense Predictive Coding | |
CVRL (R3D-50; K400) | 92.2 | false | Kinetics400 | Spatiotemporal Contrastive Video Representation Learning | |
3D Cubic Puzzles (3D ResNet-18) | 65.8 | false | Kinetics400 | Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles | - |
AVID (Modified R2+1D-18 on AudioSet) | 91.0 | false | AudioSet (Audio+Video) | Audio-Visual Instance Discrimination with Cross-Modal Agreement | |
CVRL (R3D-50; K600) | 93.4 | false | Kinetics600 | Spatiotemporal Contrastive Video Representation Learning | |
CVRL (R3D-152 2x; K600) | 93.9 | false | Kinetics600 | Spatiotemporal Contrastive Video Representation Learning | |
VideoGAN (C3D) | 52.1 | false | UCF101 | Generating Videos with Scene Dynamics | - |
AVID+CMA (Modified R2+1D-18 on Kinetics) | 87.5 | false | Kinetics400 (Audio+Video) | Audio-Visual Instance Discrimination with Cross-Modal Agreement | |
AVID+CMA (Modified R2+1D-18 on AudioSet) | 91.5 | false | AudioSet (Audio+Video) | Audio-Visual Instance Discrimination with Cross-Modal Agreement | |
CrissCross (Kinetics-Sound) | 88.3 | false | Kinetics-Sound | Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity | |