IcoCap (ViT-B/32) | 56.3 | 103.8 | 38.9 | 75.0 | IcoCap: Improving Video Captioning by Compounding Images | - |
CoCap (ViT-L/14) | 60.1 | 121.5 | 41.4 | 78.2 | Accurate and Fast Compressed Video Captioning | |
VASTA (Vatex-backbone) | 59.2 | 119.7 | 40.65 | 76.7 | Diverse Video Captioning by Adaptive Spatio-temporal Attention | |
IcoCap (ViT-B/16) | 59.1 | 110.3 | 39.5 | 76.5 | IcoCap: Improving Video Captioning by Compounding Images | - |
VASTA (Kinetics-backbone) | 56.1 | 106.4 | 39.1 | 74.5 | Diverse Video Captioning by Adaptive Spatio-temporal Attention | |