Zero-Shot Video Retrieval on DiDeMo
Metrics
text-to-video R@1
text-to-video R@10
text-to-video R@5
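As a rough illustration of how the text-to-video R@K metrics above are usually computed, here is a minimal sketch. It assumes a square similarity matrix in which entry `sim[i][j]` scores text query `i` against video `j`, and that the correct video for query `i` sits at index `i`. The function name `recall_at_k` and the toy scores are hypothetical and are not taken from any model on this leaderboard.

```python
def recall_at_k(sim, k):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    Assumption: sim is a square list of lists, and the ground-truth video
    for query i is video i (the diagonal entry).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank = number of videos scored strictly higher
        # than the ground-truth video for this query.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy similarity scores for 3 text queries and 3 videos (made up for illustration).
sim = [
    [0.9, 0.1, 0.2],  # ground truth ranked 1st
    [0.8, 0.3, 0.5],  # ground truth ranked 3rd
    [0.2, 0.7, 0.6],  # ground truth ranked 2nd
]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(sim, k):.1f}")
```

Higher is better for every R@K column, and for a given model R@1 ≤ R@5 ≤ R@10 always holds, because a hit in the top 1 is also a hit in the top 5 and the top 10.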
Results
Performance of each model on this benchmark
Comparison table
Model | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 |
---|---|---|---|
revealing-single-frame-bias-for-video-and | 36.9 | 69.3 | 61.1 |
internvideo2-scaling-video-foundation-models | 57.9 | 84.6 | 80.0 |
one-for-all-video-conversation-is-feasible | 35.6 | 72.6 | 61.9 |
languagebind-extending-video-language | 39.9 | 74.6 | 66.1 |
hitea-hierarchical-temporal-aware-video | 43.2 | 79.0 | 69.3 |
clover-towards-a-unified-video-language | 29.5 | 66.3 | 55.2 |
languagebind-extending-video-language | 39.7 | 73.8 | 65.5 |
mplug-2-a-modularized-multi-modal-foundation | 45.7 | 79.2 | 71.1 |
vast-a-vision-audio-subtitle-text-omni-1 | 55.5 | 79.6 | 74.3 |
revealing-single-frame-bias-for-video-and | 37.1 | 69.9 | 61.7 |
violet-end-to-end-video-language-transformers | 23.5 | 59.8 | 49.8 |
miles-visual-bert-pre-training-with-injected | 27.2 | 63.6 | 50.3 |
gramian-multimodal-representation-learning | 54.2 | 80.7 | - |
align-and-prompt-video-and-language-pre | 23.8 | 57.9 | 47.3 |
internvideo-general-video-foundation-models | 31.5 | 68.2 | 57.6 |
videoclip-contrastive-pre-training-for-zero | 16.6 | - | 46.9 |
frozen-in-time-a-joint-video-and-image | 21.1 | 56.2 | 46.0 |
bridgeformer-bridging-video-text-retrieval | 25.6 | 61.1 | 50.6 |
hitea-hierarchical-temporal-aware-video | 36.1 | 70.3 | 60.1 |
object-aware-video-language-pre-training-for | 23.5 | 59.8 | 50.4 |
vid-tldr-training-free-token-merging-for | 52.0 | 81.0 | 74.0 |
unmasked-teacher-towards-training-efficient | 48.6 | 79.0 | 72.9 |
frozen-in-time-a-joint-video-and-image | 20.2 | 58.5 | 46.4 |
omnivl-one-foundation-model-for-image | 33.3 | 68.5 | 58.7 |
lat-latent-translation-with-cycle-consistency | 22.6 | 58.9 | 45.9 |
internvideo2-scaling-video-foundation-models | 57.0 | 85.1 | 80.0 |