Zero-Shot Video Retrieval on MSR-VTT
Evaluation Metrics
text-to-video Median Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
video-to-text Median Rank
video-to-text R@1
video-to-text R@10
video-to-text R@5
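The metrics listed above can be computed from a query-by-candidate similarity matrix. The following is a minimal sketch, not the evaluation code of any listed model. It assumes that each query has exactly one correct candidate at the same index, and that ties are resolved in the query's favor. The function name `retrieval_metrics` and the toy matrix are hypothetical, chosen only for illustration.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and Median Rank.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the ground truth: 1 + number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    return metrics


# Toy example: rows are text queries, columns are videos (made-up scores).
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.3],
    [0.1, 0.7, 0.6],
]

text_to_video = retrieval_metrics(sim)

# Video-to-text uses the transposed matrix: rows become videos, columns texts.
video_to_text = retrieval_metrics([list(col) for col in zip(*sim)])

print(text_to_video)
print(video_to_text)
```

Higher R@K is better, while lower Median Rank is better. Published results may use different tie-breaking or multiple captions per video, so numbers from this sketch are not directly comparable to the table.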
Evaluation Results
Performance results of each model on this benchmark
Comparison Table
Model Name | text-to-video Median Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | video-to-text Median Rank | video-to-text R@1 | video-to-text R@10 | video-to-text R@5 |
---|---|---|---|---|---|---|---|---|
languagebind-extending-video-language | 2.0 | 42.8 | 76.0 | 67.5 | 3.0 | 38.3 | 77.8 | 65.8 |
unmasked-teacher-towards-training-efficient | - | 42.6 | 73.1 | 64.4 | - | 38.6 | 69.6 | 59.8 |
videoclip-contrastive-pre-training-for-zero | - | 10.4 | 30.0 | 22.2 | - | - | - | - |
multi-granularity-correspondence-learning-1 | - | 10.7 | - | 24.1 | - | - | - | - |
mplug-2-a-modularized-multi-modal-foundation | - | 47.1 | 79.0 | 69.7 | - | - | - | - |
omnivl-one-foundation-model-for-image | - | 34.6 | 66.6 | 58.4 | - | - | - | - |
end-to-end-learning-of-visual-representations | - | 9.9 | 32.4 | 24.0 | - | - | - | - |
hitea-hierarchical-temporal-aware-video | - | 29.9 | 62.9 | 54.2 | - | - | - | - |
internvideo2-scaling-video-foundation-models | - | 51.9 | 82.5 | 75.3 | - | 50.9 | 81.8 | 73.4 |
align-and-prompt-video-and-language-pre | 8 | 24.1 | 55.4 | 44.7 | - | - | - | - |
revealing-single-frame-bias-for-video-and | - | 34.0 | 66.7 | 56.7 | - | - | - | - |
multi-modal-transformer-for-video-retrieval | 66 | - | - | 14.4 | - | - | - | - |
revealing-single-frame-bias-for-video-and | - | 28.4 | 59.5 | 50.2 | - | - | - | - |
hitea-hierarchical-temporal-aware-video | - | 34.4 | 69.9 | 60.0 | - | - | - | - |
gramian-multimodal-representation-learning | - | 54.8 | 83.9 | - | - | 52.9 | 82.9 | - |
clip4clip-an-empirical-study-of-clip-for-end | 4 | 32.0 | 66.9 | 57.0 | - | - | - | - |
noise-estimation-using-density-estimation-for | - | 8.0 | 29.3 | 21.3 | - | - | - | - |
internvideo-general-video-foundation-models | - | 40.7 | - | - | - | 39.6 | - | - |
clover-towards-a-unified-video-language | 6 | 26.4 | 60 | 49.5 | - | - | - | - |
howtocaption-prompting-llms-to-transform | 3 | 37.6 | 73.3 | 62 | - | - | - | - |
vast-a-vision-audio-subtitle-text-omni-1 | - | 49.3 | 73.9 | 68.3 | - | - | - | - |
taco-token-aware-cascade-contrastive-learning | - | 9.8 | 33.4 | 25.0 | - | - | - | - |
vid-tldr-training-free-token-merging-for | - | 42.1 | 72.4 | 63.9 | - | 37.7 | 69.4 | 59.8 |
one-for-all-video-conversation-is-feasible | - | 40.9 | 73.5 | 64.7 | - | - | - | - |
advancing-high-resolution-video-language | 15 | 14.6 | 44.1 | 34.4 | - | - | - | - |
vatt-transformers-for-multimodal-self | 49 | - | 29.7 | - | - | - | - | - |
internvideo2-scaling-video-foundation-models | - | 55.9 | 85.1 | 78.3 | - | 53.7 | 84.1 | 77.5 |
object-aware-video-language-pre-training-for | 8.0 | 23.4 | 55.6 | 47.5 | - | - | - | - |
miles-visual-bert-pre-training-with-injected | 7 | 26.1 | 56.9 | 47.2 | - | - | - | - |
howtocaption-prompting-llms-to-transform | 1 | 50 | 81.4 | 73.2 | - | - | - | - |
imagebind-one-embedding-space-to-bind-them | - | 36.8 | 70.0 | 61.8 | - | - | - | - |
bridgeformer-bridging-video-text-retrieval | 7.0 | 26.0 | 56.4 | 46.4 | - | - | - | - |
florence-a-new-foundation-model-for-computer | - | 37.6 | 72.6 | 63.8 | - | - | - | - |
lat-latent-translation-with-cycle-consistency | 8 | 23.4 | 53.3 | 44.1 | 12 | 17.2 | 47.9 | 36.2 |
learning-audio-video-modalities-from-image | - | 19.4 | 50.3 | 39.5 | - | - | - | - |
languagebind-extending-video-language | 2 | 44.8 | 78.7 | 70.0 | 2.0 | 40.9 | 75.7 | 66.4 |
frozen-in-time-a-joint-video-and-image | 7.0 | 24.7 | 57.2 | 46.9 | - | - | - | - |
violet-end-to-end-video-language-transformers | - | 25.9 | 59.7 | 49.5 | - | - | - | - |
seeing-what-you-miss-vision-language-pre | - | 30.9 | 65.0 | 54.4 | - | - | - | - |