Video Retrieval on MSR-VTT
Metrics
text-to-video R@1
text-to-video R@5
text-to-video R@10
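Recall@K (R@K) for text-to-video retrieval is the percentage of text queries whose ground-truth video appears among the top-K videos ranked by similarity. A minimal sketch of the computation, assuming a square similarity matrix in which the matching text–video pair sits on the diagonal (the function name `recall_at_k` and the toy scores are illustrative, not from any particular model in the table):

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Percentage of text queries whose ground-truth video is ranked
    within the top k. Assumes similarity[i, j] scores text i against
    video j, with the matching pair on the diagonal (i == j)."""
    # Rank videos for each text query by descending similarity.
    ranks = np.argsort(-similarity, axis=1)
    # Position of the ground-truth video (column i) in each query's ranking.
    hits = np.argmax(ranks == np.arange(len(similarity))[:, None], axis=1)
    return float(np.mean(hits < k) * 100)

# Toy example: 3 text queries vs. 3 videos.
sim = np.array([
    [0.9, 0.1, 0.2],   # query 0: correct video ranked 1st
    [0.3, 0.2, 0.8],   # query 1: correct video ranked 3rd
    [0.1, 0.7, 0.5],   # query 2: correct video ranked 2nd
])
```

With these toy scores, R@1 is 33.3 (one of three queries ranks its video first) and R@2 is 66.7; the leaderboard numbers below are the same statistic computed over the MSR-VTT test queries.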
Results
Performance results of various models on this benchmark
Comparison table
| Model name | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 |
|---|---|---|---|
| audio-enhanced-text-to-video-retrieval-using | 52 | 76.6 | 86.1 |
| video-text-modeling-with-zero-shot-transfer | 34.3 | 57.8 | 67.0 |
| taco-token-aware-cascade-contrastive-learning | 24.8 | 52.1 | 64.0 |
| an-empirical-study-of-end-to-end-video | 37.2 | 64.8 | 75.8 |
| cosa-concatenated-sample-pretrained-vision | 57.9 | - | - |
| coca-contrastive-captioners-are-image-text | 30.0 | 52.4 | 61.6 |
| a-straightforward-framework-for-video | 21.4 | 41.1 | 50.4 |
| rome-role-aware-mixture-of-expert-transformer | 10.7 | 29.6 | 41.2 |
| internvideo2-scaling-video-foundation-models | 62.8 | - | - |
| learning-language-visual-embedding-for-movie | 4.2 | - | 19.9 |
| valor-vision-audio-language-omni-perception | 59.9 | 83.5 | 89.6 |
| gramian-multimodal-representation-learning | 64 | - | 89.3 |
| Model 13 | 52.4 | 73.9 | 82 |
| temporal-tessellation-a-unified-approach-for | 4.7 | - | 24.1 |
| video-and-text-matching-with-conditioned | 26 | 56.7 | - |
| howto100m-learning-a-text-video-embedding-by | 14.9 | - | 52.8 |
| frozen-in-time-a-joint-video-and-image | 32.5 | 61.5 | 71.2 |
| lightweight-attentional-feature-fusion-for | 29.1 | 54.9 | 65.8 |
| meltr-meta-loss-transformer-for-learning-to | 38.6 | 74.4 | 84.7 |
| a-joint-sequence-fusion-model-for-video | 10.2 | - | 43.2 |
| cots-collaborative-two-stream-vision-language | 32.1 | 60.8 | 70.2 |
| unified-coarse-to-fine-alignment-for-video | 49.4 | 72.1 | 83.5 |
| vid-tldr-training-free-token-merging-for | 58.1 | 81.0 | 81.6 |
| clip2tv-an-empirical-study-on-transformer | 33.1 | 58.9 | 68.9 |
| univilm-a-unified-video-and-language-pre | 21.2 | 49.6 | 63.1 |
| improving-video-text-retrieval-by-multi | 32.9 | 58.3 | 68.4 |
| use-what-you-have-video-retrieval-using | 10.0 | 29.0 | 41.2 |
| advancing-high-resolution-video-language | 35.6 | 65.3 | 78 |
| meltr-meta-loss-transformer-for-learning-to | 33.6 | 63.7 | 77.8 |
| omnivl-one-foundation-model-for-image | 47.8 | 74.2 | 83.8 |
| vlab-enhancing-video-language-pre-training-by | 55.1 | 78.8 | 87.6 |
| mdmmt-multidomain-multimodal-transformer-for | 23.1 | 49.8 | 61.8 |
| mdmmt-2-multidomain-multimodal-transformer | 33.7 | 60.5 | 70.8 |
| learning-joint-embedding-with-multimodal-cues | 7.0 | 20.9 | 29.7 |
| meltr-meta-loss-transformer-for-learning-to | 28.5 | 55.5 | 67.6 |
| internvideo-general-video-foundation-models | 55.2 | - | - |
| vast-a-vision-audio-subtitle-text-omni-1 | 63.9 | 84.3 | 89.6 |
| unmasked-teacher-towards-training-efficient | 58.8 | 81.0 | 87.1 |
| clip2video-mastering-video-text-retrieval-via | 29.8 | 55.5 | 66.2 |
| clip4clip-an-empirical-study-of-clip-for-end | 44.5 | 71.4 | 81.6 |