Video Retrieval on DiDeMo
Evaluation Metrics
text-to-video R@1
text-to-video R@10
text-to-video R@5
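The R@K metrics listed above are conventionally read as Recall at K: the percentage of text queries whose correct video appears among the top K retrieved results. The page itself does not spell out the computation, so the following is only a minimal sketch under stated assumptions: a hypothetical similarity matrix in which row i is a text query, column j is a video, and the correct video for query i is video i. The function name `recall_at_k` and the toy values are illustrative and not taken from any listed model.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    Assumes similarity[i][j] scores text query i against video j, and that
    the correct video for query i is video i (an assumed layout).
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix with illustrative values only.
sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.6],
    [0.2, 0.7, 0.4],
]

r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
r3 = recall_at_k(sim, 3)
```

In this sketch ties are resolved in favor of the correct video, which is one of several possible conventions. Recall can only grow as K grows, which is why each row of the results table has R@1 ≤ R@5 ≤ R@10.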
Evaluation Results
Performance of each model on this benchmark
Comparison Table
Model name | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 |
---|---|---|---|
rtq-rethinking-video-language-understanding | 57.6 | 89.9 | 84.1 |
frozen-in-time-a-joint-video-and-image | 31.0 | 72.4 | 59.8 |
diffusionret-generative-text-video-retrieval | 48.9 | 83.3 | 75.5 |
hunyuan-tvr-for-text-video-retrivial | 52.7 | 85.2 | 77.8 |
advancing-high-resolution-video-language | 28.8 | 69.1 | 57.4 |
cross-modal-retrieval-with-querybank | 43.5 | 80.9 | 71.4 |
cap4video-what-can-auxiliary-captions-do-for | 52.0 | 87.5 | 79.4 |
revisiting-temporal-modeling-for-clip-based | 54.6 | 85.1 | 78.4 |
clip-vip-adapting-pre-trained-image-text | 55.3 | 89.3 | 82.0 |
align-and-prompt-video-and-language-pre | 35.9 | 78.8 | 67.5 |
disentangled-representation-learning-for-text | 49.0 | 84.5 | 76.5 |
vast-a-vision-audio-subtitle-text-omni-1 | 72.0 | 91.4 | 89.0 |
dual-modal-attention-enhanced-text-video | 52.7 | 86.6 | 79.3 |
mplug-2-a-modularized-multi-modal-foundation | 56.4 | 85.2 | 79.1 |
clover-towards-a-unified-video-language | 50.1 | 85.6 | 76.7 |
vindlu-a-recipe-for-effective-video-and | 61.2 | 91.0 | 85.8 |
use-what-you-have-video-retrieval-using | 16.1 | 54.4 | 41.1 |
multi-efficient-video-and-language | 56.5 | 87.0 | 80.2 |
unmasked-teacher-towards-training-efficient | 70.4 | 93.5 | 90.1 |
hitea-hierarchical-temporal-aware-video | 56.5 | 89.7 | 81.7 |
hunyuan-tvr-for-text-video-retrivial | 52.1 | 85.7 | 78.2 |
revealing-single-frame-bias-for-video-and | 53.9 | 86.9 | 79.4 |
an-empirical-study-of-end-to-end-video | 47.9 | 84.1 | 76.5 |
cosa-concatenated-sample-pretrained-vision | 70.5 | - | - |
gramian-multimodal-representation-learning | 67.3 | 90.1 | - |
x-clip-end-to-end-multi-grained-contrastive | 47.8 | - | 79.3 |
internvideo-general-video-foundation-models | 57.9 | - | - |
testa-temporal-spatial-token-aggregation-for | 61.2 | 91.5 | 87.2 |
prototype-based-aleatoric-uncertainty-1 | 48.6 | 84.5 | 76.0 |
valor-vision-audio-language-omni-perception | 61.5 | 90.4 | 85.3 |
video-text-as-game-players-hierarchical | 46.9 | 82.7 | 74.9 |
vid-tldr-training-free-token-merging-for | 72.3 | 94.2 | 91.2 |
vlab-enhancing-video-language-pre-training-by | 56.8 | 88.7 | 81.6 |
rudder-a-cross-lingual-video-and-text | 16.3 | 56.5 | - |
omnivl-one-foundation-model-for-image | 52.4 | 85.4 | 79.5 |
diffusionret-generative-text-video-retrieval | 46.7 | 82.7 | 74.7 |
improving-video-text-retrieval-by-multi | 43.8 | 79.9 | 71.4 |
Model 38 | - | 85.3 | 77.4 |
clip4clip-an-empirical-study-of-clip-for-end | 43.4 | 80.6 | 70.2 |
internvideo2-scaling-video-foundation-models | 74.2 | - | - |