Video Retrieval on ActivityNet
Evaluation Metrics
text-to-video Median Rank
text-to-video R@1
text-to-video R@5
text-to-video R@50
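
By the usual retrieval convention, R@K is the percentage of text queries whose correct video appears among the top K ranked results (higher is better), and Median Rank is the median position of the correct video across all queries (lower is better). The following is a minimal Python sketch of that computation. It assumes a hypothetical similarity matrix `sim` in which the correct video for query i is video i. The function names are illustrative and are not taken from the code of any model listed here.

```python
from statistics import median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct video for each text query.

    Assumption: sim[i][j] is the score between text query i and video j,
    and the correct video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video over all queries."""
    return median(ranks)


# Toy example with 4 text queries and 4 videos (made-up scores).
sim = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.7, 0.1, 0.0],
    [0.2, 0.6, 0.5, 0.9],
    [0.1, 0.2, 0.3, 0.4],
]
ranks = ranks_from_similarity(sim)
print(ranks)                  # [1, 2, 3, 1]
print(recall_at_k(ranks, 1))  # 50.0
print(recall_at_k(ranks, 5))  # 100.0
print(median_rank(ranks))     # 1.5
```

Ties are broken optimistically in this sketch (only strictly higher scores push the correct video down). Published evaluation scripts may handle ties differently, so small discrepancies with reported numbers are possible.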
Evaluation Results
Performance of each model on this benchmark
Comparison Table
Model Name | text-to-video Median Rank | text-to-video R@1 | text-to-video R@5 | text-to-video R@50 |
---|---|---|---|---|
advancing-high-resolution-video-language | 4 | 28.5 | 57.4 | 94 |
revealing-single-frame-bias-for-video-and | - | 47.1 | 75.5 | - |
rtq-rethinking-video-language-understanding | - | 53.5 | 81.4 | - |
multi-modal-transformer-for-video-retrieval | 3.3 | 28.7 | 61.4 | 94.5 |
video-and-text-matching-with-conditioned | - | 25.4 | 59.1 | - |
improving-video-text-retrieval-by-multi | 1 | 51.0 | 77.7 | - |
internvideo-general-video-foundation-models | - | 62.2 | - | - |
diffusionret-generative-text-video-retrieval | 2.0 | 45.8 | 75.6 | - |
x-clip-end-to-end-multi-grained-contrastive | - | 46.2 | 75.5 | - |
hitea-hierarchical-temporal-aware-video | - | 49.7 | 77.1 | - |
valor-vision-audio-language-omni-perception | - | 70.1 | 90.8 | - |
clip4clip-an-empirical-study-of-clip-for-end | 2 | 40.5 | 73.4 | 98.2 |
video-text-as-game-players-hierarchical | 2.0 | 42.2 | 73.0 | - |
expectation-maximization-contrastive-learning | - | 41.2 | 72.7 | - |
expectation-maximization-contrastive-learning | - | 50.6 | 78.7 | 98.1 |
cosa-concatenated-sample-pretrained-vision | - | 67.3 | - | - |
diffusionret-generative-text-video-retrieval | 2.0 | 48.1 | - | - |
multi-modal-transformer-for-video-retrieval | 5 | 22.7 | 54.2 | 93.2 |
clip-vip-adapting-pre-trained-image-text | 1 | 61.4 | 85.7 | - |
dual-modal-attention-enhanced-text-video | 1.0 | 53.4 | 80.7 | - |
hunyuan-tvr-for-text-video-retrivial | 1 | 57.3 | 84.8 | - |
vid-tldr-training-free-token-merging-for | - | 66.7 | 88.6 | - |
unmasked-teacher-towards-training-efficient | - | 66.8 | 89.1 | - |
vindlu-a-recipe-for-effective-video-and | - | 55.0 | 81.4 | - |
testa-temporal-spatial-token-aggregation-for | - | 54.8 | 80.8 | - |
centerclip-token-clustering-for-efficient | 2 | 46.2 | 77.0 | - |
gramian-multimodal-representation-learning | - | 69.9 | - | - |
taco-token-aware-cascade-contrastive-learning | 3.0 | 30.4 | 61.2 | 93.4 |
internvideo2-scaling-video-foundation-models | - | 74.1 | - | - |
vast-a-vision-audio-subtitle-text-omni-1 | - | 70.5 | 90.9 | - |
use-what-you-have-video-retrieval-using | 6 | 20.5 | 47.7 | 91.4 |