Video Retrieval on LSMDC
Evaluation Metrics
text-to-video Mean Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
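As a rough illustration of what these metrics measure, R@K is usually reported as the percentage of text queries whose correct video appears in the top K results (higher is better), and Mean Rank as the average position of the correct video (lower is better). The sketch below makes several assumptions: the function name `retrieval_metrics` is hypothetical, the ground-truth video for query `i` sits at index `i`, and ties are resolved in the query's favor. Individual papers on this leaderboard may use different protocols.

```python
def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute text-to-video R@K (in percent) and Mean Rank.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i (diagonal ground truth).
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one
        # (ties are counted in favor of the correct video).
        ranks.append(1 + sum(1 for score in row if score > target))

    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Mean Rank"] = sum(ranks) / n
    return metrics


# Toy 3x3 similarity matrix (made-up numbers, not benchmark data).
toy_sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.2],  # correct video ranked 2nd
    [0.1, 0.7, 0.6],  # correct video ranked 2nd
]
print(retrieval_metrics(toy_sim, ks=(1, 2)))
```

On the toy matrix, one of three queries ranks its video first and the other two rank it second, so R@1 is about 33.3, R@2 is 100, and Mean Rank is about 1.67.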
Evaluation Results
Performance results of each model on this benchmark
Comparison Table
Model | text-to-video Mean Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 |
---|---|---|---|---|
improving-video-text-retrieval-by-multi | 54.4 | 25.9 | 53.7 | 46.1 |
advancing-high-resolution-video-language | - | 17.4 | 44.1 | 34.1 |
clip4clip-an-empirical-study-of-clip-for-end | 58.0 | 21.6 | 49.8 | 41.8 |
use-what-you-have-video-retrieval-using | - | 11.2 | 34.8 | 26.9 |
mdmmt-multidomain-multimodal-transformer-for | 58.0 | 18.8 | 47.9 | 38.5 |
expectation-maximization-contrastive-learning | - | 23.9 | 50.9 | 42.4 |
valor-vision-audio-language-omni-perception | - | 34.2 | 64.1 | 56.0 |
x-pool-cross-modal-language-video-attention | 53.2 | 25.2 | 53.5 | 43.7 |
expectation-maximization-contrastive-learning | 8 | - | 53.7 | - |
hunyuan-tvr-for-text-video-retrivial | 3.9 | 40.4 | 92.8 | 80.1 |
internvideo-general-video-foundation-models | - | 34.0 | - | - |
mdmmt-2-multidomain-multimodal-transformer | 48.0 | 26.9 | 55.9 | 46.7 |
a-straightforward-framework-for-video | - | 11.3 | 29.2 | 22.7 |
x-clip-end-to-end-multi-grained-contrastive | - | 26.1 | - | - |
hitea-hierarchical-temporal-aware-video | - | 28.7 | 59.0 | 50.3 |
learning-a-text-video-embedding-from | - | 10.1 | 34.6 | 25.6 |
multi-modal-transformer-for-video-retrieval | - | 13.5 | 40.1 | 29.9 |
cross-modal-retrieval-with-querybank | - | 22.4 | 49.5 | 40.1 |
centerclip-token-clustering-for-efficient | 47.3 | 24.2 | 55.9 | 46.2 |
an-empirical-study-of-end-to-end-video | - | 24 | 54.1 | 43.5 |
mplug-2-a-modularized-multi-modal-foundation | - | 34.4 | 65.1 | 55.2 |
a-joint-sequence-fusion-model-for-video | - | 9.1 | 34.1 | 21.2 |
cosa-concatenated-sample-pretrained-vision | - | 39.4 | - | - |
learning-from-video-and-text-via-large-scale | - | 7.3 | 27.1 | 19.2 |
unmasked-teacher-towards-training-efficient | - | 43.0 | 73.0 | 65.5 |
howto100m-learning-a-text-video-embedding-by | - | 7.2 | 27.9 | 19.6 |
vid-tldr-training-free-token-merging-for | - | 43.1 | 71.4 | 64.5 |
frozen-in-time-a-joint-video-and-image | - | 15.0 | 39.8 | 30.8 |
expectation-maximization-contrastive-learning | - | 25.9 | - | 46.4 |
revisiting-temporal-modeling-for-clip-based | - | 29.2 | 58.8 | 49.5 |
end-to-end-concept-word-detection-for-video | - | 5.1 | 25.2 | 16.3 |
video-and-text-matching-with-conditioned | - | 14.9 | - | 33.2 |
internvideo2-scaling-video-foundation-models | - | 46.4 | - | - |
clip-vip-adapting-pre-trained-image-text | - | 30.7 | 60.6 | 51.4 |
clover-towards-a-unified-video-language | - | 24.8 | 54.5 | 44 |
hunyuan-tvr-for-text-video-retrivial | 56.4 | 29.7 | 55.4 | 46.4 |
diffusionret-generative-text-video-retrieval | 40.7 | 24.4 | 54.3 | 43.1 |
multi-modal-transformer-for-video-retrieval | - | 13.2 | 38.8 | 29.2 |