HyperAI超神経

Video Retrieval on LSMDC

Evaluation Metrics

text-to-video Mean Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
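As a rough illustration of how these metrics are commonly computed, the sketch below derives R@K and Mean Rank from a text-to-video similarity matrix. It assumes each text query has exactly one correct video, at the same index, and the function name `retrieval_metrics` is hypothetical; this is not the evaluation code of any model listed on this page. Higher R@K is better, while lower Mean Rank is better.

```python
def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and Mean Rank for text-to-video retrieval.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank of the correct video: 1 + number of videos scored strictly higher.
        # Ties are resolved optimistically in this sketch.
        ranks.append(1 + sum(1 for s in row if s > target))

    n = len(ranks)
    # R@K: percentage of queries whose correct video is ranked in the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    # Mean Rank: average rank of the correct video over all queries.
    metrics["Mean Rank"] = sum(ranks) / n
    return metrics


# Toy example with 3 queries and 3 videos (made-up similarity scores).
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.2],  # correct video ranked 2nd
    [0.4, 0.6, 0.1],  # correct video ranked 3rd
]
metrics = retrieval_metrics(sim, ks=(1, 2))
print(metrics)
```

In this toy case the correct videos land at ranks 1, 2, and 3, so Mean Rank is 2.0 and R@1 is one third of the queries, expressed as a percentage.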

Evaluation Results

Performance results of each model on this benchmark

Comparison Table
| Model | text-to-video Mean Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 |
|---|---|---|---|---|
| improving-video-text-retrieval-by-multi | 54.4 | 25.9 | 53.7 | 46.1 |
| advancing-high-resolution-video-language | - | 17.4 | 44.1 | 34.1 |
| clip4clip-an-empirical-study-of-clip-for-end | 58.0 | 21.6 | 49.8 | 41.8 |
| use-what-you-have-video-retrieval-using | - | 11.2 | 34.8 | 26.9 |
| mdmmt-multidomain-multimodal-transformer-for | 58.0 | 18.8 | 47.9 | 38.5 |
| expectation-maximization-contrastive-learning | - | 23.9 | 50.9 | 42.4 |
| valor-vision-audio-language-omni-perception | - | 34.2 | 64.1 | 56.0 |
| x-pool-cross-modal-language-video-attention | 53.2 | 25.2 | 53.5 | 43.7 |
| expectation-maximization-contrastive-learning | 8 | - | 53.7 | - |
| hunyuan-tvr-for-text-video-retrivial | 3.9 | 40.4 | 92.8 | 80.1 |
| internvideo-general-video-foundation-models | - | 34.0 | - | - |
| mdmmt-2-multidomain-multimodal-transformer | 48.0 | 26.9 | 55.9 | 46.7 |
| a-straightforward-framework-for-video | - | 11.3 | 29.2 | 22.7 |
| x-clip-end-to-end-multi-grained-contrastive | - | 26.1 | - | - |
| hitea-hierarchical-temporal-aware-video | - | 28.7 | 59.0 | 50.3 |
| learning-a-text-video-embedding-from | - | 10.1 | 34.6 | 25.6 |
| multi-modal-transformer-for-video-retrieval | - | 13.5 | 40.1 | 29.9 |
| cross-modal-retrieval-with-querybank | - | 22.4 | 49.5 | 40.1 |
| centerclip-token-clustering-for-efficient | 47.3 | 24.2 | 55.9 | 46.2 |
| an-empirical-study-of-end-to-end-video | - | 24 | 54.1 | 43.5 |
| mplug-2-a-modularized-multi-modal-foundation | - | 34.4 | 65.1 | 55.2 |
| a-joint-sequence-fusion-model-for-video | - | 9.1 | 34.1 | 21.2 |
| cosa-concatenated-sample-pretrained-vision | - | 39.4 | - | - |
| learning-from-video-and-text-via-large-scale | - | 7.3 | 27.1 | 19.2 |
| unmasked-teacher-towards-training-efficient | - | 43.0 | 73.0 | 65.5 |
| howto100m-learning-a-text-video-embedding-by | - | 7.2 | 27.9 | 19.6 |
| vid-tldr-training-free-token-merging-for | - | 43.1 | 71.4 | 64.5 |
| frozen-in-time-a-joint-video-and-image | - | 15.0 | 39.8 | 30.8 |
| expectation-maximization-contrastive-learning | - | 25.9 | - | 46.4 |
| revisiting-temporal-modeling-for-clip-based | - | 29.2 | 58.8 | 49.5 |
| end-to-end-concept-word-detection-for-video | - | 5.1 | 25.2 | 16.3 |
| video-and-text-matching-with-conditioned | - | 14.9 | - | 33.2 |
| internvideo2-scaling-video-foundation-models | - | 46.4 | - | - |
| clip-vip-adapting-pre-trained-image-text | - | 30.7 | 60.6 | 51.4 |
| clover-towards-a-unified-video-language | - | 24.8 | 54.5 | 44 |
| hunyuan-tvr-for-text-video-retrivial | 56.4 | 29.7 | 55.4 | 46.4 |
| diffusionret-generative-text-video-retrieval | 40.7 | 24.4 | 54.3 | 43.1 |
| multi-modal-transformer-for-video-retrieval | - | 13.2 | 38.8 | 29.2 |