Video Retrieval on MSR-VTT-1kA
Evaluation Metrics
text-to-video Median Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
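The page lists these metrics without defining them. Under the usual definitions, R@K is the percentage of text queries whose correct video appears among the top K retrieved results (higher is better), and Median Rank is the median position of the correct video across all queries (lower is better). The following is a minimal sketch of that computation, not the evaluation code of any listed model. It assumes one ground-truth video per query, stored at the same index as the query, and it breaks ties in favor of the correct video. The function name `retrieval_metrics` and the toy similarity matrix are hypothetical.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute text-to-video R@K and Median Rank.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)

    n = len(ranks)
    # R@K as a percentage of queries whose correct video is ranked within the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["MedR"] = median(ranks)
    return metrics


# Toy 4x4 example (made-up scores, for illustration only).
sim = [
    [0.9, 0.1, 0.2, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.1, 0.2],  # correct video ranked 2nd
    [0.1, 0.2, 0.7, 0.3],  # correct video ranked 1st
    [0.9, 0.8, 0.7, 0.1],  # correct video ranked 4th
]
print(retrieval_metrics(sim, ks=(1, 2)))
# {'R@1': 50.0, 'R@2': 75.0, 'MedR': 1.5}
```

With an even number of queries the median can fall between two ranks, which is one reason the table below mixes integer and decimal Median Rank values.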
Evaluation Results
Performance of each model on this benchmark
Comparison Table
Model Name | text-to-video Median Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 |
---|---|---|---|---|
meltr-meta-loss-transformer-for-learning-to | 4 | 31.1 | 68.3 | 55.7 |
side4video-spatial-temporal-side-network-for | 1.0 | 52.3 | 84.2 | 75.5 |
clip2tv-an-empirical-study-on-transformer | 1 | 52.9 | 86.5 | 78.5 |
omnivec-learning-robust-representations-with | - | - | 89.4 | - |
hitea-hierarchical-temporal-aware-video | - | 46.8 | 81.9 | 71.2 |
lightweight-attentional-feature-fusion-for | - | 45.8 | 82 | 71.5 |
holistic-features-are-almost-sufficient-for | - | 48.0 | 83.5 | 75.9 |
cots-collaborative-two-stream-vision-language | 2 | 36.8 | 73.2 | 63.8 |
florence-a-new-foundation-model-for-computer | - | 37.6 | 72.6 | 63.8 |
revealing-single-frame-bias-for-video-and | - | 41.5 | 77 | 68.7 |
x-clip-end-to-end-multi-grained-contrastive | 2.0 | 49.3 | 84.8 | 75.8 |
clip4clip-an-empirical-study-of-clip-for-end | 2 | - | 81.6 | - |
taco-token-aware-cascade-contrastive-learning | 4 | 28.4 | 71.2 | 57.8 |
mdmmt-multidomain-multimodal-transformer-for | 2 | 38.9 | 79.7 | 69.0 |
rtq-rethinking-video-language-understanding | - | 53.4 | 84.4 | 76.1 |
unified-coarse-to-fine-alignment-for-video | - | 49.4 | 83.5 | 72.1 |
meltr-meta-loss-transformer-for-learning-to | - | 41.3 | 82.5 | 73.5 |
pidro-parallel-isomeric-attention-with | 1.0 | 55.9 | 87.6 | 79.8 |
revisiting-temporal-modeling-for-clip-based | 1 | 54.1 | 87.8 | 79.5 |
frozen-in-time-a-joint-video-and-image | 3 | 31.0 | 70.5 | 59.5 |
prototype-based-aleatoric-uncertainty-1 | 2.0 | 48.5 | 82.5 | 72.7 |
cross-modal-retrieval-with-querybank | 2 | 47.2 | 83.0 | 73.0 |
ts2-net-token-shift-and-selection-transformer | - | 54.0 | 87.4 | 79.3 |
expectation-maximization-contrastive-learning | - | 46.8 | 83.1 | 73.1 |
meltr-meta-loss-transformer-for-learning-to | 3 | 35.5 | 78.4 | 67.2 |
centerclip-token-clustering-for-efficient | 2 | 48.4 | 82.0 | 73.8 |
bridgeformer-bridging-video-text-retrieval | 7 | 26 | 56.4 | 46.4 |
video-text-as-game-players-hierarchical | 2.0 | 48.6 | 83.4 | 74.6 |
disentangled-representation-learning-for-text | 1 | 53.3 | 87.6 | 80.3 |
expectation-maximization-contrastive-learning | - | 51.6 | 85.3 | 78.1 |
holistic-features-are-almost-sufficient-for | - | 46.8 | 82.6 | 74.3 |
clip-vip-adapting-pre-trained-image-text | 1.0 | 57.7 | 88.2 | 80.5 |
a-straightforward-framework-for-video | 4 | 31.2 | 64.2 | 53.7 |
x-pool-cross-modal-language-video-attention | 2 | 46.9 | 82.2 | 72.8 |
vindlu-a-recipe-for-effective-video-and | - | 46.5 | 80.4 | 71.5 |
clip2video-mastering-video-text-retrieval-via | 2 | 45.6 | 81.7 | 72.6 |
improving-video-text-retrieval-by-multi | 2 | 48.8 | 85.3 | 75.6 |
mplug-2-a-modularized-multi-modal-foundation | - | 53.1 | 84.7 | 77.6 |
a-joint-sequence-fusion-model-for-video | 13 | 10.2 | 43.2 | 31.2 |
use-what-you-have-video-retrieval-using | 6 | 20.9 | 62.4 | 48.8 |
diffusionret-generative-text-video-retrieval | 2.0 | 49.0 | 82.7 | 75.2 |
diffusionret-generative-text-video-retrieval | 2.0 | 48.9 | 83.1 | 75.2 |
dual-modal-attention-enhanced-text-video | 1.0 | 55.5 | 87.1 | 79.4 |
hunyuan-tvr-for-text-video-retrivial | - | 55.0 | - | - |
socratic-models-composing-zero-shot | - | - | - | - |
cap4video-what-can-auxiliary-captions-do-for | 1 | 51.4 | 83.9 | 75.7 |
towards-efficient-and-effective-text-to-video | - | 54.1 | 86.9 | 78.8 |
bridgeformer-bridging-video-text-retrieval | 3 | 37.6 | 75.1 | 64.8 |
omnivec-learning-robust-representations-with | - | - | 78.6 | - |
vlm-task-agnostic-video-language-model-pre | 4 | 28.1 | 67.4 | 55.5 |
howto100m-learning-a-text-video-embedding-by | 9 | 14.9 | 52.8 | 40.2 |
x-2-vlm-all-in-one-pre-trained-model-for | - | 49.6 | 84.2 | 76.7 |
hunyuan-tvr-for-text-video-retrivial | 1.0 | 62.9 | 90.8 | 84.5 |
x-2-vlm-all-in-one-pre-trained-model-for | - | 47.6 | 84.2 | 74.1 |
masked-contrastive-pre-training-for-efficient | 3 | 38.9 | 73.9 | 63.1 |
multi-efficient-video-and-language | - | 54.7 | 86.0 | 77.7 |
multi-modal-transformer-for-video-retrieval | 4 | 26.6 | 69.6 | 57.1 |
multi-modal-transformer-for-video-retrieval | 4 | 24.6 | 67.1 | 54.0 |
all-in-one-exploring-unified-video-language | - | 37.9 | 77.1 | 68.1 |
videoclip-contrastive-pre-training-for-zero | - | 30.9 | 66.8 | 55.4 |
howto100m-learning-a-text-video-embedding-by | 12 | 12.1 | 48.0 | 35.0 |
clover-towards-a-unified-video-language | 2 | 40.5 | 79.4 | 69.8 |
video-text-retrieval-by-supervised-multi | - | 49.8 | 83.9 | 75.1 |