HyperAI

Zero-Shot Video Retrieval on MSR-VTT

Metrics

text-to-video Median Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
video-to-text Median Rank
video-to-text R@1
video-to-text R@10
video-to-text R@5
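The metrics above can be computed from a query-by-candidate similarity matrix. The sketch below is a minimal illustration, not the evaluation code used by any listed model. It assumes one ground-truth candidate per query (query i matches candidate i) and resolves ties in favor of the correct candidate. The function name `retrieval_metrics` and the toy matrix are hypothetical.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and Median Rank.

    sim[i, j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    sim = np.asarray(sim, dtype=float)
    # Score of the correct candidate for each query, as a column vector.
    correct = np.diag(sim)[:, None]
    # Rank = 1 + number of candidates scoring strictly higher than the correct one.
    ranks = 1 + (sim > correct).sum(axis=1)
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    return metrics

# Toy example: rows are text queries, columns are videos.
sim = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.5, 0.1, 0.2],
    [0.3, 0.4, 0.2, 0.5],
    [0.1, 0.2, 0.3, 0.7],
])

text_to_video = retrieval_metrics(sim)     # queries are texts
video_to_text = retrieval_metrics(sim.T)   # queries are videos
```

Transposing the matrix swaps the roles of query and candidate, which is why the same function covers both the text-to-video and video-to-text directions.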

Results

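Performance of the listed models on this benchmark. R@K values are percentages (higher is better); for Median Rank, lower is better. A dash (-) marks a value that is not reported.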

Comparison Table
| Model Name | text-to-video Median Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | video-to-text Median Rank | video-to-text R@1 | video-to-text R@10 | video-to-text R@5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| languagebind-extending-video-language | 2.0 | 42.8 | 76.0 | 67.5 | 3.0 | 38.3 | 77.8 | 65.8 |
| unmasked-teacher-towards-training-efficient | - | 42.6 | 73.1 | 64.4 | - | 38.6 | 69.6 | 59.8 |
| videoclip-contrastive-pre-training-for-zero | - | 10.4 | 30.0 | 22.2 | - | - | - | - |
| multi-granularity-correspondence-learning-1 | - | 10.7 | - | 24.1 | - | - | - | - |
| mplug-2-a-modularized-multi-modal-foundation | - | 47.1 | 79.0 | 69.7 | - | - | - | - |
| omnivl-one-foundation-model-for-image | - | 34.6 | 66.6 | 58.4 | - | - | - | - |
| end-to-end-learning-of-visual-representations | - | 9.9 | 32.4 | 24.0 | - | - | - | - |
| hitea-hierarchical-temporal-aware-video | - | 29.9 | 62.9 | 54.2 | - | - | - | - |
| internvideo2-scaling-video-foundation-models | - | 51.9 | 82.5 | 75.3 | - | 50.9 | 81.8 | 73.4 |
| align-and-prompt-video-and-language-pre | 8 | 24.1 | 55.4 | 44.7 | - | - | - | - |
| revealing-single-frame-bias-for-video-and | - | 34.0 | 66.7 | 56.7 | - | - | - | - |
| multi-modal-transformer-for-video-retrieval | 66 | - | - | 14.4 | - | - | - | - |
| revealing-single-frame-bias-for-video-and | - | 28.4 | 59.5 | 50.2 | - | - | - | - |
| hitea-hierarchical-temporal-aware-video | - | 34.4 | 69.9 | 60.0 | - | - | - | - |
| gramian-multimodal-representation-learning | - | 54.8 | 83.9 | - | - | 52.9 | 82.9 | - |
| clip4clip-an-empirical-study-of-clip-for-end | 4 | 32.0 | 66.9 | 57.0 | - | - | - | - |
| noise-estimation-using-density-estimation-for | - | 8.0 | 29.3 | 21.3 | - | - | - | - |
| internvideo-general-video-foundation-models | - | 40.7 | - | - | - | 39.6 | - | - |
| clover-towards-a-unified-video-language | 6 | 26.4 | 60 | 49.5 | - | - | - | - |
| howtocaption-prompting-llms-to-transform | 3 | 37.6 | 73.3 | 62 | - | - | - | - |
| vast-a-vision-audio-subtitle-text-omni-1 | - | 49.3 | 73.9 | 68.3 | - | - | - | - |
| taco-token-aware-cascade-contrastive-learning | - | 9.8 | 33.4 | 25.0 | - | - | - | - |
| vid-tldr-training-free-token-merging-for | - | 42.1 | 72.4 | 63.9 | - | 37.7 | 69.4 | 59.8 |
| one-for-all-video-conversation-is-feasible | - | 40.9 | 73.5 | 64.7 | - | - | - | - |
| advancing-high-resolution-video-language | 15 | 14.6 | 44.1 | 34.4 | - | - | - | - |
| vatt-transformers-for-multimodal-self | 49 | - | 29.7 | - | - | - | - | - |
| internvideo2-scaling-video-foundation-models | - | 55.9 | 85.1 | 78.3 | - | 53.7 | 84.1 | 77.5 |
| object-aware-video-language-pre-training-for | 8.0 | 23.4 | 55.6 | 47.5 | - | - | - | - |
| miles-visual-bert-pre-training-with-injected | 7 | 26.1 | 56.9 | 47.2 | - | - | - | - |
| howtocaption-prompting-llms-to-transform | 1 | 50 | 81.4 | 73.2 | - | - | - | - |
| imagebind-one-embedding-space-to-bind-them | - | 36.8 | 70.0 | 61.8 | - | - | - | - |
| bridgeformer-bridging-video-text-retrieval | 7.0 | 26.0 | 56.4 | 46.4 | - | - | - | - |
| florence-a-new-foundation-model-for-computer | - | 37.6 | 72.6 | 63.8 | - | - | - | - |
| lat-latent-translation-with-cycle-consistency | 8 | 23.4 | 53.3 | 44.1 | 12 | 17.2 | 47.9 | 36.2 |
| learning-audio-video-modalities-from-image | - | 19.4 | 50.3 | 39.5 | - | - | - | - |
| languagebind-extending-video-language | 2 | 44.8 | 78.7 | 70.0 | 2 | 40.9 | 75.7 | 66.4 |
| frozen-in-time-a-joint-video-and-image | 7.0 | 24.7 | 57.2 | 46.9 | - | - | - | - |
| violet-end-to-end-video-language-transformers | - | 25.9 | 59.7 | 49.5 | - | - | - | - |
| seeing-what-you-miss-vision-language-pre | - | 30.9 | 65.0 | 54.4 | - | - | - | - |