Video Retrieval on MSVD
Metrics
text-to-video R@1
video-to-text R@1
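Both metrics are Recall@1 (R@1): the percentage of queries whose top-ranked result is a correct match. The sketch below shows one way to compute them from a text-by-video similarity matrix. The function name `recall_at_k`, the input layout, and the rule that a video query counts as a hit when any of its own captions is ranked in the top k are assumptions made for illustration. Evaluation protocols differ between papers, so this may not reproduce the exact numbers in the table.

```python
def recall_at_k(sim, gt_video, k=1):
    """Compute (text-to-video, video-to-text) Recall@k in percent.

    sim[t][v] is the similarity between text t and video v.
    gt_video[t] is the index of the ground-truth video for text t
    (several texts may point to the same video).
    Both argument names are hypothetical, chosen for this sketch.
    """
    num_texts = len(sim)
    num_videos = len(sim[0])

    # Text-to-video: for each text, rank all videos by descending similarity.
    t2v_hits = 0
    for t in range(num_texts):
        ranked = sorted(range(num_videos), key=lambda v: -sim[t][v])
        if gt_video[t] in ranked[:k]:
            t2v_hits += 1

    # Video-to-text: for each video, rank all texts by descending similarity.
    # Assumption: a hit is any top-k text whose ground-truth video is this one.
    v2t_hits = 0
    for v in range(num_videos):
        ranked = sorted(range(num_texts), key=lambda t: -sim[t][v])
        if any(gt_video[t] == v for t in ranked[:k]):
            v2t_hits += 1

    return 100.0 * t2v_hits / num_texts, 100.0 * v2t_hits / num_videos


# Toy example: 4 captions, 2 videos (captions 0-1 describe video 0, 2-3 video 1).
sim = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.6, 0.4],
]
gt_video = [0, 0, 1, 1]
t2v_r1, v2t_r1 = recall_at_k(sim, gt_video, k=1)
print(f"text-to-video R@1: {t2v_r1:.1f}, video-to-text R@1: {v2t_r1:.1f}")
```

In the toy example, two of the four captions retrieve their own video first, and one of the two videos retrieves one of its own captions first, so both scores are 50.0.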
Results
Performance of various models on the MSVD video retrieval benchmark. A dash (-) marks a metric with no value listed for that model.
Comparison Table
Model Name | text-to-video R@1 (%) | video-to-text R@1 (%) |
---|---|---|
internvideo2-scaling-video-foundation-models | 61.4 | 85.2 |
internvideo-general-video-foundation-models | 58.4 | 76.3 |
vlab-enhancing-video-language-pre-training-by | 57.5 | - |
improving-video-text-retrieval-by-multi | 51.8 | 69.3 |
diffusionret-generative-text-video-retrieval | 47.9 | 60.3 |
x-clip-end-to-end-multi-grained-contrastive | 50.4 | 66.8 |
mdmmt-2-multidomain-multimodal-transformer | 56.8 | - |
prototype-based-aleatoric-uncertainty-1 | 47.3 | 68.9 |
clip4clip-an-empirical-study-of-clip-for-end | 46.2 | 62.0 |
diffusionret-generative-text-video-retrieval | 46.6 | 61.9 |
a-straightforward-framework-for-video | 37.0 | 59.9 |
cap4video-what-can-auxiliary-captions-do-for | 51.8 | 70.0 |
hunyuan-tvr-for-text-video-retrivial | 59.0 | 73.0 |
use-what-you-have-video-retrieval-using | 19.8 | - |
vid-tldr-training-free-token-merging-for | 57.9 | 82.7 |
lightweight-attentional-feature-fusion-for | 45.4 | - |
cross-modal-retrieval-with-querybank | 48.0 | - |
x-pool-cross-modal-language-video-attention | 47.2 | 66.4 |
centerclip-token-clustering-for-efficient | 50.6 | 68.4 |
dual-modal-attention-enhanced-text-video | 48.7 | - |
hunyuan-tvr-for-text-video-retrivial | 58.2 | 69.1 |
side4video-spatial-temporal-side-network-for | 56.1 | - |
frozen-in-time-a-joint-video-and-image | 33.7 | - |
noise-estimation-using-density-estimation-for | 20.3 | - |