Video Retrieval On Vatex
Metrics
text-to-video R@1
text-to-video R@10
text-to-video R@5
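Text-to-video R@K (Recall@K) is the percentage of text queries whose ground-truth video appears among the top K retrieved videos. The sketch below computes it from a query-by-video similarity matrix. It assumes each text query has exactly one matching video and that query i matches video i; real VATEX evaluation protocols (e.g. several captions per video) may pair queries and videos differently. The function name `recall_at_k` and the toy scores are hypothetical.

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int) -> float:
    """Text-to-video Recall@K, in percent.

    Hypothetical helper. Assumes sim[i][j] is the similarity between text
    query i and video j, and that the ground-truth video for query i is video i.
    """
    if not sim:
        raise ValueError("empty similarity matrix")
    hits = 0
    for i, row in enumerate(sim):
        # 0-based rank of the ground-truth video: how many videos score strictly
        # higher. Ties are resolved in the ground truth's favour (optimistic).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example: 3 text queries x 3 videos (made-up scores).
sim = [
    [0.9, 0.1, 0.3],  # query 0: ground truth ranked 1st
    [0.8, 0.2, 0.5],  # query 1: ground truth ranked 3rd
    [0.1, 0.7, 0.6],  # query 2: ground truth ranked 2nd
]
print(f"R@1 = {recall_at_k(sim, 1):.1f}")  # R@1 = 33.3
print(f"R@2 = {recall_at_k(sim, 2):.1f}")  # R@2 = 66.7
```

Because R@K can only grow with K, R@1 ≤ R@5 ≤ R@10 for any model, and once K reaches the number of candidate videos the metric is trivially 100.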
Results
Performance of various models on this benchmark
Comparison table
Model name | text-to-video R@1 (%) | text-to-video R@10 (%) | text-to-video R@5 (%) |
---|---|---|---|
vast-a-vision-audio-subtitle-text-omni-1 | 83.0 | 99.2 | 98.2 |
cross-modal-retrieval-with-querybank | 58.8 | 93.8 | - |
clip2video-mastering-video-text-retrieval-via | 57.3 | 90 | - |
side4video-spatial-temporal-side-network-for | 68.8 | 97.0 | 93.5 |
valor-vision-audio-language-omni-perception | 78.5 | 98.7 | 97.1 |
cap4video-what-can-auxiliary-captions-do-for | 66.6 | 97.0 | 93.1 |
internvideo2-scaling-video-foundation-models | 75.5 | - | - |
gramian-multimodal-representation-learning | 87.7 | 100 | - |
ts2-net-token-shift-and-selection-transformer | 59.1 | 95.2 | - |
lightweight-attentional-feature-fusion-for | 59.1 | 91.7 | - |
unmasked-teacher-towards-training-efficient | 72 | 97.8 | 95.1 |
internvideo-general-video-foundation-models | 71.1 | - | - |
holistic-features-are-almost-sufficient-for | 63.6 | 96.1 | 91.9 |