HyperAI

Video Retrieval on ActivityNet

Metrics

text-to-video Median Rank
text-to-video R@1
text-to-video R@5
text-to-video R@50
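The page does not define these metrics. The following is a minimal sketch of how they are commonly computed, assuming a square similarity matrix in which the correct video for text query i is video i. The function name `retrieval_metrics` and the toy scores are illustrative assumptions, not taken from any of the listed papers.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 50)):
    """Compute text-to-video Recall@K (in percent) and Median Rank.

    Assumption: sim[i][j] is the similarity between text query i and
    video j, and video i is the single correct match for query i.
    """
    ranks = []
    for i, row in enumerate(sim):
        true_score = row[i]
        # Rank 1 means the correct video scored highest.
        # Ties are resolved in favor of the correct video.
        rank = 1 + sum(1 for score in row if score > true_score)
        ranks.append(rank)

    # R@K: share of queries whose correct video appears in the top K.
    metrics = {
        f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks
    }
    # Median Rank: median position of the correct video over all queries.
    metrics["Median Rank"] = median(ranks)
    return metrics


# Toy example with 4 queries and 4 videos (scores are made up).
toy_sim = [
    [0.9, 0.1, 0.2, 0.3],  # correct video ranked 1st
    [0.8, 0.7, 0.1, 0.2],  # correct video ranked 2nd
    [0.1, 0.2, 0.9, 0.3],  # correct video ranked 1st
    [0.9, 0.8, 0.7, 0.1],  # correct video ranked 4th
]
print(retrieval_metrics(toy_sim, ks=(1, 2)))
```

Higher R@K values are better, while a lower Median Rank is better.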

Results

Performance results of various models on this benchmark

| Model name | text-to-video Median Rank | text-to-video R@1 | text-to-video R@5 | text-to-video R@50 | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- | --- |
| HD-VILA | 4 | 28.5 | 57.4 | 94 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | |
| Singularity | - | 47.1 | 75.5 | - | Revealing Single Frame Bias for Video-and-Language Learning | |
| RTQ | - | 53.5 | 81.4 | - | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| MMT-Pretrained | 3.3 | 28.7 | 61.4 | 94.5 | Multi-modal Transformer for Video Retrieval | |
| Ours | - | 25.4 | 59.1 | - | Video and Text Matching with Conditioned Embeddings | |
| CAMoE | 1 | 51.0 | 77.7 | - | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | |
| InternVideo | - | 62.2 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| DiffusionRet | 2.0 | 45.8 | 75.6 | - | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | |
| X-CLIP | - | 46.2 | 75.5 | - | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | |
| HiTeA | - | 49.7 | 77.1 | - | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| VALOR | - | 70.1 | 90.8 | - | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| CLIP4Clip | 2 | 40.5 | 73.4 | 98.2 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | |
| HBI | 2.0 | 42.2 | 73.0 | - | Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | |
| EMCL-Net | - | 41.2 | 72.7 | - | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| EMCL-Net++ | - | 50.6 | 78.7 | 98.1 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| COSA | - | 67.3 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| DiffusionRet+QB-Norm | 2.0 | 48.1 | - | - | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | |
| MMT | 5 | 22.7 | 54.2 | 93.2 | Multi-modal Transformer for Video Retrieval | |
| CLIP-ViP | 1 | 61.4 | 85.7 | - | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | |
| DMAE (ViT-B/32) | 1.0 | 53.4 | 80.7 | - | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | - |