Natural Language Moment Retrieval on MAD
Evaluation Metrics
R@1,IoU=0.1
R@1,IoU=0.3
R@1,IoU=0.5
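As a rough illustration of how metrics of this form are typically computed: R@1 at a given IoU threshold is the percentage of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal IoU at or above that threshold. The sketch below is not taken from any of the listed papers. The names `temporal_iou` and `recall_at_1` are hypothetical, moments are assumed to be `(start, end)` pairs in seconds, and the use of `>=` rather than a strict `>` is an assumption that may differ between evaluation scripts.

```python
from typing import List, Tuple

# Assumption: a moment is a (start_seconds, end_seconds) pair.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Span], gts: List[Span], iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumption: a prediction counts as a hit when IoU >= iou_threshold.
    """
    if len(top1_preds) != len(gts):
        raise ValueError("expected one top-1 prediction per ground-truth moment")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Hypothetical toy data: two queries, one top-1 prediction each.
    predictions = [(0.0, 10.0), (0.0, 10.0)]
    ground_truth = [(5.0, 15.0), (0.0, 10.0)]
    for threshold in (0.1, 0.3, 0.5):
        print(f"R@1,IoU={threshold}: {recall_at_1(predictions, ground_truth, threshold):.1f}")
```

Because a stricter threshold can only remove hits, R@1 is non-increasing from IoU=0.1 to IoU=0.5, which matches the pattern in every row of the results table.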
Evaluation Results
Performance of each model on this benchmark
Model Name | R@1,IoU=0.1 | R@1,IoU=0.3 | R@1,IoU=0.5 | Paper Title | Repository |
---|---|---|---|---|---|
ReVisionLLM | 17.3 | 12.7 | 6.7 | ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos | |
RGNet | 12.43 | 9.48 | 5.61 | RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos | |
VLG-Net | 3.50 | 2.63 | 1.61 | MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | |
VLG-Net + Guidance Model | 5.60 | 4.28 | 2.48 | Localizing Moments in Long Video Via Multimodal Guidance | - |
Random Chance | 0.09 | 0.04 | 0.01 | MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | |
CLIP | 6.57 | 3.13 | 1.39 | MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | |
Zero-Shot CLIP + Guidance Model | 9.3 | 4.65 | 2.16 | Localizing Moments in Long Video Via Multimodal Guidance | - |