Natural Language Moment Retrieval On
Metrics
R@1,IoU=0.5
R@1,IoU=0.7
R@5,IoU=0.5
R@5,IoU=0.7
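The page does not define these metrics. As they are conventionally used in moment retrieval, "R@n,IoU=m" is the percentage of queries for which at least one of the top-n predicted segments overlaps the ground-truth segment with a temporal IoU of at least m. The following is a minimal sketch under that assumption, not the benchmark's official evaluation code. The names `temporal_iou` and `recall_at_n` are hypothetical, and the sample segments are made up purely for illustration.

```python
from typing import Sequence, Tuple

# A moment is a (start, end) pair in seconds; this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(
    predictions: Sequence[Sequence[Segment]],
    ground_truths: Sequence[Segment],
    n: int,
    iou_threshold: float,
) -> float:
    """Percentage of queries with a hit among the top-n ranked predictions.

    predictions[i] is the ranked list (best first) for query i;
    ground_truths[i] is the single annotated segment for query i.
    """
    if not ground_truths:
        return 0.0
    hits = 0
    for ranked, gt in zip(predictions, ground_truths):
        # A query counts as a hit if any of its top-n segments clears the threshold.
        if any(temporal_iou(p, gt) >= iou_threshold for p in ranked[:n]):
            hits += 1
    return 100.0 * hits / len(ground_truths)


# Illustrative data only: two queries, two ranked predictions each.
sample_predictions = [
    [(0.0, 10.0), (5.0, 15.0)],   # top-1 misses (IoU = 1/3), second is exact
    [(0.0, 4.0), (20.0, 30.0)],   # top-1 has IoU = 0.8 with the ground truth
]
sample_ground_truths = [(5.0, 15.0), (0.0, 5.0)]

r1_at_05 = recall_at_n(sample_predictions, sample_ground_truths, n=1, iou_threshold=0.5)
r5_at_07 = recall_at_n(sample_predictions, sample_ground_truths, n=5, iou_threshold=0.7)
```

With this sample data, only the second query is matched by its top-1 prediction, while both queries are matched once the top five predictions are considered.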
Results
Performance results of various models on this benchmark
Model Name | R@1,IoU=0.5 | R@1,IoU=0.7 | R@5,IoU=0.5 | R@5,IoU=0.7 | Paper Title | Repository |
---|---|---|---|---|---|---|
DRN | 45.45 | 24.36 | 77.97 | 50.30 | Dense Regression Network for Video Grounding | |
GVL | 49.18 | 29.69 | - | - | Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos | |
UnLoc-B | 48.0 | 29.7 | 81.5 | 61.4 | UnLoc: A Unified Framework for Video Localization Tasks | |
UnLoc-L | 48.3 | 30.2 | 79.2 | 61.3 | UnLoc: A Unified Framework for Video Localization Tasks | |
GVL (paragraph-level) | 60.67 | 38.55 | - | - | Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos | |
LLaVA-MR | 55.16 | 35.68 | - | - | LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval | - |
VLG-Net | 46.32 | 29.82 | 77.15 | 63.33 | VLG-Net: Video-Language Graph Matching Network for Video Grounding | |
UniMD+Sync. | - | - | 80.54 | 57.04 | UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection | |