Temporal Relation Extraction On Vinoground
Metrics
Group Score
Text Score
Video Score
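This page does not say how the three metrics are computed. The sketch below assumes the Winoground-style definitions commonly used for counterfactual pair benchmarks: each example is a pair of videos with a pair of captions, the text score counts pairs where the correct caption is picked for both videos, the video score counts pairs where the correct video is picked for both captions, and the group score counts pairs where all four choices are correct. The `PairResult` layout and the `scores` function are hypothetical names for illustration, not part of any official evaluation code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PairResult:
    """Correctness flags for one counterfactual pair (hypothetical layout)."""
    # Was the right caption chosen for video A and for video B?
    text_correct: tuple[bool, bool]
    # Was the right video chosen for caption A and for caption B?
    video_correct: tuple[bool, bool]


def scores(results: list[PairResult]) -> dict[str, float]:
    """Return text, video, and group scores as percentages.

    Assumes Winoground-style scoring: a pair only counts toward a
    score when every question of that type is answered correctly.
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one pair result")
    text = sum(all(r.text_correct) for r in results)
    video = sum(all(r.video_correct) for r in results)
    group = sum(all(r.text_correct) and all(r.video_correct) for r in results)
    return {
        "text": 100.0 * text / n,
        "video": 100.0 * video / n,
        "group": 100.0 * group / n,
    }


example = [
    PairResult((True, True), (True, True)),      # counts for text, video, group
    PairResult((True, True), (True, False)),     # counts for text only
    PairResult((True, False), (True, True)),     # counts for video only
    PairResult((False, False), (False, False)),  # counts for nothing
]
print(scores(example))
```

Under these assumed definitions the group score can never exceed either the text score or the video score, since a pair must pass both to count toward it.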
Results
Performance results of various models on this benchmark
| Model Name | Group Score | Text Score | Video Score | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | 24.6 | 54 | 38.2 | - | - |
| LLaVA-NeXT-Video-7B | 6.2 | 21.8 | 25.6 | - | - |
| LLaVA-NeXT-Video-7B (CoT) | 6.8 | 21.8 | 26.2 | - | - |
| LLaVA-NeXT-Video-34B | 3.8 | 23 | 21.2 | - | - |
| Qwen2-VL-7B | 15.2 | 40.2 | 32.4 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | - |
| GPT-4o (CoT) | 35 | 59.2 | 51 | - | - |
| Phi-3.5-Vision | 6.2 | 24 | 22.4 | - | - |
| Claude 3.5 Sonnet | 10.6 | 32.8 | 28.8 | - | - |
| ImageBind | 0.6 | 9.4 | 3.4 | ImageBind: One Embedding Space To Bind Them All | - |
| LLaVA-OneVision-Qwen2-72B | 21.8 | 48.4 | 35.2 | LLaVA-OneVision: Easy Visual Task Transfer | - |
| VTimeLLM | 5.2 | 19.4 | 27 | VTimeLLM: Empower LLM to Grasp Video Moments | - |
| MA-LMM-Vicuna-7B | 6.8 | 23.8 | 25.6 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | - |
| LLaVA-OneVision-Qwen2-7B | 14.6 | 41.6 | 29.4 | LLaVA-OneVision: Easy Visual Task Transfer | - |
| Gemini-1.5-Pro (CoT) | 12.4 | 37 | 27.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | - |
| Gemini-1.5-Pro | 10.2 | 35.8 | 22.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | - |
| InternLM-XC-2.5 | 9.6 | 28.8 | 27.8 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | - |
| VideoCLIP | 1.2 | 17 | 2.8 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | - |
| Video-LLaVA-7B | 6.6 | 24.8 | 25.8 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | - |
| LanguageBind | 1.2 | 10.6 | 5 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | - |
| VideoLLaMA2-72B | 8.4 | 36.2 | 21.8 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | - |