Temporal Relation Extraction On Vinoground
Metrics
Group Score
Text Score
Video Score
Results
Performance results of various models on this benchmark
| Model Name | Group Score | Text Score | Video Score | Paper Title |
| --- | --- | --- | --- | --- |
| GPT-4o (CoT) | 35 | 59.2 | 51 | - |
| GPT-4o | 24.6 | 54 | 38.2 | - |
| LLaVA-OneVision-Qwen2-72B | 21.8 | 48.4 | 35.2 | LLaVA-OneVision: Easy Visual Task Transfer |
| Qwen2-VL-72B | 17.4 | 50.4 | 32.6 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution |
| Qwen2-VL-7B | 15.2 | 40.2 | 32.4 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution |
| LLaVA-OneVision-Qwen2-7B | 14.6 | 41.6 | 29.4 | LLaVA-OneVision: Easy Visual Task Transfer |
| Gemini-1.5-Pro (CoT) | 12.4 | 37 | 27.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |
| MiniCPM-2.6 | 11.2 | 32.6 | 29.2 | MiniCPM-V: A GPT-4V Level MLLM on Your Phone |
| Claude 3.5 Sonnet | 10.6 | 32.8 | 28.8 | - |
| Gemini-1.5-Pro | 10.2 | 35.8 | 22.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |
| InternLM-XC-2.5 | 9.6 | 28.8 | 27.8 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output |
| InternLM-XC-2.5 (CoT) | 9 | 30.8 | 28.4 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output |
| VideoLLaMA2-72B | 8.4 | 36.2 | 21.8 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
| LLaVA-NeXT-Video-7B (CoT) | 6.8 | 21.8 | 26.2 | - |
| MA-LMM-Vicuna-7B | 6.8 | 23.8 | 25.6 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
| Video-LLaVA-7B | 6.6 | 24.8 | 25.8 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection |
| LLaVA-NeXT-Video-7B | 6.2 | 21.8 | 25.6 | - |
| Phi-3.5-Vision | 6.2 | 24 | 22.4 | - |
| VTimeLLM | 5.2 | 19.4 | 27 | VTimeLLM: Empower LLM to Grasp Video Moments |
| LLaVA-NeXT-Video-34B (CoT) | 5.2 | 25.8 | 22.2 | - |