Temporal Relation Extraction On Vinoground
Evaluation metrics
Group Score
Text Score
Video Score
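The page lists the three metrics without defining them. The sketch below shows one way they could be computed, assuming Winoground-style definitions: each group holds two counterfactual video-caption pairs, the text score counts a group when both caption-selection questions are answered correctly, the video score counts a group when both video-selection questions are answered correctly, and the group score counts a group only when all four are correct. The `GroupResult` record and the `vinoground_style_scores` function are hypothetical names introduced for illustration and are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GroupResult:
    """Hypothetical per-group record (an assumption, not the benchmark's format)."""
    # Whether the model chose the right caption for video 1 and for video 2.
    text_correct: Tuple[bool, bool]
    # Whether the model chose the right video for caption 1 and for caption 2.
    video_correct: Tuple[bool, bool]


def vinoground_style_scores(results: List[GroupResult]) -> Dict[str, float]:
    """Return text, video, and group scores as percentages of groups."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one group to compute scores")

    # A group counts toward a score only if every question of that kind is right.
    text_hits = sum(all(r.text_correct) for r in results)
    video_hits = sum(all(r.video_correct) for r in results)
    group_hits = sum(
        all(r.text_correct) and all(r.video_correct) for r in results
    )

    return {
        "text_score": 100.0 * text_hits / n,
        "video_score": 100.0 * video_hits / n,
        "group_score": 100.0 * group_hits / n,
    }
```

Under these assumptions the group score can never exceed the text score or the video score, which matches the pattern in every row of the results table.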
Results
Performance of each model on this benchmark
| Model | Group Score | Text Score | Video Score | Paper Title | Repository |
|---|---|---|---|---|---|
| GPT-4o | 24.6 | 54 | 38.2 | - | - |
| LLaVA-NeXT-Video-7B | 6.2 | 21.8 | 25.6 | - | - |
| LLaVA-NeXT-Video-7B (CoT) | 6.8 | 21.8 | 26.2 | - | - |
| LLaVA-NeXT-Video-34B | 3.8 | 23 | 21.2 | - | - |
| Qwen2-VL-7B | 15.2 | 40.2 | 32.4 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | |
| GPT-4o (CoT) | 35 | 59.2 | 51 | - | - |
| Phi-3.5-Vision | 6.2 | 24 | 22.4 | - | - |
| Claude 3.5 Sonnet | 10.6 | 32.8 | 28.8 | - | - |
| ImageBind | 0.6 | 9.4 | 3.4 | ImageBind: One Embedding Space To Bind Them All | |
| LLaVA-OneVision-Qwen2-72B | 21.8 | 48.4 | 35.2 | LLaVA-OneVision: Easy Visual Task Transfer | |
| VTimeLLM | 5.2 | 19.4 | 27 | VTimeLLM: Empower LLM to Grasp Video Moments | |
| MA-LMM-Vicuna-7B | 6.8 | 23.8 | 25.6 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | |
| LLaVA-OneVision-Qwen2-7B | 14.6 | 41.6 | 29.4 | LLaVA-OneVision: Easy Visual Task Transfer | |
| Gemini-1.5-Pro (CoT) | 12.4 | 37 | 27.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | |
| Gemini-1.5-Pro | 10.2 | 35.8 | 22.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | |
| InternLM-XC-2.5 | 9.6 | 28.8 | 27.8 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | |
| VideoCLIP | 1.2 | 17 | 2.8 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | |
| Video-LLaVA-7B | 6.6 | 24.8 | 25.8 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | |
| LanguageBind | 1.2 | 10.6 | 5 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | |
| VideoLLaMA2-72B | 8.4 | 36.2 | 21.8 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | |