Temporal Relation Extraction On Vinoground

Evaluation Metrics

Group Score

Text Score

Video Score
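
The page lists these three metrics without defining them. The sketch below assumes they follow the Winoground-style convention for counterfactual pairs: a pair earns the text score only if the model picks the matching caption for both videos, the video score only if it picks the matching video for both captions, and the group score only if it does both. The `PairResult` record and the `vinoground_scores` function are hypothetical names introduced for illustration; they are not part of any official evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PairResult:
    """Hypothetical per-pair record for one counterfactual (caption, video) pair."""
    # For each of the two videos: did the model choose the matching caption?
    text_correct: Tuple[bool, bool]
    # For each of the two captions: did the model choose the matching video?
    video_correct: Tuple[bool, bool]


def vinoground_scores(results: List[PairResult]) -> Dict[str, float]:
    """Return text, video, and group scores as percentages (assumed definitions)."""
    if not results:
        raise ValueError("results must contain at least one pair")
    n = len(results)
    # A pair counts toward a score only when BOTH of its questions are answered correctly.
    text_hits = sum(all(r.text_correct) for r in results)
    video_hits = sum(all(r.video_correct) for r in results)
    # The group score requires all four answers for the pair to be correct.
    group_hits = sum(all(r.text_correct) and all(r.video_correct) for r in results)
    return {
        "text": 100.0 * text_hits / n,
        "video": 100.0 * video_hits / n,
        "group": 100.0 * group_hits / n,
    }
```

Under these assumed definitions the group score can never exceed the text score or the video score, which is consistent with every row in the results table.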

Evaluation Results

Performance of each model on this benchmark

Model Name	Group Score	Text Score	Video Score	Paper Title	Repository
GPT-4o	24.6	54	38.2	-	-
LLaVA-NeXT-Video-7B	6.2	21.8	25.6	-	-
LLaVA-NeXT-Video-7B (CoT)	6.8	21.8	26.2	-	-
LLaVA-NeXT-Video-34B	3.8	23	21.2	-	-
Qwen2-VL-7B	15.2	40.2	32.4	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
GPT-4o (CoT)	35	59.2	51	-	-
Phi-3.5-Vision	6.2	24	22.4	-	-
Claude 3.5 Sonnet	10.6	32.8	28.8	-	-
ImageBind	0.6	9.4	3.4	ImageBind: One Embedding Space To Bind Them All
LLaVA-OneVision-Qwen2-72B	21.8	48.4	35.2	LLaVA-OneVision: Easy Visual Task Transfer
VTimeLLM	5.2	19.4	27	VTimeLLM: Empower LLM to Grasp Video Moments
MA-LMM-Vicuna-7B	6.8	23.8	25.6	MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
LLaVA-OneVision-Qwen2-7B	14.6	41.6	29.4	LLaVA-OneVision: Easy Visual Task Transfer
Gemini-1.5-Pro (CoT)	12.4	37	27.6	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini-1.5-Pro	10.2	35.8	22.6	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
InternLM-XC-2.5	9.6	28.8	27.8	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
VideoCLIP	1.2	17	2.8	VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Video-LLaVA-7B	6.6	24.8	25.8	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
LanguageBind	1.2	10.6	5	LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
VideoLLaMA2-72B	8.4	36.2	21.8	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
