Temporal Relation Extraction On Vinoground

Metrics

Group Score

Text Score

Video Score
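
This page lists the three metrics without defining them. The sketch below shows one way they could be computed. It assumes the Winoground-style convention, where each test item is a counterfactual pair of two videos and two captions. Under that assumption, the text score counts a pair only if the right caption is picked for both videos, the video score counts a pair only if the right video is picked for both captions, and the group score requires both at once. The `PairResult` record and the `vinoground_scores` function are hypothetical names introduced for illustration, not part of any official evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PairResult:
    """Hypothetical record of a model's answers on one counterfactual pair."""
    # Was the correct caption chosen for (first video, second video)?
    text_correct: Tuple[bool, bool]
    # Was the correct video chosen for (first caption, second caption)?
    video_correct: Tuple[bool, bool]


def vinoground_scores(results: List[PairResult]) -> Dict[str, float]:
    """Return group, text, and video scores as percentages.

    Assumed convention: a pair counts toward a score only if every
    question belonging to that score was answered correctly.
    """
    if not results:
        raise ValueError("results must contain at least one pair")

    n = len(results)
    text_hits = sum(all(r.text_correct) for r in results)
    video_hits = sum(all(r.video_correct) for r in results)
    group_hits = sum(
        all(r.text_correct) and all(r.video_correct) for r in results
    )
    return {
        "group": 100.0 * group_hits / n,
        "text": 100.0 * text_hits / n,
        "video": 100.0 * video_hits / n,
    }


# Toy data only, not real benchmark output.
example = [
    PairResult((True, True), (True, True)),
    PairResult((True, True), (True, False)),
    PairResult((False, True), (True, True)),
    PairResult((False, False), (False, False)),
]
print(vinoground_scores(example))
```

Under this convention the group score can never exceed the text score or the video score, which is consistent with every row of the results table.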

Results

Performance of various models on this benchmark

Model	Group Score	Text Score	Video Score	Paper Title
GPT-4o (CoT)	35	59.2	51	-
GPT-4o	24.6	54	38.2	-
LLaVA-OneVision-Qwen2-72B	21.8	48.4	35.2	LLaVA-OneVision: Easy Visual Task Transfer
Qwen2-VL-72B	17.4	50.4	32.6	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Qwen2-VL-7B	15.2	40.2	32.4	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
LLaVA-OneVision-Qwen2-7B	14.6	41.6	29.4	LLaVA-OneVision: Easy Visual Task Transfer
Gemini-1.5-Pro (CoT)	12.4	37	27.6	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
MiniCPM-2.6	11.2	32.6	29.2	MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Claude 3.5 Sonnet	10.6	32.8	28.8	-
Gemini-1.5-Pro	10.2	35.8	22.6	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
InternLM-XC-2.5	9.6	28.8	27.8	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
InternLM-XC-2.5 (CoT)	9	30.8	28.4	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
VideoLLaMA2-72B	8.4	36.2	21.8	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
LLaVA-NeXT-Video-7B (CoT)	6.8	21.8	26.2	-
MA-LMM-Vicuna-7B	6.8	23.8	25.6	MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Video-LLaVA-7B	6.6	24.8	25.8	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
LLaVA-NeXT-Video-7B	6.2	21.8	25.6	-
Phi-3.5-Vision	6.2	24	22.4	-
VTimeLLM	5.2	19.4	27	VTimeLLM: Empower LLM to Grasp Video Moments
LLaVA-NeXT-Video-34B (CoT)	5.2	25.8	22.2	-
