HyperAI超神经

Temporal Relation Extraction On Vinoground

Evaluation Metrics

Group Score
Text Score
Video Score
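
This page does not define the three metrics. The sketch below assumes Vinoground follows the Winoground-style convention: each example is a counterfactual pair of two videos and two captions; the text score counts pairs where both videos are matched to their correct captions, the video score counts pairs where both captions are matched to their correct videos, and the group score counts pairs where all of these are right. The `PairResult` record, its field names, and the `score` helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PairResult:
    """Outcome for one counterfactual pair (hypothetical record layout)."""
    text_correct: tuple[bool, bool]   # per video: was the right caption chosen?
    video_correct: tuple[bool, bool]  # per caption: was the right video chosen?


def score(results: Iterable[PairResult]) -> dict[str, float]:
    """Return text, video, and group scores as percentages of pairs."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    # A pair counts toward a score only if every question of that kind is right.
    text = sum(all(r.text_correct) for r in results)
    video = sum(all(r.video_correct) for r in results)
    group = sum(all(r.text_correct) and all(r.video_correct) for r in results)
    return {
        "text_score": 100.0 * text / n,
        "video_score": 100.0 * video / n,
        "group_score": 100.0 * group / n,
    }


if __name__ == "__main__":
    demo = [
        PairResult((True, True), (True, True)),      # everything right
        PairResult((True, True), (True, False)),     # text right, video wrong
        PairResult((False, True), (True, True)),     # video right, text wrong
        PairResult((False, False), (False, False)),  # everything wrong
    ]
    print(score(demo))
```

Under this assumed definition the group score can never exceed the text score or the video score, which matches every row in the results table (for example GPT-4o: group 24.6, text 54, video 38.2).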

Evaluation Results

Performance of each model on this benchmark

| Model | Group Score | Text Score | Video Score | Paper Title | Repository |
|---|---|---|---|---|---|
| GPT-4o | 24.6 | 54 | 38.2 | - | - |
| LLaVA-NeXT-Video-7B | 6.2 | 21.8 | 25.6 | - | - |
| LLaVA-NeXT-Video-7B (CoT) | 6.8 | 21.8 | 26.2 | - | - |
| LLaVA-NeXT-Video-34B | 3.8 | 23 | 21.2 | - | - |
| Qwen2-VL-7B | 15.2 | 40.2 | 32.4 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | |
| GPT-4o (CoT) | 35 | 59.2 | 51 | - | - |
| Phi-3.5-Vision | 6.2 | 24 | 22.4 | - | - |
| Claude 3.5 Sonnet | 10.6 | 32.8 | 28.8 | - | - |
| ImageBind | 0.6 | 9.4 | 3.4 | ImageBind: One Embedding Space To Bind Them All | |
| LLaVA-OneVision-Qwen2-72B | 21.8 | 48.4 | 35.2 | LLaVA-OneVision: Easy Visual Task Transfer | |
| VTimeLLM | 5.2 | 19.4 | 27 | VTimeLLM: Empower LLM to Grasp Video Moments | |
| MA-LMM-Vicuna-7B | 6.8 | 23.8 | 25.6 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | |
| LLaVA-OneVision-Qwen2-7B | 14.6 | 41.6 | 29.4 | LLaVA-OneVision: Easy Visual Task Transfer | |
| Gemini-1.5-Pro (CoT) | 12.4 | 37 | 27.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | |
| Gemini-1.5-Pro | 10.2 | 35.8 | 22.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | |
| InternLM-XC-2.5 | 9.6 | 28.8 | 27.8 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | |
| VideoCLIP | 1.2 | 17 | 2.8 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | |
| Video-LLaVA-7B | 6.6 | 24.8 | 25.8 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | |
| LanguageBind | 1.2 | 10.6 | 5 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | |
| VideoLLaMA2-72B | 8.4 | 36.2 | 21.8 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | |