Temporal Relation Extraction On Vinoground

Metrics

Group Score
Text Score
Video Score
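
This page lists the three metrics without defining them. The sketch below shows one way such scores could be computed. It assumes Winoground-style definitions: each benchmark item is a pair of videos with a pair of captions, the text score counts a pair when the correct caption is chosen for both videos, the video score counts a pair when the correct video is chosen for both captions, and the group score counts a pair only when all four choices are correct. The `PairResult` type and `compute_scores` function are hypothetical names introduced for illustration and are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairResult:
    """Hypothetical per-pair record of the four binary outcomes."""
    text_v0: bool   # given video 0, the correct caption was chosen
    text_v1: bool   # given video 1, the correct caption was chosen
    video_c0: bool  # given caption 0, the correct video was chosen
    video_c1: bool  # given caption 1, the correct video was chosen


def compute_scores(results: List[PairResult]) -> Dict[str, float]:
    """Return group, text and video scores as percentages of pairs.

    Assumed definitions (Winoground-style), not confirmed by this page.
    """
    if not results:
        raise ValueError("results must contain at least one pair")
    n = len(results)
    # A pair counts toward the text score only if both caption choices are right.
    text = sum(r.text_v0 and r.text_v1 for r in results)
    # A pair counts toward the video score only if both video choices are right.
    video = sum(r.video_c0 and r.video_c1 for r in results)
    # A pair counts toward the group score only if all four choices are right.
    group = sum(
        r.text_v0 and r.text_v1 and r.video_c0 and r.video_c1 for r in results
    )
    return {
        "group": 100.0 * group / n,
        "text": 100.0 * text / n,
        "video": 100.0 * video / n,
    }


# Toy data for illustration only; these are not benchmark results.
example = [
    PairResult(True, True, True, True),
    PairResult(True, True, False, True),
    PairResult(False, True, True, True),
    PairResult(False, False, False, False),
]
example_scores = compute_scores(example)
```

Under these assumed definitions the group score can never exceed the text score or the video score, which matches every row reported on this page.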

Results

Performance results of various models on this benchmark

| Model Name | Group Score | Text Score | Video Score | Paper Title |
|---|---|---|---|---|
| GPT-4o | 24.6 | 54 | 38.2 | - |
| LLaVA-NeXT-Video-7B | 6.2 | 21.8 | 25.6 | - |
| LLaVA-NeXT-Video-7B (CoT) | 6.8 | 21.8 | 26.2 | - |
| LLaVA-NeXT-Video-34B | 3.8 | 23 | 21.2 | - |
| Qwen2-VL-7B | 15.2 | 40.2 | 32.4 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution |
| GPT-4o (CoT) | 35 | 59.2 | 51 | - |
| Phi-3.5-Vision | 6.2 | 24 | 22.4 | - |
| Claude 3.5 Sonnet | 10.6 | 32.8 | 28.8 | - |
| ImageBind | 0.6 | 9.4 | 3.4 | ImageBind: One Embedding Space To Bind Them All |
| LLaVA-OneVision-Qwen2-72B | 21.8 | 48.4 | 35.2 | LLaVA-OneVision: Easy Visual Task Transfer |
| VTimeLLM | 5.2 | 19.4 | 27 | VTimeLLM: Empower LLM to Grasp Video Moments |
| MA-LMM-Vicuna-7B | 6.8 | 23.8 | 25.6 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
| LLaVA-OneVision-Qwen2-7B | 14.6 | 41.6 | 29.4 | LLaVA-OneVision: Easy Visual Task Transfer |
| Gemini-1.5-Pro (CoT) | 12.4 | 37 | 27.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |
| Gemini-1.5-Pro | 10.2 | 35.8 | 22.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |
| InternLM-XC-2.5 | 9.6 | 28.8 | 27.8 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output |
| VideoCLIP | 1.2 | 17 | 2.8 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding |
| Video-LLaVA-7B | 6.6 | 24.8 | 25.8 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection |
| LanguageBind | 1.2 | 10.6 | 5 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| VideoLLaMA2-72B | 8.4 | 36.2 | 21.8 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |