HyperAI

Temporal Relation Extraction On Vinoground

Metrics

Group Score
Text Score
Video Score

Results

Performance results of various models on this benchmark

| Model Name | Group Score | Text Score | Video Score | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | 24.6 | 54 | 38.2 | - | - |
| LLaVA-NeXT-Video-7B | 6.2 | 21.8 | 25.6 | - | - |
| LLaVA-NeXT-Video-7B (CoT) | 6.8 | 21.8 | 26.2 | - | - |
| LLaVA-NeXT-Video-34B | 3.8 | 23 | 21.2 | - | - |
| Qwen2-VL-7B | 15.2 | 40.2 | 32.4 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | - |
| GPT-4o (CoT) | 35 | 59.2 | 51 | - | - |
| Phi-3.5-Vision | 6.2 | 24 | 22.4 | - | - |
| Claude 3.5 Sonnet | 10.6 | 32.8 | 28.8 | - | - |
| ImageBind | 0.6 | 9.4 | 3.4 | ImageBind: One Embedding Space To Bind Them All | - |
| LLaVA-OneVision-Qwen2-72B | 21.8 | 48.4 | 35.2 | LLaVA-OneVision: Easy Visual Task Transfer | - |
| VTimeLLM | 5.2 | 19.4 | 27 | VTimeLLM: Empower LLM to Grasp Video Moments | - |
| MA-LMM-Vicuna-7B | 6.8 | 23.8 | 25.6 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | - |
| LLaVA-OneVision-Qwen2-7B | 14.6 | 41.6 | 29.4 | LLaVA-OneVision: Easy Visual Task Transfer | - |
| Gemini-1.5-Pro (CoT) | 12.4 | 37 | 27.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | - |
| Gemini-1.5-Pro | 10.2 | 35.8 | 22.6 | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | - |
| InternLM-XC-2.5 | 9.6 | 28.8 | 27.8 | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | - |
| VideoCLIP | 1.2 | 17 | 2.8 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | - |
| Video-LLaVA-7B | 6.6 | 24.8 | 25.8 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | - |
| LanguageBind | 1.2 | 10.6 | 5 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | - |
| VideoLLaMA2-72B | 8.4 | 36.2 | 21.8 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | - |