Zero Shot Video Question Answer On Egoschema 1
Metrics
Accuracy
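The page does not spell out how Accuracy is computed. As a minimal sketch, assuming the usual multiple-choice definition (percentage of questions where the predicted option matches the ground-truth option), it could look like the following. The function names `accuracy` and `chance_level` are hypothetical and the sample data is illustrative, not taken from the benchmark. The 20.0 reported for the Random row is consistent with five answer options per question.

```python
def accuracy(predictions, answers):
    """Return the percentage of questions answered correctly.

    Both arguments are sequences of option indices, one per question.
    Assumption: plain multiple-choice accuracy; the page does not define it.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def chance_level(num_options):
    """Expected accuracy (in percent) of uniform random guessing."""
    return 100.0 / num_options


# Illustrative data only: four questions, three answered correctly.
example_predictions = [0, 3, 2, 4]
example_answers = [0, 1, 2, 4]
example_score = accuracy(example_predictions, example_answers)
random_baseline = chance_level(5)
```

With the illustrative data above, `example_score` is 75.0 and `random_baseline` is 20.0.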
Results
Performance results of various models on this benchmark
| Model name | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| TimeChat (7B) | 33.0 | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | |
| VideoChat2_mistral | 54.4 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| Tarsier (34B) | 61.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | |
| LLoVi (GPT-3.5) | 50.3 | A Simple LLM Framework for Long-Range Video Question-Answering | |
| InternVideo | 32.1 | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| VideoTree (GPT4) | 61.1 | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | |
| Vamos (GPT-4o) | 53.6 | Vamos: Versatile Action Models for Video Understanding | - |
| MVU (13B) | 37.6 | Understanding Long Videos with Multimodal Language Models | |
| SeViLA (4B) | 22.7 | Self-Chained Image-Language Model for Video Localization and Question Answering | |
| LVNet | 61.1 | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | |
| BIMBA-LLaVA-Qwen2-7B | 71.14 | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | - |
| Random | 20.0 | - | - |
| Vamos (GPT-4) | 48.3 | Vamos: Versatile Action Models for Video Understanding | - |
| VideoChat2_HD_mistral | 55.8 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| VideoLLaMA2 (72B) | 63.9 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | |
| Video-RAG (Based on LLaVA-Video) | 66.7 | Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | |
| mPLUG-Owl (7B) | 31.1 | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | |
| LinVT-Qwen2-VL(7B) | 69.5 | LinVT: Empower Your Image-level Large Language Model to Understand Videos | |
| TraveLER | 53.3 | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | |
| VideoChat2_phi3 | 56.7 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |