HyperAI超神経

Zero-Shot Video Question Answering on NExT-QA

Evaluation Metric

Accuracy
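The page does not spell out how Accuracy is computed. A minimal sketch, assuming the usual multiple-choice convention (the percentage of questions for which the predicted option matches the ground-truth option), might look like the following. The function name `accuracy` and the toy data are hypothetical and are not taken from any listed paper or repository.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the percentage of questions whose predicted option index
    equals the ground-truth option index (assumed definition)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count exact matches between predicted and ground-truth option indices.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: option indices for four multiple-choice questions.
predicted = [0, 3, 2, 1]
ground_truth = [0, 3, 4, 1]
print(f"{accuracy(predicted, ground_truth):.1f}")  # 75.0
```

Under this assumption, a reported score of 70.7 would mean that roughly 707 of every 1,000 questions were answered correctly.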

Results

Performance results of each model on this benchmark.

| Model | Accuracy (%) | Paper Title | Repository |
| --- | --- | --- | --- |
| VideoChat2 | 61.7 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| VidCtx (7B) | 70.7 | VidCtx: Context-aware Video Question Answering with Image Models | - |
| LongVA (32 frames) | 67.1 | Long Context Transfer from Language to Vision | |
| IG-VLM (LLaVA v1.6) | 70.9 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | |
| MVU (13B) | 55.2 | Understanding Long Videos with Multimodal Language Models | |
| Q-ViD | 66.3 | Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | |
| VideoTree (GPT-4) | 73.5 | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | |
| SeViLA (4B) | 63.6 | Self-Chained Image-Language Model for Video Localization and Question Answering | |
| TraveLER (GPT-4) | 68.2 | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | |
| ProViQ | 64.6 | Zero-Shot Video Question Answering with Procedural Programs | - |
| Tarsier (34B) | 79.2 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | |
| ViperGPT (GPT-3.5) | 60.0 | ViperGPT: Visual Inference via Python Execution for Reasoning | |
| LLoVi (GPT-4) | 67.7 | A Simple LLM Framework for Long-Range Video Question-Answering | |
| MoReVQA (PaLM-2) | 69.2 | MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | - |
| IG-VLM (GPT-4) | 68.6 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | |
| DeepStack-L (7B) | 61.0 | DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | - |
| VideoAgent (GPT-4) | 71.3 | VideoAgent: Long-form Video Understanding with Large Language Model as Agent | |
| LVNet (GPT-4o) | 72.9 | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | |
| LLoVi (7B) | 54.3 | A Simple LLM Framework for Long-Range Video Question-Answering | |
| TS-LLaVA-34B | 73.6 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | |