Zeroshot Video Question Answer On Activitynet

Evaluation Metrics

Accuracy
Confidence Score

Evaluation Results

Performance results of each model on this benchmark

| Model Name | Accuracy | Confidence Score | Paper Title | Repository |
|---|---|---|---|---|
| MovieChat | 45.7 | 3.1 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | - |
| BT-Adapter (zero-shot) | 46.1 | 3.2 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | - |
| Tarsier (34B) | 61.6 | 3.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | - |
| VideoChat2 | 49.1 | 3.3 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | - |
| Chat-UniVi | 46.1 | 3.3 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | - |
| LLaMA-VID-13B (2 Token) | 47.5 | 3.3 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | - |
| PLLaVA (34B) | 60.9 | 3.7 | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | - |
| IG-VLM | 58.4 | 3.5 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | - |
| SlowFast-LLaVA-34B | 59.2 | 3.5 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | - |
| LLaVA-Mini | 53.5 | 3.5 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | - |
| FrozenBiLM | 24.7 | - | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | - |
| Video Chat | 26.5 | 2.2 | VideoChat: Chat-Centric Video Understanding | - |
| LLaMA-VID-7B (2 Token) | 47.4 | 3.3 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | - |
| Video-ChatGPT | 35.2 | 2.7 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | - |
| Flash-VStream | 51.9 | 3.4 | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | - |
| Video-LLaVA | 45.3 | 3.3 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | - |
| TS-LLaVA-34B | 58.9 | 3.5 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | - |
| Elysium | 43.4 | 2.9 | Elysium: Exploring Object-level Perception in Videos via MLLM | - |
| PPLLaVA-7B | 60.7 | 3.6 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | - |
| LinVT-Qwen2-VL(7B) | 60.1 | 3.6 | LinVT: Empower Your Image-level Large Language Model to Understand Videos | - |