HyperAI超神经

Zero-Shot Video Question Answering on MSVD-QA

Evaluation Metrics

Accuracy
Confidence Score

Evaluation Results

Performance of each model on this benchmark

| Model | Accuracy | Confidence Score | Paper Title |
|---|---|---|---|
| BT-Adapter (zero-shot) | 67.0 | 3.6 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| VideoGPT+ | 72.4 | 3.6 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding |
| SlowFast-LLaVA-34B | 79.9 | 4.1 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models |
| Video-ChatGPT-7B | 64.9 | 3.3 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models |
| PPLLaVA-7B | 77.1 | 4.0 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance |
| Elysium | 75.8 | 3.7 | Elysium: Exploring Object-level Perception in Videos via MLLM |
| Video Chat-7B | 56.3 | 2.8 | VideoChat: Chat-Centric Video Understanding |
| Flash-VStream | 80.3 | 3.9 | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams |
| VILA1.5-40B | 80.1 | - | VILA: On Pre-training for Visual Language Models |
| LLaVA-Mini | 70.9 | 4.0 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token |
| TS-LLaVA-34B | 79.4 | 4.1 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models |
| IG-VLM-34B | 79.6 | 4.1 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM |
| LLaMA-VID-7B (2 Token) | 69.7 | 3.7 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| VideoChat2 | 70.0 | 3.9 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| Video LLaMA-7B | 51.6 | 2.5 | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding |
| LinVT-Qwen2-VL (7B) | 80.2 | 4.4 | LinVT: Empower Your Image-level Large Language Model to Understand Videos |
| PLLaVA (34B) | 79.9 | 4.2 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning |
| LLaMA Adapter-7B | 54.9 | 3.1 | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model |
| FrozenBiLM | 33.8 | - | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models |