Zero-Shot Video Question Answering on ActivityNet

Metrics

Accuracy

Confidence Score

Results

Performance results of various models on this benchmark

Model Name	Accuracy	Confidence Score	Paper Title
MovieChat	45.7	3.1	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
BT-Adapter (zero-shot)	46.1	3.2	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Tarsier (34B)	61.6	3.7	Tarsier: Recipes for Training and Evaluating Large Video Description Models
VideoChat2	49.1	3.3	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Chat-UniVi	46.1	3.3	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
LLaMA-VID-13B (2 Token)	47.5	3.3	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
PLLaVA (34B)	60.9	3.7	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
IG-VLM	58.4	3.5	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
SlowFast-LLaVA-34B	59.2	3.5	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
LLaVA-Mini	53.5	3.5	LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
FrozenBiLM	24.7	-	Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
VideoChat	26.5	2.2	VideoChat: Chat-Centric Video Understanding
LLaMA-VID-7B (2 Token)	47.4	3.3	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Video-ChatGPT	35.2	2.7	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Flash-VStream	51.9	3.4	Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Video-LLaVA	45.3	3.3	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
TS-LLaVA-34B	58.9	3.5	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
Elysium	43.4	2.9	Elysium: Exploring Object-level Perception in Videos via MLLM
PPLLaVA-7B	60.7	3.6	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
LinVT-Qwen2-VL(7B)	60.1	3.6	LinVT: Empower Your Image-level Large Language Model to Understand Videos
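
The table is not sorted by either metric, so a small script can help when comparing entries. The following is a minimal sketch, assuming the rows are available as tab-separated text with the same column headers as above; the helper names `parse_leaderboard` and `rank_by_accuracy` are hypothetical and only a few rows are copied in for illustration. A "-" in the Confidence Score column is treated as a missing value.

```python
import csv
import io

# A few rows copied from the table above (tab-separated).
# "-" marks a missing Confidence Score.
ROWS = (
    "Model Name\tAccuracy\tConfidence Score\n"
    "Tarsier (34B)\t61.6\t3.7\n"
    "PLLaVA (34B)\t60.9\t3.7\n"
    "FrozenBiLM\t24.7\t-\n"
    "Video-ChatGPT\t35.2\t2.7\n"
)


def parse_leaderboard(text):
    """Parse tab-separated leaderboard text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    entries = []
    for row in reader:
        raw_score = row["Confidence Score"]
        entries.append({
            "model": row["Model Name"],
            "accuracy": float(row["Accuracy"]),
            # Keep missing scores as None instead of guessing a value.
            "score": None if raw_score == "-" else float(raw_score),
        })
    return entries


def rank_by_accuracy(entries):
    """Return entries sorted from highest to lowest accuracy."""
    return sorted(entries, key=lambda e: e["accuracy"], reverse=True)


ranked = rank_by_accuracy(parse_leaderboard(ROWS))
for entry in ranked:
    score = "n/a" if entry["score"] is None else entry["score"]
    print(f'{entry["model"]}: accuracy={entry["accuracy"]}, score={score}')
```

Sorting on accuracy alone is a design choice for this sketch; ties or missing confidence scores would need an explicit secondary rule if both metrics matter.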
