Zero-Shot Video Question Answering on ActivityNet

Metrics

Accuracy

Confidence Score

Results

Performance of various models on this benchmark

Model	Accuracy	Confidence Score	Paper Title
Tarsier (34B)	61.6	3.7	Tarsier: Recipes for Training and Evaluating Large Video Description Models
PLLaVA (34B)	60.9	3.7	PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
PPLLaVA-7B	60.7	3.6	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
LinVT-Qwen2-VL(7B)	60.1	3.6	LinVT: Empower Your Image-level Large Language Model to Understand Videos
SlowFast-LLaVA-34B	59.2	3.5	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
TS-LLaVA-34B	58.9	3.5	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
IG-VLM	58.4	3.5	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
LLaVA-Mini	53.5	3.5	LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Flash-VStream	51.9	3.4	Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
ST-LLM	50.9	3.3	ST-LLM: Large Language Models Are Effective Temporal Learners
VideoGPT+	50.6	3.6	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
CAT-7B	50.2	3.5	CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
Video-LaVIT	50.1	3.3	Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
VideoChat2	49.1	3.3	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
LLaMA-VID-13B (2 Token)	47.5	3.3	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
LLaMA-VID-7B (2 Token)	47.4	3.3	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Chat-UniVi-13B	46.4	3.6	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
MiniGPT4-video-7B	46.3	-	MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
BT-Adapter (zero-shot)	46.1	3.2	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Chat-UniVi	46.1	3.3	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
