Zero-Shot Video Question Answering on MSRVTT-QA
Metrics: Accuracy, Confidence Score
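Both metrics typically come from the GPT-assisted zero-shot evaluation protocol introduced with Video-ChatGPT and adopted by most of the papers below: an LLM judge compares each predicted answer against the ground truth, issuing a correct/incorrect verdict (aggregated into Accuracy, in percent) and a 0-5 match score (averaged into the Confidence Score). The sketch below shows that aggregation only; the `Judgment` type and its fields are illustrative assumptions, not an API from any listed paper.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    correct: bool  # yes/no verdict from the LLM judge (hypothetical structure)
    score: float   # 0-5 match score assigned by the judge

def leaderboard_metrics(judgments: list[Judgment]) -> tuple[float, float]:
    """Aggregate per-question judgments into (accuracy in %, mean confidence score)."""
    n = len(judgments)
    accuracy = 100.0 * sum(j.correct for j in judgments) / n
    confidence = sum(j.score for j in judgments) / n
    return accuracy, confidence

# Example: three judged answers.
sample = [Judgment(True, 4.0), Judgment(True, 5.0), Judgment(False, 1.0)]
acc, conf = leaderboard_metrics(sample)
print(f"Accuracy: {acc:.1f}%  Confidence Score: {conf:.2f}")  # 66.7%, 3.33
```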
Results
Performance of the different models on this benchmark.
| Model Name | Accuracy (%) | Confidence Score | Paper Title |
|---|---|---|---|
| Flash-VStream | 72.4 | 3.4 | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams |
| PLLaVA (34B) | 68.7 | 3.6 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning |
| Elysium | 67.5 | 3.2 | Elysium: Exploring Object-level Perception in Videos via MLLM |
| SlowFast-LLaVA-34B | 67.4 | 3.7 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models |
| Tarsier (34B) | 66.4 | 3.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models |
| TS-LLaVA-34B | 66.2 | 3.6 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models |
| LinVT-Qwen2-VL (7B) | 66.2 | 4.0 | LinVT: Empower Your Image-level Large Language Model to Understand Videos |
| PPLLaVA-7B | 64.3 | 3.5 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance |
| IG-VLM | 63.8 | 3.5 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM |
| ST-LLM | 63.2 | 3.4 | ST-LLM: Large Language Models Are Effective Temporal Learners |
| CAT-7B | 62.1 | 3.5 | CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios |
| VideoGPT+ | 60.6 | 3.6 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding |
| Vista-LLaMA-7B | 60.5 | 3.3 | Vista-LLaMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens |
| MiniGPT4-video-7B | 59.73 | - | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens |
| LLaVA-Mini | 59.5 | 3.6 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token |
| Video-LaVIT | 59.3 | 3.3 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization |
| Video-LLaVA-7B | 59.2 | 3.5 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection |
| LLaMA-VID-13B (2 Token) | 58.9 | 3.3 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| LLaMA-VID-7B (2 Token) | 57.7 | 3.2 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| SUM-shot+Vicuna | 56.8 | - | Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos |
Showing 20 of 30 reported results.