HyperAI

Zero-Shot Video Question Answering on ActivityNet

Metrics

Accuracy
Confidence Score

Results

Performance results of various models on this benchmark

| Model name | Accuracy | Confidence Score | Paper Title | Repository |
|---|---|---|---|---|
| MovieChat | 45.7 | 3.1 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | |
| BT-Adapter (zero-shot) | 46.1 | 3.2 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | |
| Tarsier (34B) | 61.6 | 3.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | |
| VideoChat2 | 49.1 | 3.3 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| Chat-UniVi | 46.1 | 3.3 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | |
| LLaMA-VID-13B (2 Token) | 47.5 | 3.3 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | |
| PLLaVA (34B) | 60.9 | 3.7 | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | |
| IG-VLM | 58.4 | 3.5 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | |
| SlowFast-LLaVA-34B | 59.2 | 3.5 | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | |
| LLaVA-Mini | 53.5 | 3.5 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | - |
| FrozenBiLM | 24.7 | - | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | |
| Video Chat | 26.5 | 2.2 | VideoChat: Chat-Centric Video Understanding | |
| LLaMA-VID-7B (2 Token) | 47.4 | 3.3 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | |
| Video-ChatGPT | 35.2 | 2.7 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | |
| Flash-VStream | 51.9 | 3.4 | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | |
| Video-LLaVA | 45.3 | 3.3 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | |
| TS-LLaVA-34B | 58.9 | 3.5 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | |
| Elysium | 43.4 | 2.9 | Elysium: Exploring Object-level Perception in Videos via MLLM | |
| PPLLaVA-7B | 60.7 | 3.6 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | - |
| LinVT-Qwen2-VL(7B) | 60.1 | 3.6 | LinVT: Empower Your Image-level Large Language Model to Understand Videos | |