HyperAI

Zero-Shot Video Question Answering on MSRVTT-QA

Metrics

Accuracy
Confidence Score

Results

Performance results of various models on this benchmark

| Model name | Accuracy | Confidence Score | Paper Title | Repository |
|---|---|---|---|---|
| Chat-UniVi-7B | 55.0 | 3.1 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | - |
| TS-LLaVA-34B | 66.2 | 3.6 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | - |
| BT-Adapter (zero-shot) | 51.2 | 2.9 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | - |
| Video-LLaVA-7B | 59.2 | 3.5 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | - |
| LLaMA-VID-7B (2 Token) | 57.7 | 3.2 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | - |
| IG-VLM | 63.8 | 3.5 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | - |
| Omni-VideoAssistant | 55.3 | 3.3 | OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation | - |
| Elysium | 67.5 | 3.2 | Elysium: Exploring Object-level Perception in Videos via MLLM | - |
| MovieChat | 52.7 | 2.6 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | - |
| SUM-shot+Vicuna | 56.8 | - | Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | - |
| CAT-7B | 62.1 | 3.5 | CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | - |
| VideoChat2 | 54.1 | 3.3 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | - |
| Vista-LLaMA-7B | 60.5 | 3.3 | Vista-LLaMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens | - |
| Tarsier (34B) | 66.4 | 3.7 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | - |
| Video-LaVIT | 59.3 | 3.3 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | - |
| Video Chat-7B | 45.0 | 2.5 | VideoChat: Chat-Centric Video Understanding | - |
| VideoGPT+ | 60.6 | 3.6 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | - |
| Video-ChatGPT-7B | 49.3 | 2.8 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | - |
| PLLaVA (34B) | 68.7 | 3.6 | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | - |
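
Because the listing reports two metrics, the ordering of models can differ depending on which one is used. The following is a minimal sketch of ranking by either column, assuming a handful of rows copied by hand from the results above; the `rank` helper and the tuple layout are illustrative choices, not part of the benchmark or of any HyperAI tooling.

```python
# A few rows copied from the results above: (model, accuracy, confidence score).
# None marks a score the listing leaves blank ("-").
rows = [
    ("PLLaVA (34B)", 68.7, 3.6),
    ("Elysium", 67.5, 3.2),
    ("Tarsier (34B)", 66.4, 3.7),
    ("SUM-shot+Vicuna", 56.8, None),
    ("Video Chat-7B", 45.0, 2.5),
]


def rank(rows, key_index):
    """Return the rows that have a value in the given column, best first."""
    scored = [r for r in rows if r[key_index] is not None]
    return sorted(scored, key=lambda r: r[key_index], reverse=True)


by_accuracy = rank(rows, 1)     # sorted on the Accuracy column
by_confidence = rank(rows, 2)   # sorted on the Confidence Score column

for name, acc, conf in by_accuracy:
    conf_text = "-" if conf is None else f"{conf:.1f}"
    print(f"{name:<20} {acc:5.1f} {conf_text:>4}")
```

In this subset, the model with the highest accuracy is not the one with the highest confidence score, and rows with a missing score are simply left out of the ranking for that metric.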