HyperAI

Video Question Answering On Activitynet Qa

Metrics

Accuracy
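The accuracy values below are percentages of questions answered correctly. As a rough illustration (not the benchmark's official scoring script), a minimal exact-match accuracy over lowercased, whitespace-stripped answers could be sketched as follows; the function name and normalization are assumptions:

```python
def accuracy(predictions, references):
    """Fraction of predicted answers that exactly match the ground truth,
    after lowercasing and stripping whitespace (assumed normalization)."""
    assert len(predictions) == len(references)
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)

preds = ["yes", "two", "a dog"]
refs = ["yes", "2", "a dog"]
print(round(accuracy(preds, refs) * 100, 1))  # 2 of 3 match -> 66.7
```

Published numbers may use more elaborate answer normalization or GPT-assisted matching, so treat this only as the basic idea behind the metric.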

Results

Performance results of various models on this benchmark

| Model | Accuracy | Paper Title | Repository |
|---|---|---|---|
| LocVLM-Vid-B+ | 38.2 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | - |
| E-MN | 27.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | - |
| VindLU | 44.7 | VindLU: A Recipe for Effective Video-and-Language Pretraining | - |
| Video-LLaVA | 45.3 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | - |
| E-VQA | 25.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | - |
| VALOR | 48.6 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | - |
| E-SA | 31.8 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | - |
| BT-Adapter (zero-shot) | 46.1 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | - |
| Mirasol3B | 51.13 | Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities | - |
| Chat-UniVi-13B | 46.4 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | - |
| MA-LMM | 49.8 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | - |
| FrozenBiLM+ | 44.8 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | - |
| MovieChat | 45.7 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | - |
| Video-ChatGPT | 35.2 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | - |
| LLaMA-VID-7B (2 Token) | 47.4 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | - |
| VAST | 50.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | - |
| TESTA (ViT-B/16) | 45 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | - |
| GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 61.2 | Composing Ensembles of Pre-trained Models via Iterative Consensus | - |
| LocVLM-Vid-B | 37.4 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | - |
| VideoCoCa | 56.1 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |