Video Question Answering on ActivityNet-QA

Metrics

Accuracy

Results

Performance results of various models on this benchmark

Model Name	Accuracy (%)	Paper Title	Repository
LocVLM-Vid-B+	38.2	Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs	-
E-MN	27.1	ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering	-
VindLU	44.7	VindLU: A Recipe for Effective Video-and-Language Pretraining	-
Video-LLaVA	45.3	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	-
E-VQA	25.1	ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering	-
VALOR	48.6	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	-
E-SA	31.8	ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering	-
BT-Adapter (zero-shot)	46.1	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	-
Mirasol3B	51.13	Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities	-
Chat-UniVi-13B	46.4	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	-
MA-LMM	49.8	MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	-
FrozenBiLM+	44.8	Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models	-
MovieChat	45.7	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	-
Video-ChatGPT	35.2	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	-
LLaMA-VID-7B (2 Token)	47.4	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	-
VAST	50.4	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	-
TESTA (ViT-B/16)	45	TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding	-
GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot)	61.2	Composing Ensembles of Pre-trained Models via Iterative Consensus	-
LocVLM-Vid-B	37.4	Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs	-
VideoCoCa	56.1	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	-
