Video Question Answering on ActivityNet-QA
Evaluation Metric: Accuracy

Evaluation Results
Performance of each model on this benchmark is listed below; a sketch of how the Accuracy metric is typically computed follows the table.
| Model Name | Accuracy | Paper Title |
|---|---|---|
| GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 61.2 | Composing Ensembles of Pre-trained Models via Iterative Consensus |
| VideoCoCa | 56.1 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| Mirasol3B | 51.13 | Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities |
| VAST | 50.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| MA-LMM | 49.8 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
| VALOR | 48.6 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| LLaMA-VID-7B (2 Token) | 47.4 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| Chat-UniVi-13B | 46.4 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |
| BT-Adapter (zero-shot) | 46.1 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| MovieChat | 45.7 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding |
| Video-LLaVA | 45.3 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection |
| TESTA (ViT-B/16) | 45 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding |
| FrozenBiLM+ | 44.8 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
| VindLU | 44.7 | VindLU: A Recipe for Effective Video-and-Language Pretraining |
| LocVLM-Vid-B+ | 38.2 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs |
| LocVLM-Vid-B | 37.4 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs |
| Video-ChatGPT | 35.2 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models |
| E-SA | 31.8 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering |
| E-MN | 27.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering |
| E-VQA | 25.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering |
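The Accuracy column reports the fraction of test questions a model answers correctly. As a rough illustration, the sketch below computes exact-match accuracy between a file of predicted answers and the ground-truth annotations. The file names, JSON layout, and normalization rules are assumptions for illustration only, not the official evaluation script; recent LLM-based entries on this benchmark are often scored with GPT-assisted answer matching rather than strict exact match.

```python
# Minimal sketch of exact-match accuracy for an ActivityNet-QA-style
# submission. File names and JSON fields below are illustrative assumptions.
import json
import string


def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return " ".join(answer.split())


def accuracy(predictions: dict, ground_truth: dict) -> float:
    """Fraction of reference questions whose normalized prediction matches."""
    correct = sum(
        normalize(pred) == normalize(ground_truth[qid])
        for qid, pred in predictions.items()
        if qid in ground_truth
    )
    return correct / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical inputs: {"question_id": "answer"} mappings.
    with open("predictions.json") as f:
        predictions = json.load(f)
    with open("activitynet_qa_test_answers.json") as f:
        ground_truth = json.load(f)
    print(f"Accuracy: {100 * accuracy(predictions, ground_truth):.1f}")
```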