Video Question Answering on ActivityNet-QA
Metrics
Accuracy
Results
Performance results of various models on this benchmark
| Model Name | Accuracy (%) | Paper Title | Repository |
|---|---|---|---|
| LocVLM-Vid-B+ | 38.2 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | |
| E-MN | 27.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | |
| VindLU | 44.7 | VindLU: A Recipe for Effective Video-and-Language Pretraining | |
| Video-LLaVA | 45.3 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | |
| E-VQA | 25.1 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | |
| VALOR | 48.6 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| E-SA | 31.8 | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | |
| BT-Adapter (zero-shot) | 46.1 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | |
| Mirasol3B | 51.13 | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | - |
| Chat-UniVi-13B | 46.4 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | |
| MA-LMM | 49.8 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | |
| FrozenBiLM+ | 44.8 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | |
| MovieChat | 45.7 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | |
| Video-ChatGPT | 35.2 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | |
| LLaMA-VID-7B (2 Token) | 47.4 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | |
| VAST | 50.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| TESTA (ViT-B/16) | 45 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | |
| GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 61.2 | Composing Ensembles of Pre-trained Models via Iterative Consensus | - |
| LocVLM-Vid-B | 37.4 | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | |
| VideoCoCa | 56.1 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
Showing 20 of 36 leaderboard entries.
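The leaderboard reports a single metric, Accuracy, without specifying the evaluation protocol. Below is a minimal Python sketch of the simplest interpretation: exact-match accuracy between predicted and ground-truth answers, reported as a percentage. The function name and the case-insensitive exact-match rule are illustrative assumptions; individual papers in the table may use different protocols (for example, LLM-assisted answer matching) on ActivityNet-QA.

```python
# Minimal sketch of top-1 accuracy on a QA benchmark (assumption: a prediction
# counts as correct only on a case-insensitive exact match with the reference).
def activitynet_qa_accuracy(predictions, ground_truths):
    """Return exact-match accuracy as a fraction in [0, 1]."""
    assert len(predictions) == len(ground_truths)
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)

# Example: 2 of 3 answers match, so accuracy is 66.7%.
preds = ["riding a bike", "yes", "two"]
golds = ["riding a bike", "no", "two"]
print(round(100 * activitynet_qa_accuracy(preds, golds), 1))  # 66.7
```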