Video Question Answering on NExT-QA
Metrics: Accuracy

Results

Performance results of various models on this benchmark (a short sketch of how the accuracy score is computed follows the table).
| Model name | Accuracy | Paper Title | Repository |
|---|---|---|---|
| LLaVA-Video | 83.2 | Video Instruction Tuning With Synthetic Data | - |
| LLaVA-NeXT-Interleave(14B) | 79.1 | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | |
| ATM | 58.3 | ATM: Action Temporality Modeling for Video Question Answering | - |
| VideoChat2_HD_mistral | 79.5 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| ViperGPT(0-shot) | 60.0 | ViperGPT: Visual Inference via Python Execution for Reasoning | |
| LongVILA(7B) | 80.7 | LongVILA: Scaling Long-Context Visual Language Models for Long Videos | |
| VGT(PT) | 56.9 | Video Graph Transformer for Video Question Answering | |
| TCR | 73.5 | Text-Conditioned Resampler For Long Form Video Understanding | - |
| ViLA (3B) | 75.6 | ViLA: Efficient Video-Language Alignment for Video Question Answering | |
| HiTeA | 63.1 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| HQGA | 51.4 | Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | |
| RTQ | 63.2 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| GF | 58.83 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | |
| LSTP | 72.1 | Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | |
| LLaMA-VQA (33B) | 75.5 | Large Language Models are Temporal and Causal Reasoners for Video Question Answering | |
| CoVGT(PT) | 60.7 | Contrastive Video Question Answering via Video Graph Transformer | |
| SeViT | 60.6 | Semi-Parametric Video-Grounded Text Generation | - |
| VideoChat2_mistral | 78.6 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| Vamos | 77.3 | Vamos: Versatile Action Models for Video Understanding | |
| LinVT-Qwen2-VL (7B) | 85.5 | LinVT: Empower Your Image-level Large Language Model to Understand Videos | |
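The only metric reported above is Accuracy, i.e. the fraction of multiple-choice questions answered correctly. The snippet below is a minimal sketch of how such a score is typically computed; the `predictions.json` layout (fields `question_id`, `prediction`, `answer`) is a hypothetical example format, not an official NExT-QA artifact.

```python
# Minimal sketch: Accuracy on a multiple-choice VideoQA benchmark such as NExT-QA.
# Assumes a hypothetical predictions.json containing a list of records with
# "question_id", "prediction" (chosen option index), and "answer" (ground-truth index).
import json

def multiple_choice_accuracy(records):
    """Fraction of questions where the predicted option matches the ground truth."""
    if not records:
        return 0.0
    correct = sum(1 for r in records if r["prediction"] == r["answer"])
    return correct / len(records)

if __name__ == "__main__":
    with open("predictions.json") as f:  # hypothetical path
        records = json.load(f)
    print(f"Accuracy: {100.0 * multiple_choice_accuracy(records):.1f}")
```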