HyperAI

Video Question Answering on NExT-QA

Metrics

Accuracy

Results

Performance results of various models on this benchmark

| Model Name | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| LLaVA-Video | 83.2 | Video Instruction Tuning With Synthetic Data | - |
| LLaVA-NeXT-Interleave(14B) | 79.1 | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | |
| ATM | 58.3 | ATM: Action Temporality Modeling for Video Question Answering | - |
| VideoChat2_HD_mistral | 79.5 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| ViperGPT(0-shot) | 60.0 | ViperGPT: Visual Inference via Python Execution for Reasoning | |
| LongVILA(7B) | 80.7 | LongVILA: Scaling Long-Context Visual Language Models for Long Videos | |
| VGT(PT) | 56.9 | Video Graph Transformer for Video Question Answering | |
| TCR | 73.5 | Text-Conditioned Resampler For Long Form Video Understanding | - |
| ViLA (3B) | 75.6 | ViLA: Efficient Video-Language Alignment for Video Question Answering | |
| HiTeA | 63.1 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| HQGA | 51.4 | Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | |
| RTQ | 63.2 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| GF | 58.83 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | |
| LSTP | 72.1 | Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | |
| LLaMA-VQA (33B) | 75.5 | Large Language Models are Temporal and Causal Reasoners for Video Question Answering | |
| CoVGT(PT) | 60.7 | Contrastive Video Question Answering via Video Graph Transformer | |
| SeViT | 60.6 | Semi-Parametric Video-Grounded Text Generation | - |
| VideoChat2_mistral | 78.6 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| Vamos | 77.3 | Vamos: Versatile Action Models for Video Understanding | |
| LinVT-Qwen2-VL (7B) | 85.5 | LinVT: Empower Your Image-level Large Language Model to Understand Videos | |