Zero-Shot Video Question Answering on NExT-QA
Evaluation Metric
Accuracy
Evaluation Results
Performance of each model on this benchmark
| Model Name | Accuracy | Paper Title | Repository |
|---|---|---|---|
| VideoChat2 | 61.7 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| VidCtx (7B) | 70.7 | VidCtx: Context-aware Video Question Answering with Image Models | - |
| LongVA(32 frames) | 67.1 | Long Context Transfer from Language to Vision | |
| IG-VLM(LLaVA v1.6) | 70.9 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | |
| MVU (13B) | 55.2 | Understanding Long Videos with Multimodal Language Models | |
| Q-ViD | 66.3 | Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | |
| VideoTree (GPT4) | 73.5 | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | |
| Sevila (4B) | 63.6 | Self-Chained Image-Language Model for Video Localization and Question Answering | |
| TraveLER (GPT-4) | 68.2 | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | |
| ProViQ | 64.6 | Zero-Shot Video Question Answering with Procedural Programs | - |
| Tarsier (34B) | 79.2 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | |
| ViperGPT (GPT-3.5) | 60.0 | ViperGPT: Visual Inference via Python Execution for Reasoning | |
| LLoVi (GPT-4) | 67.7 | A Simple LLM Framework for Long-Range Video Question-Answering | |
| MoReVQA(PaLM-2) | 69.2 | MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | - |
| IG-VLM (GPT-4) | 68.6 | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | |
| DeepStack-L(7B) | 61.0 | DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | - |
| VideoAgent (GPT-4) | 71.3 | VideoAgent: Long-form Video Understanding with Large Language Model as Agent | |
| LVNet(GPT-4o) | 72.9 | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | |
| LLoVi (7B) | 54.3 | A Simple LLM Framework for Long-Range Video Question-Answering | |
| TS-LLaVA-34B | 73.6 | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | |