HyperAI초신경
Video Question Answering on MVBench
Evaluation metric: Avg.
Evaluation results
Performance of each model on this benchmark
| Model | Avg. | Paper Title | Repository |
| --- | --- | --- | --- |
| ST-LLM | 54.9 | ST-LLM: Large Language Models Are Effective Temporal Learners | |
| Tarsier (34B) | 67.6 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | |
| PPLLaVA (7B) | 59.2 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | - |
| MiniGPT4 | 18.8 | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | |
| VideoChat | 35.5 | VideoChat: Chat-Centric Video Understanding | |
| Oryx (34B) | 64.7 | Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | |
| InstructBLIP | 32.5 | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | |
| LongVU (7B) | 66.9 | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | |
| VideoLLaMA | 34.1 | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | |
| LinVT-Qwen2-VL (7B) | 69.3 | LinVT: Empower Your Image-level Large Language Model to Understand Videos | |
| HawkEye | 47.55 | HawkEye: Training Video-Text LLMs for Grounding Text in Videos | |
| Video-ChatGPT | 32.7 | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | |
| SPHINX-Plus | 39.7 | SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | |
| VideoLLaMA2 (72B) | 62.0 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | |
| VideoChat2 | 51.9 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | |
| mPLUG-Owl3 (7B) | 59.5 | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | |
| PLLaVA | 58.1 | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | |
| InternVideo2 | 67.2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| VideoGPT+ | 58.7 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | |
| LLaVA | 36.0 | Visual Instruction Tuning | |