HyperAI
Video Question Answering on NExT-QA
Evaluation Metric
Accuracy
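As an illustration of the metric, the sketch below computes accuracy as the percentage of questions whose predicted answer matches the reference answer. It is not taken from the benchmark's evaluation code: the function name `accuracy_percent` and the use of integer option indices for answers are assumptions made for this example.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the percentage of predictions that equal the reference answers.

    Assumption: each prediction and answer is the index of the chosen
    option for one multiple-choice question.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count the questions where the predicted option matches the reference.
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return 100.0 * correct / len(answers)


# Hypothetical example: 3 of 4 questions answered correctly.
score = accuracy_percent([0, 1, 2, 3], [0, 1, 2, 4])
```

Scores in the results table are on this 0 to 100 scale.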
Results
Performance of each model on this benchmark
| Model | Accuracy | Paper Title |
|---|---|---|
| LinVT-Qwen2-VL (7B) | 85.5 | LinVT: Empower Your Image-level Large Language Model to Understand Videos |
| InternVL-2.5(8B) | 85.5 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling |
| VideoLLaMA3(7B) | 84.5 | VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding |
| BIMBA-LLaVA-Qwen2-7B | 83.73 | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering |
| LLaVA-Video | 83.2 | Video Instruction Tuning With Synthetic Data |
| NVILA(8B) | 82.2 | NVILA: Efficient Frontier Visual Language Models |
| Oryx-1.5(7B) | 81.8 | Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution |
| Qwen2-VL(7B) | 81.2 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution |
| LongVILA(7B) | 80.7 | LongVILA: Scaling Long-Context Visual Language Models for Long Videos |
| LLaVA-OV(72B) | 80.2 | LLaVA-OneVision: Easy Visual Task Transfer |
| VideoChat2_HD_mistral | 79.5 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| LLaVA-OV(7B) | 79.4 | LLaVA-OneVision: Easy Visual Task Transfer |
| LLaVA-NeXT-Interleave(14B) | 79.1 | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models |
| VideoChat2_mistral | 78.6 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| mPLUG-Owl3(8B) | 78.6 | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models |
| LLaVA-NeXT-Interleave(7B) | 78.2 | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models |
| LLaVA-NeXT-Interleave(DPO) | 77.9 | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models |
| Vamos | 77.3 | Vamos: Versatile Action Models for Video Understanding |
| ViLA (3B) | 75.6 | ViLA: Efficient Video-Language Alignment for Video Question Answering |
| VideoLLaMA2.1(7B) | 75.6 | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs |
Showing 20 of 44 entries.