HyperAI
Video Question Answering on ActivityNet-QA
Evaluation metric: Accuracy

Evaluation results: performance of each model on this benchmark.
| Model Name | Accuracy | Paper Title |
| --- | --- | --- |
| GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 61.2 | Composing Ensembles of Pre-trained Models via Iterative Consensus |
| GPT-2 + CLIP-32 (Zero-Shot) | 58.4 | Composing Ensembles of Pre-trained Models via Iterative Consensus |
| VideoCoCa | 56.1 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| Mirasol3B | 51.13 | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities |
| VAST | 50.4 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| COSA | 49.9 | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| MA-LMM | 49.8 | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
| VideoChat2 | 49.1 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark |
| VALOR | 48.6 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| UMT-L (ViT-L/16) | 47.9 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| LLaMA-VID-13B (2 Token) | 47.5 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| LLaMA-VID-7B (2 Token) | 47.4 | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models |
| Chat-UniVi-13B | 46.4 | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding |
| BT-Adapter (zero-shot) | 46.1 | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| MovieChat | 45.7 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding |
| Video-LLaVA | 45.3 | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection |
| TESTA (ViT-B/16) | 45.0 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding |
| FrozenBiLM+ | 44.8 | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models |
| VindLU | 44.7 | VindLU: A Recipe for Effective Video-and-Language Pretraining |
| Singularity-temporal | 44.1 | Revealing Single Frame Bias for Video-and-Language Learning |