HyperAI

Video Question Answering on ActivityNet-QA

Metrics

Accuracy

Results

Performance results of various models on this benchmark

Comparison table
Model name | Accuracy
learning-to-localize-objects-improves-spatial | 38.2
activitynet-qa-a-dataset-for-understanding | 27.1
vindlu-a-recipe-for-effective-video-and | 44.7
video-llava-learning-united-visual-1 | 45.3
activitynet-qa-a-dataset-for-understanding | 25.1
valor-vision-audio-language-omni-perception | 48.6
activitynet-qa-a-dataset-for-understanding | 31.8
one-for-all-video-conversation-is-feasible | 46.1
mirasol3b-a-multimodal-autoregressive-model | 51.13
chat-univi-unified-visual-representation | 46.4
ma-lmm-memory-augmented-large-multimodal | 49.8
open-vocabulary-video-question-answering-a | 44.8
moviechat-from-dense-token-to-sparse-memory | 45.7
video-chatgpt-towards-detailed-video | 35.2
llama-vid-an-image-is-worth-2-tokens-in-large | 47.4
vast-a-vision-audio-subtitle-text-omni-1 | 50.4
testa-temporal-spatial-token-aggregation-for | 45
composing-ensembles-of-pre-trained-models-via | 61.2
learning-to-localize-objects-improves-spatial | 37.4
video-text-modeling-with-zero-shot-transfer | 56.1
towards-fast-adaptation-of-pretrained | 41.4
revealing-single-frame-bias-for-video-and | 44.1
zero-shot-video-question-answering-via-frozen | 43.2
open-vocabulary-video-question-answering-a | 40.0
composing-ensembles-of-pre-trained-models-via | 58.4
unmasked-teacher-towards-training-efficient | 47.9
revealing-single-frame-bias-for-video-and | 43.1
llama-adapter-v2-parameter-efficient-visual | 34.2
llama-vid-an-image-is-worth-2-tokens-in-large | 47.5
just-ask-learning-to-answer-questions-from | 38.9
just-ask-learning-to-answer-questions-from | 12.2
cosa-concatenated-sample-pretrained-vision | 49.9
zero-shot-video-question-answering-via-frozen | 25.9
videochat-chat-centric-video-understanding | 26.5
mvbench-a-comprehensive-multi-modal-video | 49.1
open-vocabulary-video-question-answering-a | 39.7