Video Question Answering on ActivityNet-QA
Evaluation metric
Accuracy
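
The page names the metric but does not state how it is scored. As a minimal sketch, accuracy can be read as the percentage of questions whose predicted answer matches the reference answer. The exact-match rule and the `normalize` helper below are assumptions for illustration only; individual entries may have been scored under a different protocol.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, illustrative normalization)."""
    return " ".join(answer.lower().split())


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical answers, not taken from the dataset.
score = accuracy(["Yes", "two", "red"], ["yes", "2", "red"])
print(f"{score:.1f}")
```

Under this rule "two" and "2" count as a mismatch, which shows how strict exact matching is for open-ended answers.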
Evaluation results
Performance of each model on this benchmark
Comparison table
Model name | Accuracy |
---|---|
learning-to-localize-objects-improves-spatial | 38.2 |
activitynet-qa-a-dataset-for-understanding | 27.1 |
vindlu-a-recipe-for-effective-video-and | 44.7 |
video-llava-learning-united-visual-1 | 45.3 |
activitynet-qa-a-dataset-for-understanding | 25.1 |
valor-vision-audio-language-omni-perception | 48.6 |
activitynet-qa-a-dataset-for-understanding | 31.8 |
one-for-all-video-conversation-is-feasible | 46.1 |
mirasol3b-a-multimodal-autoregressive-model | 51.13 |
chat-univi-unified-visual-representation | 46.4 |
ma-lmm-memory-augmented-large-multimodal | 49.8 |
open-vocabulary-video-question-answering-a | 44.8 |
moviechat-from-dense-token-to-sparse-memory | 45.7 |
video-chatgpt-towards-detailed-video | 35.2 |
llama-vid-an-image-is-worth-2-tokens-in-large | 47.4 |
vast-a-vision-audio-subtitle-text-omni-1 | 50.4 |
testa-temporal-spatial-token-aggregation-for | 45 |
composing-ensembles-of-pre-trained-models-via | 61.2 |
learning-to-localize-objects-improves-spatial | 37.4 |
video-text-modeling-with-zero-shot-transfer | 56.1 |
towards-fast-adaptation-of-pretrained | 41.4 |
revealing-single-frame-bias-for-video-and | 44.1 |
zero-shot-video-question-answering-via-frozen | 43.2 |
open-vocabulary-video-question-answering-a | 40.0 |
composing-ensembles-of-pre-trained-models-via | 58.4 |
unmasked-teacher-towards-training-efficient | 47.9 |
revealing-single-frame-bias-for-video-and | 43.1 |
llama-adapter-v2-parameter-efficient-visual | 34.2 |
llama-vid-an-image-is-worth-2-tokens-in-large | 47.5 |
just-ask-learning-to-answer-questions-from | 38.9 |
just-ask-learning-to-answer-questions-from | 12.2 |
cosa-concatenated-sample-pretrained-vision | 49.9 |
zero-shot-video-question-answering-via-frozen | 25.9 |
videochat-chat-centric-video-understanding | 26.5 |
mvbench-a-comprehensive-multi-modal-video | 49.1 |
open-vocabulary-video-question-answering-a | 39.7 |
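
The rows above are not sorted by score, and several model identifiers appear more than once with different results. A minimal sketch for parsing rows in this pipe-delimited layout and ranking them by accuracy follows. The function names `parse_rows` and `rank` are hypothetical, and the excerpt holds only four rows copied from the table.

```python
# Four rows copied from the table above, in the same "name | score |" layout.
TABLE_EXCERPT = """\
mirasol3b-a-multimodal-autoregressive-model | 51.13 |
composing-ensembles-of-pre-trained-models-via | 61.2 |
video-text-modeling-with-zero-shot-transfer | 56.1 |
just-ask-learning-to-answer-questions-from | 12.2 |
"""


def parse_rows(text: str) -> list[tuple[str, float]]:
    """Extract (model identifier, accuracy) pairs, skipping non-data lines."""
    rows = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) < 2 or not cells[0]:
            continue
        try:
            score = float(cells[1])
        except ValueError:
            continue  # header or separator line, not a data row
        rows.append((cells[0], score))
    return rows


def rank(rows: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Sort rows from highest to lowest accuracy."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = rank(parse_rows(TABLE_EXCERPT))
for name, score in ranked:
    print(f"{score:6.2f}  {name}")
```

Rows are kept as a list of pairs rather than a dictionary keyed by identifier, so repeated identifiers with different scores are not overwritten.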