Video Question Answering on AGQA 2.0 Balanced
Evaluation Metric
Average Accuracy
Evaluation Results
Performance of each model on this benchmark
| Model | Average Accuracy | Paper Title | Repository |
|---|---|---|---|
| MIST - AIO | 50.96 | MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | |
| GF (uns) - S3D | 53.33 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | |
| MIST - CLIP | 54.39 | MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | |
| SHG-VQA (trained from scratch) | 49.2 | Learning Situation Hyper-Graphs for Video Question Answering | |
| AIO - ViT | 48.59 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | |
| MMTF | 44.36 | MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering | |
| SViTT | 52.7 | SViTT: Temporal Learning of Sparse Video-Text Transformers | |
| GF (sup) - Faster RCNN | 55.08 | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | |