Multimodal ArXiv Scientific Understanding Dataset
Multimodal ArXiv was released in 2024 by the University of Hong Kong and Peking University. The accompanying paper, "Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models", was accepted at ACL 2024.
The dataset consists of two parts, ArXivCap and ArXivQA, designed to improve the scientific understanding of large vision-language models (LVLMs).
ArXivCap is a figure-caption dataset containing 6.4 million images and 3.9 million captions drawn from 572K ArXiv papers spanning a wide range of scientific fields.
Building on ArXivCap, the research team introduced ArXivQA, a question-answering dataset generated by prompting GPT-4V with scientific figures. Training on ArXivQA substantially improves the mathematical reasoning of open-source LVLMs, yielding a 10.4% absolute accuracy gain on a multimodal mathematical reasoning benchmark.
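Below is a minimal sketch of how the two subsets might be loaded with the Hugging Face `datasets` library. The repository IDs ("MMInstruction/ArxivCap" and "MMInstruction/ArxivQA") and the field names shown in the comments are assumptions and may differ from the official release; streaming is used here only to avoid downloading the full image collection up front.

```python
# Sketch: loading ArXivCap and ArXivQA via the Hugging Face `datasets` library.
# The repository IDs below are assumed and may not match the official release.
from datasets import load_dataset

# Figure-caption pairs (ArXivCap); streaming avoids downloading ~6.4M images at once.
arxivcap = load_dataset("MMInstruction/ArxivCap", split="train", streaming=True)
cap_sample = next(iter(arxivcap))
print(cap_sample.keys())  # e.g. paper metadata plus a list of figure/caption pairs

# GPT-4V-generated question-answer pairs (ArXivQA).
arxivqa = load_dataset("MMInstruction/ArxivQA", split="train", streaming=True)
qa_sample = next(iter(arxivqa))
print(qa_sample.keys())  # e.g. question, answer options, label, and the associated figure
```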