
Multimodal ArXiv Scientific Understanding Dataset


Multimodal ArXiv was released in 2024 by the University of Hong Kong and Peking University. It is introduced in the paper "Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models", which was accepted at ACL 2024.

The dataset comprises two components, ArXivCap and ArXivQA, designed to enhance the scientific understanding of large vision-language models (LVLMs).

ArXivCap is a figure-caption dataset containing 6.4 million images and 3.9 million captions drawn from 572K arXiv papers spanning a wide range of scientific fields.
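
As an illustration, the sketch below streams a few ArXivCap records with the Hugging Face `datasets` library and prints their captions. The repository ID ("MMInstruction/ArxivCap") and the field names ("caption_images", "caption") are assumptions made for this example; consult the official dataset card for the actual schema.

```python
# Minimal sketch: iterate over ArXivCap figure-caption pairs in streaming mode.
# Repository ID and field names are assumptions, not confirmed by the source.
from datasets import load_dataset

# Stream to avoid downloading the full 6.4M-image corpus up front.
arxivcap = load_dataset("MMInstruction/ArxivCap", split="train", streaming=True)

for paper in arxivcap.take(3):
    # Each record is assumed to bundle one paper's figures with their captions.
    for pair in paper.get("caption_images", []):
        print(pair.get("caption", "")[:120])
```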

Building on ArXivCap, the research team introduced ArXivQA, a question-answering dataset generated by prompting GPT-4V with scientific figures. ArXivQA substantially improves the mathematical reasoning capabilities of open-source LVLMs, yielding an absolute accuracy gain of 10.4% on a multimodal mathematical reasoning benchmark.
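
A similar sketch for ArXivQA is shown below, printing the question, answer options, and label of a few samples. Again, the repository ID ("MMInstruction/ArxivQA") and field names ("question", "options", "label") are assumptions; the actual release may use different names.

```python
# Minimal sketch: inspect a few ArXivQA multiple-choice records.
# Repository ID and field names are assumptions, not confirmed by the source.
from datasets import load_dataset

arxivqa = load_dataset("MMInstruction/ArxivQA", split="train", streaming=True)

for sample in arxivqa.take(2):
    print("Question:", sample.get("question"))
    print("Options:", sample.get("options"))
    print("Answer:", sample.get("label"))
```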