Multimodal ArXiv Scientific Understanding Dataset
Multimodal ArXiv was released in 2024 by the University of Hong Kong and Peking University. The accompanying paper, "Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models", was accepted at ACL 2024.
The dataset consists of two parts, ArXivCap and ArXivQA, designed to improve the scientific understanding of large vision-language models (LVLMs).
ArXivCap is a figure-caption dataset containing 6.4 million images and 3.9 million captions drawn from 572K ArXiv papers spanning a wide range of scientific fields.
Building on ArXivCap, the research team introduced ArXivQA, a question-answering dataset generated by prompting GPT-4V with scientific figures. Training on ArXivQA substantially improves the mathematical reasoning of open-source LVLMs, yielding a 10.4% absolute accuracy gain on a multimodal mathematical reasoning benchmark.
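Below is a minimal sketch of how the two subsets might be loaded with the Hugging Face `datasets` library. The repository IDs ("MMInstruction/ArxivCap" and "MMInstruction/ArxivQA") and the field names shown in the comments are assumptions and may differ from the official release; streaming is used here only to avoid downloading the full image collection up front.

```python
# Sketch: loading ArXivCap and ArXivQA via the Hugging Face `datasets` library.
# The repository IDs below are assumed and may not match the official release.
from datasets import load_dataset

# Figure-caption pairs (ArXivCap); streaming avoids downloading ~6.4M images at once.
arxivcap = load_dataset("MMInstruction/ArxivCap", split="train", streaming=True)
cap_sample = next(iter(arxivcap))
print(cap_sample.keys())  # e.g. paper metadata plus a list of figure/caption pairs

# GPT-4V-generated question-answer pairs (ArXivQA).
arxivqa = load_dataset("MMInstruction/ArxivQA", split="train", streaming=True)
qa_sample = next(iter(arxivqa))
print(qa_sample.keys())  # e.g. question, answer options, label, and the associated figure
```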