MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering

Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning, which are the key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.