
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei
Abstract

Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.
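A minimal usage sketch of the released model, assuming the checkpoint is published on the Hugging Face Hub as "microsoft/layoutxlm-base" and is loadable through the Hugging Face Transformers LayoutXLM/LayoutLMv2 classes; the label count and the input image path are illustrative placeholders, not part of the paper.

```python
# Sketch only: checkpoint name, label count, and file path are assumptions.
from PIL import Image
from transformers import LayoutXLMProcessor, LayoutLMv2ForTokenClassification

# LayoutXLM shares the LayoutLMv2 architecture, so the LayoutLMv2 model
# classes are used with the multilingual LayoutXLM weights.
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutxlm-base",
    num_labels=7,  # e.g. BIO tags for header/question/answer spans (illustrative)
)

# Hypothetical scanned form image; the processor runs OCR by default and
# builds the joint text + layout (bounding boxes) + image inputs.
image = Image.open("form_sample.png").convert("RGB")
encoding = processor(image, return_tensors="pt", truncation=True)

outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)  # one predicted label id per token
```

In practice the classification head would be fine-tuned on XFUND-style annotations before the predictions are meaningful; the snippet only shows how the multimodal inputs are assembled and passed to the model.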
