LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which make it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 $\to$ 0.8420), CORD (0.9493 $\to$ 0.9601), SROIE (0.9524 $\to$ 0.9781), Kleister-NDA (0.8340 $\to$ 0.8520), RVL-CDIP (0.9443 $\to$ 0.9564), and DocVQA (0.7295 $\to$ 0.8672). We made our model and code publicly available at \url{https://aka.ms/layoutlmv2}.
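To make the spatial-aware self-attention idea concrete, the sketch below shows a minimal single-head PyTorch module in which the attention logits are augmented with learnable biases indexed by bucketed relative 1D token positions and relative 2D (x, y) box positions. This is a simplified illustration of the mechanism described in the abstract, not the authors' implementation; the class name, bucket sizes, and single-bias-per-table design are assumptions for readability (the released model uses multi-head, log-bucketed relative positions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAwareSelfAttention(nn.Module):
    """Simplified single-head sketch of spatial-aware self-attention.

    Attention scores receive learnable biases based on the relative 1D
    token position and the relative 2D (x, y) bounding-box positions,
    so the encoder can reason about page layout, not just token order.
    """

    def __init__(self, hidden, max_rel_1d=128, max_rel_2d=64):
        super().__init__()
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, hidden)
        self.scale = hidden ** -0.5
        # Learnable bias tables over clipped relative distances.
        self.rel_1d_bias = nn.Embedding(2 * max_rel_1d + 1, 1)
        self.rel_x_bias = nn.Embedding(2 * max_rel_2d + 1, 1)
        self.rel_y_bias = nn.Embedding(2 * max_rel_2d + 1, 1)
        self.max_rel_1d = max_rel_1d
        self.max_rel_2d = max_rel_2d

    def forward(self, hidden_states, token_pos, box_x, box_y):
        # hidden_states: (batch, seq, hidden)
        # token_pos, box_x, box_y: (batch, seq) integer positions
        q, k, v = self.q(hidden_states), self.k(hidden_states), self.v(hidden_states)
        scores = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        def rel_bias(pos, table, clip):
            # Pairwise relative distances, clipped and shifted into table range.
            rel = pos[:, :, None] - pos[:, None, :]
            rel = rel.clamp(-clip, clip) + clip
            return table(rel).squeeze(-1)  # (batch, seq, seq)

        scores = scores \
            + rel_bias(token_pos, self.rel_1d_bias, self.max_rel_1d) \
            + rel_bias(box_x, self.rel_x_bias, self.max_rel_2d) \
            + rel_bias(box_y, self.rel_y_bias, self.max_rel_2d)

        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, v)
```

In the full model these biases are shared across layers and computed per attention head; the sketch keeps one scalar bias per distance bucket to highlight how layout signals enter the attention computation.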
