LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose \textbf{LayoutLMv3} to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at \url{https://aka.ms/layoutlmv3}.
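
For concreteness, a minimal sketch of how the objectives described above could be combined; the notation and the equal-weight sum are illustrative assumptions, not taken verbatim from the paper:
\[
\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{MIM}} + \mathcal{L}_{\mathrm{WPA}},
\]
where $\mathcal{L}_{\mathrm{MLM}}$ is the cross-entropy of reconstructing masked text tokens and $\mathcal{L}_{\mathrm{MIM}}$ is the analogous cross-entropy over masked image patches (the "unified" masking). Following the abstract's description, the word-patch alignment term can be sketched as a binary classification over text words,
\[
\mathcal{L}_{\mathrm{WPA}} = -\sum_{i} \big[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\big],
\]
with $y_i = 1$ if and only if the image patch corresponding to word $i$ is masked, and $p_i$ the model's predicted probability of that event.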