
Abstract

Visual information extraction (VIE) plays an important role in Document Intelligence. Generally, it is divided into two tasks: semantic entity recognition (SER) and relation extraction (RE). Recently, pre-trained models for documents have achieved substantial progress in VIE, particularly in SER. However, most existing models learn the geometric representation implicitly, which has been found insufficient for the RE task, since geometric information is especially crucial for RE. Moreover, we reveal that another factor limiting RE performance is the objective gap between the pre-training phase and the fine-tuning phase for RE. To tackle these issues, we propose a multi-modal framework for VIE, named GeoLayoutLM. GeoLayoutLM explicitly models geometric relations in pre-training, which we call geometric pre-training. Geometric pre-training is achieved through three specially designed geometry-related pre-training tasks. Additionally, novel relation heads, which are pre-trained by the geometric pre-training tasks and fine-tuned for RE, are elaborately designed to enrich and enhance the feature representation. According to extensive experiments on standard VIE benchmarks, GeoLayoutLM achieves highly competitive scores on the SER task and significantly outperforms the previous state of the art for RE (e.g., the F1 score of RE on FUNSD is boosted from 80.35% to 89.45%). The code and models are publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/GeoLayoutLM
