Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction

Recent advances in multimodal pre-trained models have significantly improved information extraction from visually-rich documents (VrDs), in which named entity recognition (NER) is treated as a sequence-labeling task of predicting the BIO entity tags for tokens, following the typical setting of NLP. However, the BIO-tagging scheme relies on the correct order of model inputs, which is not guaranteed in real-world NER on scanned VrDs, where text is recognized and arranged by OCR systems. This reading order issue hinders the accurate marking of entities by the BIO-tagging scheme, making it impossible for sequence-labeling methods to predict correct named entities. To address the reading order issue, we introduce Token Path Prediction (TPP), a simple prediction head that predicts entity mentions as token sequences within documents. As an alternative to token classification, TPP models the document layout as a complete directed graph of tokens and predicts token paths within the graph as entities. For better evaluation of VrD-NER systems, we also propose two revised benchmark datasets of NER on scanned documents that reflect real-world scenarios. Experimental results demonstrate the effectiveness of our method and suggest its potential to be a universal solution to various information extraction tasks on documents.
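To make the reading order issue concrete, the following is a minimal sketch, not the implementation of TPP. It contrasts BIO decoding, which can only recover contiguous spans in the input order, with decoding entities as paths over directed token links. The function names `decode_bio` and `decode_paths`, the toy tokens, and the hand-written links are all assumptions made for illustration; in a real system the links would be predicted by a model, and single-token entities would need separate handling that this sketch omits.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[List[str]]:
    """Collect entities as contiguous B/I runs in the given input order."""
    entities: List[List[str]] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                entities.append(current)
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open entity: close any open span.
            if current:
                entities.append(current)
            current = []
    if current:
        entities.append(current)
    return entities


def decode_paths(tokens: List[str], links: List[Tuple[int, int]]) -> List[List[str]]:
    """Follow directed links (i -> j) from every token that has no incoming link."""
    successor = dict(links)
    targets = {j for _, j in links}
    starts = sorted(i for i in successor if i not in targets)
    paths: List[List[str]] = []
    for start in starts:
        path = [start]
        seen = {start}
        # Walk the chain of successors, guarding against accidental cycles.
        while path[-1] in successor and successor[path[-1]] not in seen:
            nxt = successor[path[-1]]
            path.append(nxt)
            seen.add(nxt)
        paths.append([tokens[i] for i in path])
    return paths


# Hypothetical OCR output: the entity "ACME Corp" is split by an unrelated token.
ocr_tokens = ["ACME", "Invoice", "Corp", "2023"]

# BIO tags cannot join "ACME" and "Corp", because they are not adjacent.
print(decode_bio(ocr_tokens, ["B", "O", "I", "O"]))

# A single directed link from token 0 to token 2 recovers the full mention.
print(decode_paths(ocr_tokens, [(0, 2)]))
```

In this toy case the BIO decoder yields only the fragment `ACME`, whereas the path decoder returns `ACME Corp` regardless of where OCR placed the intervening token, which is the property the abstract attributes to predicting token paths in a directed graph.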