DoPTA: Improving Document Layout Analysis using Patch-Text Alignment

The advent of multimodal learning has brought significant improvements to document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, work in this space often focuses on the textual aspect, using the visual space as auxiliary information. While some works have explored pure vision-based techniques for document image understanding, they either require OCR-identified text as input during inference or do not align with text in their learning procedure. Therefore, we present a novel image-text alignment technique specially designed to leverage the textual information in document images to improve performance on visual tasks. Our document encoder model DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks, without requiring OCR during inference. Combined with an auxiliary reconstruction objective, DoPTA consistently outperforms larger models while using significantly less pre-training compute. DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.
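The abstract names patch-text alignment but does not specify its form. As a rough illustration only, the sketch below shows one common way such an objective can be instantiated: a contrastive (InfoNCE-style) loss that pulls each image-patch embedding toward the embedding of the OCR word overlapping that patch, and away from other words. The class name, dimensions, temperature, and patch-to-word matching are all illustrative assumptions, not the actual DoPTA training objective.

```python
# Minimal sketch of a generic patch-text contrastive alignment loss.
# NOT the exact DoPTA objective; a hypothetical illustration assuming
# OCR words have already been matched to the patches they overlap.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchTextAlignmentLoss(nn.Module):
    def __init__(self, patch_dim: int, text_dim: int,
                 embed_dim: int = 256, temperature: float = 0.07):
        super().__init__()
        # Project patch and word features into a shared embedding space.
        self.patch_proj = nn.Linear(patch_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.temperature = temperature

    def forward(self, patch_feats, word_feats, patch_to_word):
        """
        patch_feats:   (P, patch_dim) encoder outputs for P image patches
        word_feats:    (W, text_dim)  embeddings of W OCR words
        patch_to_word: (P,) index of the word overlapping each patch,
                       or -1 for patches containing no text (ignored)
        """
        patches = F.normalize(self.patch_proj(patch_feats), dim=-1)
        words = F.normalize(self.text_proj(word_feats), dim=-1)

        # Similarity of every patch to every word.
        logits = patches @ words.t() / self.temperature  # (P, W)

        # InfoNCE: each text-bearing patch should match its own word.
        mask = patch_to_word >= 0
        return F.cross_entropy(logits[mask], patch_to_word[mask])


# Toy usage with random features: 196 ViT patches, 12 OCR words.
if __name__ == "__main__":
    loss_fn = PatchTextAlignmentLoss(patch_dim=768, text_dim=768)
    patch_feats = torch.randn(196, 768)
    word_feats = torch.randn(12, 768)
    patch_to_word = torch.randint(-1, 12, (196,))
    print(loss_fn(patch_feats, word_feats, patch_to_word).item())
```

Since the text encoder is only needed to compute this loss, a model trained this way can drop the OCR branch entirely at inference time, which is consistent with the abstract's claim that DoPTA requires no OCR during inference.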