DoPTA: Improving Document Layout Analysis using Patch-Text Alignment

SR Nikitha ; Menta Tarun Ram ; Sarkar Mausoom

Abstract

The advent of multimodal learning has brought a significant improvement in document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, works in this space are often focused on the textual aspect, using the visual space as auxiliary information. While some works have explored pure vision-based techniques for document image understanding, they require OCR-identified text as input during inference, or do not align with text in their learning procedure. Therefore, we present a novel image-text alignment technique specially designed for leveraging the textual information in document images to improve performance on visual tasks. Our document encoder model DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks, without requiring OCR during inference. Combined with an auxiliary reconstruction objective, DoPTA consistently outperforms larger models, while using significantly less pre-training compute. DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.

