
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, Conghui He

Abstract

OCR has evolved from line-level transcription to structural document parsing, requiring models to recover long sequences that include global layout, tables, and mathematical formulas. Despite recent progress in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is merely an artifact of serialization rather than an intrinsic property of the task. Building on this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces sequential autoregressive decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion denoiser together with an uncertainty-driven curriculum learning strategy to enable stable training and efficient inference on long sequences. Extensive experiments show that MinerU-Diffusion consistently improves robustness while achieving up to 3.2× faster decoding than autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced reliance on linguistic priors and its stronger visually grounded OCR capability.

One-sentence Summary

Researchers from Shanghai Artificial Intelligence Laboratory and Peking University propose MinerU-Diffusion, a unified diffusion framework that replaces autoregressive decoding with block-wise parallel denoising to achieve up to 3.2× faster inference while reducing semantic hallucinations in complex document OCR tasks.

Key Contributions

  • The paper introduces MinerU-Diffusion, a unified framework that reformulates document OCR as an inverse rendering problem by replacing autoregressive sequential decoding with parallel diffusion denoising under visual conditioning.
  • A block-wise diffusion decoder combined with an uncertainty-driven curriculum learning strategy is employed to enable stable training and efficient inference for long document sequences while mitigating error propagation.
  • Experiments on the Semantic Shuffle benchmark and full-document parsing tasks demonstrate that the method achieves up to 3.2× faster decoding than autoregressive baselines while reducing dependence on linguistic priors and improving robustness against semantic perturbations.

Introduction

Document OCR has shifted toward Vision-Language Models that parse complex layouts, tables, and formulas, yet current systems rely on autoregressive decoding, which creates sequential latency and amplifies error propagation in long documents. These autoregressive approaches also force models to depend heavily on linguistic priors, leading to semantic hallucinations when visual signals are weak or document structures are disrupted. The authors leverage an inverse rendering perspective to introduce MinerU-Diffusion, a unified framework that replaces sequential generation with parallel diffusion denoising under visual conditioning. By employing a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy, this method achieves up to 3.2× faster decoding while significantly improving robustness and reducing reliance on linguistic priors during text reconstruction.

Dataset

  • Dataset Composition and Sources: The authors construct a large-scale, diverse foundational dataset $\mathcal{D}_{base}$, derived entirely from the MinerU2.5 dataset. This collection contains approximately 7.5 million samples focused on Chinese and English document parsing tasks, with no dedicated evaluation performed for low-resource languages.

  • Key Details for the Subset: The data is curated to satisfy a high-entropy distribution $p_{div}(x)$ that covers diverse layouts, languages, document types, and visual styles. Although the dataset contains moderate annotation noise, its massive scale and variety are designed to support robust cross-domain generalization and stable feature learning.

  • Usage in Model Training: In Stage I, the authors use $\mathcal{D}_{base}$ for diversity-driven foundational learning to establish robust representations and general parsing abilities. Training on this dataset yields a smooth loss landscape that facilitates stable convergence and emphasizes broad visual-semantic alignment across multiple document understanding tasks.

  • Processing and Curation: The dataset is built through a process of data curation and automated annotation refinement. The authors prioritize diversity and balance over perfect annotation quality in this initial stage to ensure the model learns stable features from a wide range of document structures.
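The "high-entropy distribution" criterion above can be approximated by a simple diversity check over sample metadata. This is a hedged sketch, not the authors' pipeline: the `attribute_entropy` helper and the `layout`/`lang` metadata fields are hypothetical stand-ins for whatever annotations $\mathcal{D}_{base}$ actually carries.

```python
import math
from collections import Counter

def attribute_entropy(samples, key):
    """Shannon entropy (in bits) of one metadata attribute across samples.

    A higher value means the attribute (e.g. layout type or language) is
    covered more evenly, approximating the high-entropy target
    distribution p_div(x) described above."""
    counts = Counter(s[key] for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy metadata standing in for D_base annotations (hypothetical fields).
corpus = [
    {"layout": "single_column", "lang": "en"},
    {"layout": "two_column", "lang": "zh"},
    {"layout": "table_heavy", "lang": "en"},
    {"layout": "formula_heavy", "lang": "zh"},
]
print(attribute_entropy(corpus, "layout"))  # 2.0 bits: uniform over 4 layouts
```

A curation loop could then prefer adding samples whose attributes raise (or preserve) this entropy, which trades per-sample annotation quality for coverage, exactly the balance the authors describe for Stage I.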

Method

The authors model document OCR as the inverse rendering of a unified structured token sequence, where the output $\boldsymbol{y}$ encompasses text symbols, layout markers, table delimiters, and mathematical operators within a shared vocabulary $\mathcal{V}$. This unified representation allows for the encoding of heterogeneous document elements, such as paragraphs, tables, and formulas, within a single sequential interface. Although serialized as a one-dimensional sequence, the underlying structure is two-dimensional, and the statistical dependencies arise primarily from spatial arrangement rather than an intrinsic causal generation order. Consequently, the authors frame OCR output as a spatially coupled discrete random field. Refer to the framework diagram to see how the model maps a 2D document image to a 1D token sequence for decoding through autoregressive and diffusion-based methods. Unlike autoregressive decoding, which imposes a fixed causal order, the diffusion-based approach introduces a discrete diffusion process that enables global iterative refinement under visual conditioning.
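The global iterative refinement idea can be sketched as repeated mask-filling: at each pass the denoiser proposes tokens for all masked positions, and only the most confident predictions are committed. This is a minimal illustration, not the paper's model; the `denoiser` interface (returning a token and a confidence score per position) and the toy oracle are assumptions, and a real denoiser would condition on the document image.

```python
MASK = "<mask>"

def refine_step(tokens, denoiser, keep_ratio=0.5):
    """One discrete-diffusion refinement pass: unmask the most confident
    fraction of masked positions, leaving the rest for later passes."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    if not masked:
        return tokens
    preds = {i: denoiser(tokens, i) for i in masked}  # i -> (token, score)
    n_keep = max(1, int(len(masked) * keep_ratio))
    for i in sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_keep]:
        tokens[i] = preds[i][0]
    return tokens

# Toy denoiser that "knows" the target sequence (for illustration only).
target = ["<table>", "<tr>", "x", "=", "1", "</tr>", "</table>"]
toy = lambda toks, i: (target[i], 1.0 / (i + 1))

seq = [MASK] * len(target)
while MASK in seq:
    seq = refine_step(seq, toy)
print(seq)
```

Because every masked position is scored in parallel within a pass, the number of sequential steps depends on the schedule rather than on the sequence length, which is the source of the decoding speedup over token-by-token autoregression.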

To address the computational and structural limitations of full-attention diffusion on long documents, the authors introduce MinerU-Diffusion, a block-attention dVLM. The output sequence is partitioned into $B$ contiguous blocks, factorizing the conditional posterior to allow for parallel diffusion refinement within blocks while preserving a coarse autoregressive structure across blocks. This hybrid factorization prevents long-range alignment drift while maintaining parallel efficiency. Refer to the training diagram, which illustrates how the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. A structured attention mask is applied where tokens attend bidirectionally within each block and causally to all preceding blocks, reducing complexity from $O(L^2)$ to $O(BL'^2)$.
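The structured attention mask described above can be made concrete with a small sketch. This builds a boolean query-by-key mask for the stated pattern (bidirectional within a block, causal across blocks); the function name and the uniform block length are illustrative assumptions, not the authors' implementation.

```python
def block_attention_mask(num_blocks, block_len):
    """Boolean mask: mask[i][j] is True where query token i may attend
    to key token j. Tokens attend bidirectionally inside their own block
    and causally to all earlier blocks."""
    L = num_blocks * block_len
    mask = [[False] * L for _ in range(L)]
    for i in range(L):
        for j in range(L):
            if j // block_len <= i // block_len:  # same or earlier block
                mask[i][j] = True
    return mask

m = block_attention_mask(num_blocks=2, block_len=2)
# Token 0 (block 0) sees token 1 in its own block but nothing in block 1:
# m[0] == [True, True, False, False]
# Token 2 (block 1) sees its own block and all of block 0:
# m[2] == [True, True, True, True]
```

With block length $L'$, each query attends to at most its own block plus preceding blocks, which is how the per-block cost drops from quadratic in the full length $L$ to quadratic in $L'$.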

The training process employs a two-stage curriculum learning framework to leverage large-scale heterogeneous data and alleviate performance bottlenecks caused by noisy labels. In Stage I, the model undergoes large-scale OCR adaptation on an easier subset of data to establish foundational structure understanding. Stage II introduces uncertainty-driven boundary refinement, where hard samples are identified via inference consistency and processed through an AI-assisted human annotation pipeline for high-precision labels. The iterative refinement capability of the model is demonstrated in the layout decoding examples, where the model progressively decodes masks to reveal the final structured text and layout tags. Similarly, the formula recognition examples show the model generating complex LaTeX expressions through multiple diffusion steps, refining the output from initial masks to the final mathematical notation.
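The Stage II "hard samples identified via inference consistency" step can be sketched as an agreement check across stochastic decoding runs. This is a hedged simplification: the paper's consistency measure is not specified here, so exact-match agreement with a majority vote and the 0.7 threshold are illustrative choices, and the sample IDs are made up.

```python
from collections import Counter

def consistency(outputs):
    """Fraction of runs agreeing with the majority output."""
    if not outputs:
        return 0.0
    return Counter(outputs).most_common(1)[0][1] / len(outputs)

def split_by_uncertainty(predictions, threshold=0.7):
    """predictions: {sample_id: [decoded output per stochastic run]}.
    Low-consistency samples are routed to re-annotation."""
    easy, hard = [], []
    for sid, outs in predictions.items():
        (easy if consistency(outs) >= threshold else hard).append(sid)
    return easy, hard  # hard -> AI-assisted human annotation pipeline

runs = {
    "doc_a": ["x=1", "x=1", "x=1"],   # stable -> keep automatic label
    "doc_b": ["x=1", "x=7", "y=1"],   # unstable -> flag for refinement
}
easy, hard = split_by_uncertainty(runs)
```

The point of the split is that Stage II annotation effort concentrates only on the samples where the model's own predictions disagree, which is where noisy labels bottleneck performance.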

Experiment

  • Full-document parsing experiments on OmniDocBench validate that MinerU-Diffusion achieves strong end-to-end performance without oracle layout information, though a performance gap remains when layout prediction is imperfect, highlighting layout understanding as a primary bottleneck.
  • Table and formula recognition evaluations demonstrate that the model maintains structural integrity and competitive accuracy against autoregressive baselines, with particular strength in preserving table structures during diffusive decoding.
  • Analyses of confidence thresholds and decoding parallelism reveal a controllable trade-off where lower thresholds significantly boost inference throughput while higher thresholds improve structural consistency and accuracy, with a specific threshold identified as an optimal balance point.
  • Comparisons of decoding strategies show that dynamic scheduling outperforms static approaches by adaptively selecting tokens to reduce error accumulation while maintaining higher efficiency than fixed-step methods.
  • Ablation studies on attention mechanisms confirm that Block-Attn offers superior scalability and stability compared to Full-Attn by mitigating memory costs and preventing repetitive generation artifacts common in long sequences.
  • Curriculum learning experiments validate that a two-stage training framework effectively stabilizes optimization and refines boundaries, significantly outperforming single-stage approaches in challenging settings without ground truth layout.
  • Semantic Shuffle benchmark results indicate that diffusion-based decoding relies more directly on visual signals than autoregressive models, which tend to degrade sharply when semantic coherence is removed, suggesting greater robustness to linguistic priors.
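The confidence-threshold trade-off in the third bullet can be illustrated with a toy scheduler: each refinement pass commits every still-masked token whose confidence clears the threshold, falling back to the single best token so decoding always progresses. The fixed per-position scores and the `decode_steps` helper are assumptions for illustration, not the paper's scheduler.

```python
def decode_steps(scores, tau):
    """Count refinement passes needed if each pass accepts every remaining
    position with confidence >= tau (committing at least one per pass).

    Lower tau accepts more tokens per pass (fewer passes, higher
    throughput); higher tau defers low-confidence tokens, trading speed
    for structural consistency."""
    remaining = list(scores)
    steps = 0
    while remaining:
        steps += 1
        accepted = [s for s in remaining if s >= tau]
        if not accepted:  # always commit the single best token to progress
            accepted = [max(remaining)]
        for s in accepted:
            remaining.remove(s)
    return steps

conf = [0.95, 0.9, 0.8, 0.6, 0.5, 0.3]
low_tau = decode_steps(conf, tau=0.2)    # 1 pass: everything accepted at once
high_tau = decode_steps(conf, tau=0.85)  # 5 passes: cautious, mostly serial
```

This mirrors the reported behavior: an aggressive threshold approaches single-pass parallel decoding, while a conservative one approaches token-by-token refinement, with some intermediate threshold as the accuracy/throughput balance point.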
