MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
Hejun Dong Junbo Niu Bin Wang Weijun Zeng Wentao Zhang Conghui He
Abstract
Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2× faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.
One-sentence Summary
Researchers from Shanghai Artificial Intelligence Laboratory and Peking University propose MinerU-Diffusion, a unified diffusion framework that replaces autoregressive decoding with block-wise parallel denoising to achieve up to 3.2× faster inference while reducing semantic hallucinations in complex document OCR tasks.
Key Contributions
- The paper introduces MinerU-Diffusion, a unified framework that reformulates document OCR as an inverse rendering problem by replacing autoregressive sequential decoding with parallel diffusion denoising under visual conditioning.
- A block-wise diffusion decoder combined with an uncertainty-driven curriculum learning strategy is employed to enable stable training and efficient inference for long document sequences while mitigating error propagation.
- Experiments on the Semantic Shuffle benchmark and full-document parsing tasks demonstrate that the method achieves up to 3.2× faster decoding than autoregressive baselines while reducing dependence on linguistic priors and improving robustness against semantic perturbations.
Introduction
Document OCR has shifted toward Vision-Language Models that parse complex layouts, tables, and formulas, yet current systems rely on autoregressive decoding, which creates sequential latency and amplifies error propagation in long documents. These autoregressive approaches also force models to depend heavily on linguistic priors, leading to semantic hallucinations when visual signals are weak or document structures are disrupted. The authors leverage an inverse rendering perspective to introduce MinerU-Diffusion, a unified framework that replaces sequential generation with parallel diffusion denoising under visual conditioning. By employing a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy, this method achieves up to 3.2× faster decoding while significantly improving robustness and reducing reliance on linguistic priors during text reconstruction.
Dataset
- Dataset Composition and Sources: The authors construct a large-scale, diverse foundational dataset, D_base, derived entirely from the MinerU2.5 dataset. This collection contains approximately 7.5 million samples focused on Chinese and English document parsing tasks, with no dedicated evaluation performed for low-resource languages.
- Key Details for the Subset: The data is curated to satisfy a high-entropy distribution p_div(x) that covers diverse layouts, languages, document types, and visual styles. Although the dataset contains moderate annotation noise, its massive scale and variety are designed to support robust cross-domain generalization and stable feature learning.
- Usage in Model Training: In Stage I, the authors use D_base for diversity-driven foundational learning to establish robust representations and general parsing abilities. Training on this dataset yields a smooth loss landscape that facilitates stable convergence and emphasizes broad visual-semantic alignment across multiple document understanding tasks.
- Processing and Curation: The dataset is built through data curation and automated annotation refinement. The authors prioritize diversity and balance over perfect annotation quality in this initial stage so that the model learns stable features from a wide range of document structures.
Method
The authors model document OCR as the inverse rendering of a unified structured token sequence, where the output y encompasses text symbols, layout markers, table delimiters, and mathematical operators within a shared vocabulary V. This unified representation allows for the encoding of heterogeneous document elements, such as paragraphs, tables, and formulas, within a single sequential interface. Although serialized as a one-dimensional sequence, the underlying structure is two-dimensional, and the statistical dependencies arise primarily from spatial arrangement rather than intrinsic causal generation order. Consequently, the authors frame OCR output as a spatially coupled discrete random field. Refer to the framework diagram to see how the model maps a 2D document image to a 1D token sequence for decoding through autoregressive and diffusion-based methods. Unlike autoregressive decoding which imposes a fixed causal order, the diffusion-based approach introduces a discrete diffusion process that enables global iterative refinement under visual conditioning.
To address the computational and structural limitations of full-attention diffusion on long documents, the authors introduce MinerU-Diffusion, a block-attention diffusion vision-language model (dVLM). The output sequence is partitioned into B contiguous blocks, factorizing the conditional posterior to allow for parallel diffusion refinement within blocks while preserving a coarse autoregressive structure across blocks. This hybrid factorization prevents long-range alignment drift while maintaining parallel efficiency. Refer to the training diagram, which illustrates how the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. A structured attention mask is applied where tokens attend bidirectionally within each block and causally to all preceding blocks, reducing attention complexity from O(L²) to O(B·L′²), where L′ = L/B is the block length.
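The structured attention pattern above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation; the function name and block parameters are hypothetical, but the mask semantics (bidirectional within a block, causal across blocks) follow the description.

```python
import numpy as np

def block_attention_mask(num_blocks: int, block_len: int) -> np.ndarray:
    """Boolean (L, L) mask, True = attention allowed.

    Tokens attend bidirectionally inside their own block and
    causally to every token in all preceding blocks.
    """
    L = num_blocks * block_len
    mask = np.zeros((L, L), dtype=bool)
    for b in range(num_blocks):
        start, end = b * block_len, (b + 1) * block_len
        mask[start:end, start:end] = True  # bidirectional within the block
        mask[start:end, :start] = True     # causal toward earlier blocks
    return mask

m = block_attention_mask(num_blocks=3, block_len=2)
```

Because each row only attends to at most its own block plus earlier blocks, the dense attention cost per block is quadratic in L′ rather than in L, which is where the O(B·L′²) figure comes from.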
The training process employs a two-stage curriculum learning framework to leverage large-scale heterogeneous data and alleviate performance bottlenecks caused by noisy labels. In Stage I, the model undergoes large-scale OCR adaptation on an easier subset of data to establish foundational structure understanding. Stage II introduces uncertainty-driven boundary refinement, where hard samples are identified via inference consistency and processed through an AI-assisted human annotation pipeline for high-precision labels. The iterative refinement capability of the model is demonstrated in the layout decoding examples, where the model progressively decodes masks to reveal the final structured text and layout tags. Similarly, the formula recognition examples show the model generating complex LaTeX expressions through multiple diffusion steps, refining the output from initial masks to the final mathematical notation.
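The masked-denoising objective described above can be sketched as follows. This is a toy NumPy version under stated assumptions: the mask token id, mask ratio, and vocabulary size are illustrative, and a real model would produce the logits; the point is that corruption replaces a random subset of target tokens with [MASK] and the loss is computed only over those positions.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 0        # hypothetical id of the [MASK] token
VOCAB = 20         # illustrative vocabulary size

def corrupt(tokens: np.ndarray, mask_ratio: float):
    """Replace a random fraction of target tokens with [MASK]."""
    mask = rng.random(tokens.shape) < mask_ratio
    noisy = np.where(mask, MASK_ID, tokens)
    return noisy, mask

def masked_loss(logits: np.ndarray, targets: np.ndarray, mask: np.ndarray):
    """Cross-entropy averaged over masked positions only."""
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return float((nll * mask).sum() / max(mask.sum(), 1))

targets = np.array([5, 7, 9, 11, 13, 17])
noisy, mask = corrupt(targets, mask_ratio=0.5)
logits = rng.standard_normal((len(targets), VOCAB))  # stand-in for model output
loss = masked_loss(logits, targets, mask)
```

Varying the mask ratio during training plays the role of the diffusion noise level: high ratios resemble early (heavily corrupted) diffusion steps, low ratios resemble late refinement steps.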
Experiment
- Full-document parsing experiments on OmniDocBench validate that MinerU-Diffusion achieves strong end-to-end performance without oracle layout information, though a performance gap remains when layout prediction is imperfect, highlighting layout understanding as a primary bottleneck.
- Table and formula recognition evaluations demonstrate that the model maintains structural integrity and competitive accuracy against autoregressive baselines, with particular strength in preserving table structures during diffusive decoding.
- Analyses of confidence thresholds and decoding parallelism reveal a controllable trade-off where lower thresholds significantly boost inference throughput while higher thresholds improve structural consistency and accuracy, with a specific threshold identified as an optimal balance point.
- Comparisons of decoding strategies show that dynamic scheduling outperforms static approaches by adaptively selecting tokens to reduce error accumulation while maintaining higher efficiency than fixed-step methods.
- Ablation studies on attention mechanisms confirm that Block-Attn offers superior scalability and stability compared to Full-Attn by mitigating memory costs and preventing repetitive generation artifacts common in long sequences.
- Curriculum learning experiments validate that a two-stage training framework effectively stabilizes optimization and refines boundaries, significantly outperforming single-stage approaches in challenging settings without ground truth layout.
- Semantic Shuffle benchmark results indicate that diffusion-based decoding relies more directly on visual signals than autoregressive models, which tend to degrade sharply when semantic coherence is removed, suggesting greater robustness to linguistic priors.
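The confidence-threshold trade-off analyzed above can be sketched as a decoding loop. This is a simplified NumPy illustration with a fixed random scorer standing in for the dVLM; the function name, sentinel value, and threshold are hypothetical. At each step, every still-masked position whose top-1 probability clears the threshold τ is committed in parallel, so lowering τ commits more tokens per step (fewer steps, higher throughput) while raising τ defers low-confidence positions to later refinement.

```python
import numpy as np

rng = np.random.default_rng(1)

def diffusion_decode(score_fn, length: int, tau: float, max_steps: int = 50):
    """Confidence-thresholded parallel unmasking (sketch).

    Commits all masked positions with top-1 probability >= tau each
    step, and always at least one position to guarantee progress.
    Returns the decoded tokens and the number of steps used.
    """
    tokens = np.full(length, -1)              # -1 marks a masked position
    for step in range(1, max_steps + 1):
        masked = np.where(tokens == -1)[0]
        if masked.size == 0:
            return tokens, step - 1
        probs = score_fn(tokens)              # (length, vocab) distributions
        conf = probs[masked].max(-1)
        pick = probs[masked].argmax(-1)
        accept = conf >= tau
        if not accept.any():                  # force the single best token
            accept[conf.argmax()] = True
        tokens[masked[accept]] = pick[accept]
    return tokens, max_steps

# toy scorer: fixed random distributions in place of the vision-conditioned model
fixed = rng.dirichlet(np.ones(8), size=16)
tokens, steps = diffusion_decode(lambda t: fixed, length=16, tau=0.3)
tokens_fast, steps_fast = diffusion_decode(lambda t: fixed, length=16, tau=0.0)
```

With τ = 0 every position is committed in a single step, mirroring the throughput end of the trade-off; a higher τ spreads commitments over more refinement steps, mirroring the accuracy end.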