MinerU-Diffusion: Reformulating Document OCR as an Inverse Rendering Task via Diffusion Decoding
Hejun Dong Junbo Niu Bin Wang Weijun Zeng Wentao Zhang Conghui He
Abstract
Optical character recognition (OCR) has evolved from line-by-line transcription toward structured document parsing, requiring models to reconstruct long sequences that contain layout information, tables, and formulas. Despite recent advances in vision-language models, most existing systems still rely on autoregressive decoding, which incurs sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse-rendering perspective and argue that left-to-right causal generation is an artifact of serialization rather than an inherent property of the task. Motivated by this insight, we introduce MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-guided curriculum-learning strategy to enable stable training and efficient inference on long sequences. Extensive experiments show that MinerU-Diffusion consistently improves robustness while achieving up to 3.2× faster decoding than autoregressive baselines. Evaluations on the newly introduced Semantic Shuffle benchmark further confirm the method's reduced reliance on linguistic priors and its strengthened visual OCR capability.
One-sentence Summary
Researchers from Shanghai Artificial Intelligence Laboratory and Peking University propose MinerU-Diffusion, a unified diffusion framework that replaces autoregressive decoding with block-wise parallel denoising to achieve up to 3.2× faster inference while reducing semantic hallucinations in complex document OCR tasks.
Key Contributions
- The paper introduces MinerU-Diffusion, a unified framework that reformulates document OCR as an inverse rendering problem by replacing autoregressive sequential decoding with parallel diffusion denoising under visual conditioning.
- A block-wise diffusion decoder combined with an uncertainty-driven curriculum learning strategy is employed to enable stable training and efficient inference for long document sequences while mitigating error propagation.
- Experiments on the Semantic Shuffle benchmark and full-document parsing tasks demonstrate that the method achieves up to 3.2× faster decoding than autoregressive baselines while reducing dependence on linguistic priors and improving robustness against semantic perturbations.
Introduction
Document OCR has shifted toward Vision-Language Models that parse complex layouts, tables, and formulas, yet current systems rely on autoregressive decoding which creates sequential latency and amplifies error propagation in long documents. These autoregressive approaches also force models to depend heavily on linguistic priors, leading to semantic hallucinations when visual signals are weak or document structures are disrupted. The authors leverage an inverse rendering perspective to introduce MinerU-Diffusion, a unified framework that replaces sequential generation with parallel diffusion denoising under visual conditioning. By employing a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy, this method achieves up to 3.2× faster decoding while significantly improving robustness and reducing reliance on language models for text reconstruction.
Dataset
- Dataset Composition and Sources: The authors construct a large-scale, diverse foundational dataset, Dbase, derived entirely from the MinerU2.5 dataset. This collection contains approximately 7.5 million samples focused on Chinese and English document parsing tasks, with no dedicated evaluation performed for low-resource languages.
- Key Details for the Subset: The data is curated to satisfy a high-entropy distribution p_div(x) that covers diverse layouts, languages, document types, and visual styles. Although the dataset contains moderate annotation noise, its massive scale and variety are designed to support robust cross-domain generalization and stable feature learning.
- Usage in Model Training: In Stage I, the authors use Dbase for diversity-driven foundational learning to establish robust representations and general parsing abilities. Training on this dataset yields a smooth loss landscape that facilitates stable convergence and emphasizes broad visual-semantic alignment across multiple document understanding tasks.
- Processing and Curation: The dataset is built through a process of data curation and automated annotation refinement. The authors prioritize diversity and balance over perfect annotation quality in this initial stage to ensure the model learns stable features from a wide range of document structures.
Method
The authors model document OCR as the inverse rendering of a unified structured token sequence, where the output y encompasses text symbols, layout markers, table delimiters, and mathematical operators within a shared vocabulary V. This unified representation allows for the encoding of heterogeneous document elements, such as paragraphs, tables, and formulas, within a single sequential interface. Although serialized as a one-dimensional sequence, the underlying structure is two-dimensional, and the statistical dependencies arise primarily from spatial arrangement rather than intrinsic causal generation order. Consequently, the authors frame OCR output as a spatially coupled discrete random field. Refer to the framework diagram to see how the model maps a 2D document image to a 1D token sequence for decoding through autoregressive and diffusion-based methods. Unlike autoregressive decoding which imposes a fixed causal order, the diffusion-based approach introduces a discrete diffusion process that enables global iterative refinement under visual conditioning.
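As a concrete illustration of this unified interface, the sketch below interleaves text, layout, table, and formula tokens in one flat sequence. The token names are hypothetical; the paper does not specify its actual vocabulary:

```python
# Hypothetical unified vocabulary: one flat sequence mixes text tokens with
# layout markers, table delimiters, and math operators from a shared vocabulary V.
unified_sequence = [
    "<title>", "MinerU", "-", "Diffusion", "</title>",
    "<table>", "<tr>", "<td>", "3.2", "×", "</td>", "</tr>", "</table>",
    "<formula>", "\\frac", "{", "a", "}", "{", "b", "}", "</formula>",
]
# Although serialized as 1-D, adjacency here reflects 2-D spatial arrangement
# on the page, not a causal left-to-right generation order.
```

This is why the authors treat the output as a spatially coupled discrete random field rather than a causally ordered sequence: nothing about the table cell "3.2" is generated *because of* the title tokens that happen to precede it in serialization.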
To address the computational and structural limitations of full-attention diffusion on long documents, the authors introduce MinerU-Diffusion, a block-attention dVLM. The output sequence is partitioned into B contiguous blocks, factorizing the conditional posterior to allow for parallel diffusion refinement within blocks while preserving a coarse autoregressive structure across blocks. This hybrid factorization prevents long-range alignment drift while maintaining parallel efficiency. Refer to the training diagram which illustrates how the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. A structured attention mask is applied where tokens attend bidirectionally within each block and causally to all preceding blocks, reducing complexity from O(L²) to O(B·L′²).
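The structured attention mask described above can be sketched in a few lines of pure Python. The block size and boolean-matrix representation are illustrative choices, not the paper's implementation:

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Build a seq_len x seq_len boolean mask where entry [i][j] is True
    iff query token i may attend to key token j.

    Tokens attend bidirectionally within their own block and to every token
    in all preceding blocks; tokens in future blocks are masked out, giving
    the hybrid intra-block-parallel / inter-block-causal factorization.
    """
    block_of = [i // block_size for i in range(seq_len)]
    return [
        [block_of[j] <= block_of[i] for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = block_attention_mask(seq_len=6, block_size=2)
```

With blocks {0,1}, {2,3}, {4,5}: token 0 sees token 1 (bidirectional within a block) but not token 2 (a future block), while token 5 in the last block sees the entire sequence.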
The training process employs a two-stage curriculum learning framework to leverage large-scale heterogeneous data and alleviate performance bottlenecks caused by noisy labels. In Stage I, the model undergoes large-scale OCR adaptation on an easier subset of data to establish foundational structure understanding. Stage II introduces uncertainty-driven boundary refinement, where hard samples are identified via inference consistency and processed through an AI-assisted human annotation pipeline for high-precision labels. The iterative refinement capability of the model is demonstrated in the layout decoding examples, where the model progressively decodes masks to reveal the final structured text and layout tags. Similarly, the formula recognition examples show the model generating complex LaTeX expressions through multiple diffusion steps, refining the output from initial masks to the final mathematical notation.
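The random-masking forward process from the training diagram can be sketched as below. The `MASK_ID` sentinel and the per-token masking rule are illustrative assumptions; the paper's exact noise schedule is not reproduced here:

```python
import random

MASK_ID = -1  # hypothetical sentinel standing in for the [MASK] token


def corrupt_sequence(tokens, mask_ratio, rng):
    """Discrete-diffusion forward process: independently replace each target
    token with [MASK] at the given ratio. The model is then trained to
    predict only these masked positions, conditioned on the document image
    and the prompt, rather than on a left-to-right causal prefix.
    Returns the partially observed sequence and the masked indices."""
    noisy, masked_positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            noisy.append(MASK_ID)
            masked_positions.append(i)
        else:
            noisy.append(tok)
    return noisy, masked_positions


rng = random.Random(0)
noisy, positions = corrupt_sequence([101, 102, 103, 104], mask_ratio=0.5, rng=rng)
```

Because the loss is restricted to masked positions, every denoising step is a prediction under visual conditioning rather than a continuation of previously generated text, which is what weakens the model's dependence on linguistic priors.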
Experiment
- Full-document parsing experiments on OmniDocBench validate that MinerU-Diffusion achieves strong end-to-end performance without oracle layout information, though a performance gap remains when layout prediction is imperfect, highlighting layout understanding as a primary bottleneck.
- Table and formula recognition evaluations demonstrate that the model maintains structural integrity and competitive accuracy against autoregressive baselines, with particular strength in preserving table structures during diffusive decoding.
- Analyses of confidence thresholds and decoding parallelism reveal a controllable trade-off where lower thresholds significantly boost inference throughput while higher thresholds improve structural consistency and accuracy, with a specific threshold identified as an optimal balance point.
- Comparisons of decoding strategies show that dynamic scheduling outperforms static approaches by adaptively selecting tokens to reduce error accumulation while maintaining higher efficiency than fixed-step methods.
- Ablation studies on attention mechanisms confirm that Block-Attn offers superior scalability and stability compared to Full-Attn by mitigating memory costs and preventing repetitive generation artifacts common in long sequences.
- Curriculum learning experiments validate that a two-stage training framework effectively stabilizes optimization and refines boundaries, significantly outperforming single-stage approaches in challenging settings without ground truth layout.
- Semantic Shuffle benchmark results indicate that diffusion-based decoding relies more directly on visual signals than autoregressive models, which tend to degrade sharply when semantic coherence is removed, suggesting greater robustness to linguistic priors.
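The confidence-threshold trade-off measured in the experiments can be illustrated with a toy denoising step. The function and data shapes are hypothetical; the paper's actual dynamic scheduler is not specified at this level of detail:

```python
def denoise_step(sequence, predictions, mask_id, threshold):
    """One parallel denoising step: commit every masked position whose top
    prediction clears the confidence threshold; keep the rest masked for a
    later step. A low threshold commits many tokens per step (higher
    throughput); a high threshold commits few (more refinement steps, better
    structural consistency). If nothing clears the bar, the single most
    confident position is committed so decoding always makes progress.

    predictions: {position: (token, confidence)} for currently masked slots.
    """
    masked = [i for i, t in enumerate(sequence) if t == mask_id]
    if not masked:
        return sequence
    out = list(sequence)
    accepted = [i for i in masked if predictions[i][1] >= threshold]
    if not accepted:  # dynamic fallback: guarantee at least one commitment
        accepted = [max(masked, key=lambda i: predictions[i][1])]
    for i in accepted:
        out[i] = predictions[i][0]
    return out


preds = {1: (9, 0.95), 2: (3, 0.40)}
strict = denoise_step([5, -1, -1, 7], preds, mask_id=-1, threshold=0.9)
loose = denoise_step([5, -1, -1, 7], preds, mask_id=-1, threshold=0.3)
```

Here the strict threshold commits only the high-confidence token and defers the other, while the loose threshold fills both slots in a single step, mirroring the throughput-versus-accuracy trade-off reported above.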