
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion-based Decoding

Hejun Dong Junbo Niu Bin Wang Weijun Zeng Wentao Zhang Conghui He

Abstract

Optical character recognition (OCR) technology has evolved from line-level transcription to structured document parsing, requiring models to recover long sequences that include layouts, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency on long documents and amplifies error propagation. In this work, we revisit document OCR from the perspective of inverse rendering and argue that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion adopts a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy, enabling stable training and efficient long-sequence inference. Extensive experiments show that MinerU-Diffusion achieves up to 3.2× faster decoding than autoregressive baselines while consistently improving robustness. Evaluation on our newly proposed Semantic Shuffle benchmark further confirms that the method reduces reliance on linguistic priors and strengthens visual OCR capability.

One-sentence Summary

Researchers from Shanghai Artificial Intelligence Laboratory and Peking University propose MinerU-Diffusion, a unified diffusion framework that replaces autoregressive decoding with block-wise parallel denoising to achieve up to 3.2× faster inference while reducing semantic hallucinations in complex document OCR tasks.

Key Contributions

  • The paper introduces MinerU-Diffusion, a unified framework that reformulates document OCR as an inverse rendering problem by replacing autoregressive sequential decoding with parallel diffusion denoising under visual conditioning.
  • A block-wise diffusion decoder combined with an uncertainty-driven curriculum learning strategy is employed to enable stable training and efficient inference for long document sequences while mitigating error propagation.
  • Experiments on the Semantic Shuffle benchmark and full-document parsing tasks demonstrate that the method achieves up to 3.2× faster decoding than autoregressive baselines while reducing dependence on linguistic priors and improving robustness against semantic perturbations.

Introduction

Document OCR has shifted toward Vision-Language Models that parse complex layouts, tables, and formulas, yet current systems rely on autoregressive decoding, which creates sequential latency and amplifies error propagation in long documents. These autoregressive approaches also force models to depend heavily on linguistic priors, leading to semantic hallucinations when visual signals are weak or document structures are disrupted. The authors leverage an inverse rendering perspective to introduce MinerU-Diffusion, a unified framework that replaces sequential generation with parallel diffusion denoising under visual conditioning. By employing a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy, this method achieves up to 3.2× faster decoding while significantly improving robustness and reducing reliance on linguistic priors for text reconstruction.

Dataset

  • Dataset Composition and Sources: The authors construct a large-scale, diverse foundational dataset, denoted D_base, derived entirely from the MinerU2.5 dataset. This collection contains approximately 7.5 million samples focused on Chinese and English document parsing tasks, with no dedicated evaluation performed for low-resource languages.

  • Key Details for the Subset: The data is curated to satisfy a high-entropy distribution p_div(x) that covers diverse layouts, languages, document types, and visual styles. Although the dataset contains moderate annotation noise, its massive scale and variety are designed to support robust cross-domain generalization and stable feature learning.

  • Usage in Model Training: In Stage I, the authors use D_base for diversity-driven foundational learning to establish robust representations and general parsing abilities. Training on this dataset yields a smooth loss landscape that facilitates stable convergence and emphasizes broad visual-semantic alignment across multiple document understanding tasks.

  • Processing and Curation: The dataset is built through a process of data curation and automated annotation refinement. The authors prioritize diversity and balance over perfect annotation quality in this initial stage to ensure the model learns stable features from a wide range of document structures.

Method

The authors model document OCR as the inverse rendering of a unified structured token sequence, where the output y encompasses text symbols, layout markers, table delimiters, and mathematical operators within a shared vocabulary V. This unified representation allows for the encoding of heterogeneous document elements, such as paragraphs, tables, and formulas, within a single sequential interface. Although serialized as a one-dimensional sequence, the underlying structure is two-dimensional, and the statistical dependencies arise primarily from spatial arrangement rather than intrinsic causal generation order. Consequently, the authors frame OCR output as a spatially coupled discrete random field. Refer to the framework diagram to see how the model maps a 2D document image to a 1D token sequence for decoding through autoregressive and diffusion-based methods. Unlike autoregressive decoding, which imposes a fixed causal order, the diffusion-based approach introduces a discrete diffusion process that enables global iterative refinement under visual conditioning.
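To make the discrete diffusion process concrete, here is a minimal sketch of the absorbing-mask forward corruption that parallel denoising later inverts: at step t of T, each token is independently replaced by a [MASK] token. The `MASK_ID` value and the linear masking schedule are illustrative assumptions, not details from the paper.

```python
import random

MASK_ID = 0  # hypothetical id of the absorbing [MASK] token

def forward_mask(tokens, t, T, rng=random.Random(0)):
    """Absorbing-state discrete diffusion: at step t of T, each token is
    independently replaced by [MASK] with probability t / T."""
    p = t / T
    return [MASK_ID if rng.random() < p else tok for tok in tokens]

seq = [5, 9, 3, 7, 2, 8]
print(forward_mask(seq, t=3, T=6))  # roughly half the tokens masked
print(forward_mask(seq, t=6, T=6))  # fully masked: all MASK_ID
```

The reverse process (the decoder) starts from the fully masked sequence and iteratively fills positions back in under visual conditioning.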

To address the computational and structural limitations of full-attention diffusion on long documents, the authors introduce MinerU-Diffusion, a block-attention dVLM. The output sequence is partitioned into B contiguous blocks, factorizing the conditional posterior to allow for parallel diffusion refinement within blocks while preserving a coarse autoregressive structure across blocks. This hybrid factorization prevents long-range alignment drift while maintaining parallel efficiency. Refer to the training diagram, which illustrates how the target token sequence is randomly masked to form a partially observed input, and the model predicts only the masked positions under visual and prompt conditioning. A structured attention mask is applied where tokens attend bidirectionally within each block and causally to all preceding blocks, reducing complexity from O(L²) to O(B·L′²), where L′ is the block length.
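The structured attention pattern described above can be sketched as a boolean mask: bidirectional within each block, causal across blocks. This is an illustrative reconstruction from the description, not the authors' implementation.

```python
def block_causal_mask(num_blocks, block_len):
    """allowed[i][j] is True iff token i may attend to token j:
    bidirectional within a block, causal across earlier blocks."""
    L = num_blocks * block_len
    block = lambda i: i // block_len
    return [[block(j) <= block(i) for j in range(L)] for i in range(L)]

mask = block_causal_mask(num_blocks=3, block_len=2)
for row in mask:
    print(["x" if a else "." for a in row])
```

Each token attends only to its own block plus all preceding blocks, so per-token attention cost grows with block granularity rather than the full sequence length.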

The training process employs a two-stage curriculum learning framework to leverage large-scale heterogeneous data and alleviate performance bottlenecks caused by noisy labels. In Stage I, the model undergoes large-scale OCR adaptation on an easier subset of data to establish foundational structure understanding. Stage II introduces uncertainty-driven boundary refinement, where hard samples are identified via inference consistency and processed through an AI-assisted human annotation pipeline for high-precision labels. The iterative refinement capability of the model is demonstrated in the layout decoding examples, where the model progressively decodes masks to reveal the final structured text and layout tags. Similarly, the formula recognition examples show the model generating complex LaTeX expressions through multiple diffusion steps, refining the output from initial masks to the final mathematical notation.
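The masked-denoising training described above amounts to a cross-entropy objective computed only at masked positions, with unmasked tokens serving as context. A minimal plain-Python sketch; the function name, toy logits, and shapes are assumptions for illustration.

```python
import math

def masked_denoising_loss(logits, targets, masked):
    """Cross-entropy averaged over masked positions only; unmasked tokens
    are given as context and contribute no loss."""
    losses = []
    for pos, is_masked in enumerate(masked):
        if not is_masked:
            continue
        row = logits[pos]
        # negative log-softmax probability of the target class
        log_z = math.log(sum(math.exp(v) for v in row))
        losses.append(log_z - row[targets[pos]])
    return sum(losses) / len(losses) if losses else 0.0

# two positions, two-way vocabulary; only position 0 is masked
print(masked_denoising_loss([[5.0, 0.0], [0.0, 5.0]], [0, 1], [True, False]))
```

Averaging over masked positions keeps the loss scale comparable across different masking ratios drawn during training.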

Experiment

  • Full-document parsing experiments on OmniDocBench validate that MinerU-Diffusion achieves strong end-to-end performance without oracle layout information, though a performance gap remains when layout prediction is imperfect, highlighting layout understanding as a primary bottleneck.
  • Table and formula recognition evaluations demonstrate that the model maintains structural integrity and competitive accuracy against autoregressive baselines, with particular strength in preserving table structures during diffusive decoding.
  • Analyses of confidence thresholds and decoding parallelism reveal a controllable trade-off where lower thresholds significantly boost inference throughput while higher thresholds improve structural consistency and accuracy, with a specific threshold identified as an optimal balance point.
  • Comparisons of decoding strategies show that dynamic scheduling outperforms static approaches by adaptively selecting tokens to reduce error accumulation while maintaining higher efficiency than fixed-step methods.
  • Ablation studies on attention mechanisms confirm that Block-Attn offers superior scalability and stability compared to Full-Attn by mitigating memory costs and preventing repetitive generation artifacts common in long sequences.
  • Curriculum learning experiments validate that a two-stage training framework effectively stabilizes optimization and refines boundaries, significantly outperforming single-stage approaches in challenging settings without ground truth layout.
  • Semantic Shuffle benchmark results indicate that diffusion-based decoding relies more directly on visual signals than autoregressive models, which degrade sharply when semantic coherence is removed, suggesting that diffusion decoding is less dependent on linguistic priors.
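The confidence-threshold trade-off reported above can be made concrete with a toy iterative-unmasking decoder: each step, tokens whose predicted probability exceeds a threshold tau are committed in parallel, so a lower tau commits more tokens per pass (higher throughput) while a higher tau refines more cautiously. This is a hypothetical sketch, with `predict` standing in for the model; none of the names come from the paper.

```python
def diffusion_decode(predict, length, tau=0.9, max_steps=10):
    """Iteratively unmask a sequence: commit every still-masked position
    whose top probability exceeds tau; if none qualifies, force the single
    most confident one so decoding cannot stall.
    predict(seq) -> list of (token, probability) per position."""
    MASK = None
    seq = [MASK] * length
    for _ in range(max_steps):
        if MASK not in seq:
            break
        preds = predict(seq)
        committed = False
        for i, tok in enumerate(seq):
            if tok is MASK and preds[i][1] >= tau:
                seq[i] = preds[i][0]
                committed = True
        if not committed:
            i = max((i for i, t in enumerate(seq) if t is MASK),
                    key=lambda i: preds[i][1])
            seq[i] = preds[i][0]
    return seq

# toy model: confident at even positions first, then at positions whose
# left neighbor has already been decoded
target = [1, 2, 3, 4]
def toy_predict(seq):
    return [(target[i], 0.95 if i % 2 == 0 or seq[i - 1] is not None else 0.5)
            for i in range(len(seq))]

print(diffusion_decode(toy_predict, 4, tau=0.9))  # [1, 2, 3, 4] in two passes
```

Dynamic scheduling, as compared in the experiments, corresponds to letting the number of committed tokens vary per step instead of fixing it in advance.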
