MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
Hejun Dong Junbo Niu Bin Wang Weijun Zeng Wentao Zhang Conghui He
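Abstract
Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, which requires models that can recover long sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective and argue that left-to-right causal generation is a byproduct of serialization rather than an intrinsic property of the task. Building on this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion uses a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments show that MinerU-Diffusion consistently improves robustness while achieving up to 3.2× faster decoding than autoregressive baselines. Evaluation on the proposed Semantic Shuffle benchmark further confirms that the model depends less on linguistic priors and has stronger visual OCR capability.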
One-sentence Summary
Researchers from Shanghai Artificial Intelligence Laboratory and Peking University propose MinerU-Diffusion, a unified diffusion framework that replaces autoregressive decoding with block-wise parallel denoising to achieve up to 3.2× faster inference while reducing semantic hallucinations in complex document OCR tasks.
Key Contributions
- The paper introduces MinerU-Diffusion, a unified framework that reformulates document OCR as an inverse rendering problem by replacing autoregressive sequential decoding with parallel diffusion denoising under visual conditioning.
- A block-wise diffusion decoder combined with an uncertainty-driven curriculum learning strategy is employed to enable stable training and efficient inference for long document sequences while mitigating error propagation.
- Experiments on the Semantic Shuffle benchmark and full-document parsing tasks demonstrate that the method achieves up to 3.2× faster decoding than autoregressive baselines while reducing dependence on linguistic priors and improving robustness against semantic perturbations.
Introduction
Document OCR has shifted toward vision-language models that parse complex layouts, tables, and formulas, yet current systems rely on autoregressive decoding, which creates sequential latency and amplifies error propagation in long documents. Autoregressive decoding also pushes models to lean heavily on linguistic priors, leading to semantic hallucinations when visual signals are weak or document structure is disrupted. The authors adopt an inverse rendering perspective and introduce MinerU-Diffusion, a unified framework that replaces sequential generation with parallel diffusion denoising under visual conditioning. By combining a block-wise diffusion decoder with an uncertainty-driven curriculum learning strategy, the method achieves up to 3.2× faster decoding while improving robustness and reducing reliance on linguistic priors for text reconstruction.
Dataset
- Dataset Composition and Sources: The authors construct a large-scale, diverse foundational dataset called D_base, derived entirely from the MinerU2.5 dataset. This collection contains approximately 7.5 million samples focused on Chinese and English document parsing tasks, with no dedicated evaluation performed for low-resource languages.
- Key Details for the Subset: The data is curated to satisfy a high-entropy distribution p_div(x) that covers diverse layouts, languages, document types, and visual styles. Although the dataset contains moderate annotation noise, its scale and variety are intended to support robust cross-domain generalization and stable feature learning.
- Usage in Model Training: In Stage I, the authors use D_base for diversity-driven foundational learning to establish robust representations and general parsing abilities. Training on this dataset yields a smooth loss landscape that facilitates stable convergence and emphasizes broad visual-semantic alignment across multiple document understanding tasks.
- Processing and Curation: The dataset is built through data curation and automated annotation refinement. The authors prioritize diversity and balance over perfect annotation quality in this initial stage so that the model learns stable features from a wide range of document structures.
Method
The authors model document OCR as the inverse rendering of a unified structured token sequence, where the output y encompasses text symbols, layout markers, table delimiters, and mathematical operators within a shared vocabulary V. This unified representation encodes heterogeneous document elements, such as paragraphs, tables, and formulas, within a single sequential interface. Although the output is serialized as a one-dimensional sequence, the underlying structure is two-dimensional, and the statistical dependencies arise primarily from spatial arrangement rather than from an intrinsic causal generation order. Consequently, the authors frame OCR output as a spatially coupled discrete random field. Refer to the framework diagram to see how the model maps a 2D document image to a 1D token sequence for decoding through autoregressive and diffusion-based methods. Unlike autoregressive decoding, which imposes a fixed causal order, the diffusion-based approach uses a discrete diffusion process that enables global iterative refinement under visual conditioning.
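The discrete diffusion process can be pictured as corrupting the target token sequence by masking and then learning to recover the masked tokens. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `MASK` symbol, the `corrupt` and `training_targets` helpers, and the uniform random choice of positions are all hypothetical and chosen only for illustration.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol; the paper's actual token is not given here


def corrupt(tokens, mask_ratio, rng):
    """Return (noisy sequence, sorted masked positions) for one corruption level."""
    if not tokens:
        return [], []
    # Mask at least one token and never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(mask_ratio * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    noisy = list(tokens)
    for i in positions:
        noisy[i] = MASK
    return noisy, positions


def training_targets(tokens, positions):
    """Only masked positions receive targets, so only they would enter a loss."""
    return {i: tokens[i] for i in positions}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
target = ["<table>", "<tr>", "<td>", "42", "</td>", "</tr>", "</table>"]
noisy, positions = corrupt(target, mask_ratio=0.5, rng=rng)
targets = training_targets(target, positions)
print(noisy)
print(targets)
```

Because every position can be masked or revealed independently of its index, nothing in this corruption step imposes a left-to-right order, which is the property the inverse rendering view relies on.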
To address the computational and structural limitations of full-attention diffusion on long documents, the authors introduce MinerU-Diffusion, a block-attention diffusion vision-language model (dVLM). The output sequence is partitioned into B contiguous blocks, and the conditional posterior is factorized so that diffusion refinement runs in parallel within each block while a coarse autoregressive structure is preserved across blocks. This hybrid factorization prevents long-range alignment drift while maintaining parallel efficiency. Refer to the training diagram, which illustrates how the target token sequence is randomly masked to form a partially observed input and how the model predicts only the masked positions under visual and prompt conditioning. A structured attention mask is applied in which tokens attend bidirectionally within each block and causally to all preceding blocks, reducing complexity from O(L^2) to O(B·L'^2).
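The structured attention pattern described above can be sketched as a boolean mask. This is an illustrative sketch only: the function name `block_attention_mask`, equal-sized blocks, and the omission of the visual and prompt tokens are assumptions made here for brevity.

```python
def block_attention_mask(seq_len, block_size):
    """Build mask[i][j], True when query position i may attend to key position j.

    Positions in the same block see each other (bidirectional), and every
    position also sees all positions in earlier blocks (causal across blocks).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    block_of = [i // block_size for i in range(seq_len)]
    return [
        [block_of[j] <= block_of[i] for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = block_attention_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
# 11....
# 11....
# 1111..
# 1111..
# 111111
# 111111
```

Each pair of rows is one block: the square of ones on the diagonal is the bidirectional part, and the ones to its left are the causal view of earlier blocks. Later blocks are never visible, so blocks that are already decoded do not need to be recomputed when a new block is refined.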
The training process employs a two-stage curriculum learning framework to leverage large-scale heterogeneous data and alleviate performance bottlenecks caused by noisy labels. In Stage I, the model undergoes large-scale OCR adaptation on an easier subset of data to establish foundational structure understanding. Stage II introduces uncertainty-driven boundary refinement, where hard samples are identified via inference consistency and processed through an AI-assisted human annotation pipeline for high-precision labels. The iterative refinement capability of the model is demonstrated in the layout decoding examples, where the model progressively decodes masks to reveal the final structured text and layout tags. Similarly, the formula recognition examples show the model generating complex LaTeX expressions through multiple diffusion steps, refining the output from initial masks to the final mathematical notation.
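One way to picture the Stage II selection of hard samples via inference consistency is to decode each sample several times and flag those whose outputs disagree. The sketch below is an assumption-laden illustration rather than the authors' procedure: the similarity measure (`difflib.SequenceMatcher`), the `consistency` and `select_hard_samples` helpers, the 0.9 threshold, and the sample identifiers are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency(outputs):
    """Mean pairwise similarity in [0, 1] between repeated decodes of one sample."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single decode cannot disagree with itself
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def select_hard_samples(decodes_by_id, threshold=0.9):
    """Return ids whose repeated decodes agree less than the (assumed) threshold."""
    return sorted(
        sample_id
        for sample_id, outputs in decodes_by_id.items()
        if consistency(outputs) < threshold
    )


# Hypothetical decodes: the first sample is stable, the second is not.
decodes_by_id = {
    "page-001": ["E = mc^2", "E = mc^2", "E = mc^2"],
    "page-002": ["\\frac{a}{b}", "a/b", "ab"],
}
hard = select_hard_samples(decodes_by_id)
print(hard)
```

In this sketch the flagged ids would then be routed to the annotation pipeline for high-precision labels; that step involves human review and is not modeled here.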
Experiment
- Full-document parsing experiments on OmniDocBench validate that MinerU-Diffusion achieves strong end-to-end performance without oracle layout information, though a performance gap remains when layout prediction is imperfect, highlighting layout understanding as a primary bottleneck.
- Table and formula recognition evaluations demonstrate that the model maintains structural integrity and competitive accuracy against autoregressive baselines, with particular strength in preserving table structures during diffusive decoding.
- Analyses of confidence thresholds and decoding parallelism reveal a controllable trade-off where lower thresholds significantly boost inference throughput while higher thresholds improve structural consistency and accuracy, with a specific threshold identified as an optimal balance point.
- Comparisons of decoding strategies show that dynamic scheduling outperforms static approaches by adaptively selecting tokens to reduce error accumulation while maintaining higher efficiency than fixed-step methods.
- Ablation studies on attention mechanisms confirm that Block-Attn offers superior scalability and stability compared to Full-Attn by mitigating memory costs and preventing repetitive generation artifacts common in long sequences.
- Curriculum learning experiments validate that a two-stage training framework effectively stabilizes optimization and refines boundaries, significantly outperforming single-stage approaches in challenging settings without ground truth layout.
- Semantic Shuffle benchmark results indicate that diffusion-based decoding relies more directly on visual signals than autoregressive models, which tend to degrade sharply when semantic coherence is removed, suggesting that the diffusion model depends less on linguistic priors.
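The confidence-threshold trade-off reported above can be illustrated with a toy decoder that, at each step, commits every masked position whose confidence clears the threshold. This is a sketch under stated assumptions: `threshold_decode`, `make_toy_predictor`, the confidence values, and the rule of always committing the single most confident position when none clears the threshold are hypothetical. The toy predictor always proposes the correct token, so it shows only the throughput side of the trade-off, not the accuracy side.

```python
MASK = None  # placeholder for a not-yet-decoded position


def threshold_decode(predict, length, threshold):
    """Unmask positions in parallel until none remain; return (tokens, steps)."""
    tokens = [MASK] * length
    steps = 0
    while any(t is MASK for t in tokens):
        preds = predict(tokens)  # {position: (token, confidence)} for masked positions
        accepted = [i for i, (_, conf) in preds.items() if conf >= threshold]
        if not accepted:
            # Guarantee progress: commit the single most confident position.
            accepted = [max(preds, key=lambda i: preds[i][1])]
        for i in accepted:
            tokens[i] = preds[i][0]
        steps += 1
    return tokens, steps


def make_toy_predictor(target, base_conf):
    """Toy stand-in for a model: confidence grows as more context is revealed."""
    def predict(tokens):
        revealed = sum(t is not MASK for t in tokens)
        boost = 0.1 * revealed
        return {
            i: (target[i], min(1.0, base_conf[i] + boost))
            for i, t in enumerate(tokens)
            if t is MASK
        }
    return predict


target = list("E=mc^2")
predict = make_toy_predictor(target, base_conf=[0.9, 0.5, 0.7, 0.6, 0.3, 0.8])

fast_tokens, fast_steps = threshold_decode(predict, len(target), threshold=0.5)
slow_tokens, slow_steps = threshold_decode(predict, len(target), threshold=0.95)
print(fast_steps, slow_steps)
```

A low threshold commits many positions per step and finishes in few steps, while a high threshold degenerates toward one token per step. In a real model, low-confidence commits can be wrong, which is where the accuracy cost of aggressive thresholds would come from.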