Baidu Releases Unlimited-OCR for One-shot Long-horizon Parsing
Baidu has released Unlimited-OCR, an open-source model for document parsing built around one-shot, long-horizon optical character recognition. The model targets a long-standing limitation of text extraction pipelines by processing single images, multi-page layouts, and complex PDF documents in a single inference pass.
The system offers two configurations tuned for different workloads. The gundam mode handles single-image analysis with a 1024-pixel base resolution and dynamic cropping, while the base mode keeps the full 1024-pixel resolution for multi-page and whole-document parsing. According to the release, the model has been validated on Python 3.12.3 with CUDA 12.9 and NVIDIA GPU acceleration.
Enterprises and developers can integrate the model through two inference pathways. The HuggingFace Transformers framework allows direct model initialization with safetensors loading and bfloat16 precision, and exposes configurable parameters for maximum sequence length, n-gram repetition control, and window size. Alternatively, the SGLang deployment engine provides high-throughput batch processing alongside an OpenAI-compatible streaming API. This server setup supports concurrent document handling, automated PDF page rendering via PyMuPDF, and real-time token generation for enterprise-scale digitization workflows.
The model also incorporates specialized logit processors that suppress repetitive output during extended parsing tasks, a common failure mode in transformer-based OCR systems. With output sequences of up to 32,768 tokens and customizable repetition parameters, the model is designed to preserve structural fidelity across lengthy technical manuals, legal archives, and academic publications.
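To illustrate the kind of repetition control described above, here is a minimal sketch of a windowed no-repeat n-gram filter in plain Python. It is not Unlimited-OCR's actual logit processor: the function names `banned_next_tokens` and `mask_logits`, and the parameters `ngram_size` and `window_size`, are assumptions chosen for illustration. The idea is that, before each decoding step, any token that would complete an n-gram already present in the most recent window of output has its logit set to negative infinity.

```python
from typing import List, Sequence, Set


def banned_next_tokens(
    generated: Sequence[int], ngram_size: int, window_size: int
) -> Set[int]:
    """Return token ids that would repeat an n-gram seen in the recent window.

    Hypothetical helper for illustration; not part of any released API.
    """
    if ngram_size < 1 or window_size < ngram_size:
        raise ValueError("need ngram_size >= 1 and window_size >= ngram_size")
    if len(generated) < ngram_size - 1:
        return set()  # not enough context to form an n-gram prefix yet

    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    # Only the most recent window_size tokens are searched for repeats.
    window = generated[-window_size:]

    banned: Set[int] = set()
    for i in range(len(window) - ngram_size + 1):
        if tuple(window[i:i + ngram_size - 1]) == prefix:
            # This token followed the same prefix earlier in the window.
            banned.add(window[i + ngram_size - 1])
    return banned


def mask_logits(logits: List[float], banned: Set[int]) -> List[float]:
    """Set banned token logits to -inf so they can never be sampled."""
    return [float("-inf") if i in banned else x for i, x in enumerate(logits)]


if __name__ == "__main__":
    history = [5, 6, 7, 5, 6]  # toy token ids
    banned = banned_next_tokens(history, ngram_size=3, window_size=10)
    print(banned)
    print(mask_logits([0.0] * 8, banned))
```

The window is the design choice worth noting: a small `window_size` only blocks immediate loops, while a large one also blocks legitimate long-range repeats such as recurring table headers, so the two parameters need to be tuned together for long documents.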
Baidu positions Unlimited-OCR as a foundational component for information retrieval systems, automated data extraction pipelines, and digital archive management. The open-source release follows an industry shift toward unified multimodal parsing frameworks and away from fragmented legacy OCR stacks. Model weights and technical documentation are publicly available on GitHub, and the project acknowledges contributions from the DeepSeek-OCR, DeepSeek-OCR-2, and PaddleOCR initiatives.
