HyperAI

dots.ocr Optimized with Core ML on Apple Devices

In 2025, on-device AI is reaching new heights. dots.ocr, a 3-billion-parameter OCR model developed by RedNote, outperforms even Google's Gemini 2.5 Pro on OmniDocBench, proving that high-accuracy, real-time OCR can run entirely on the device without relying on cloud APIs.

The key enabler is Apple's Neural Engine, a dedicated AI accelerator shipped in Apple devices since 2017, which delivers exceptional performance at minimal power draw: for AI workloads, up to 12x more efficient than the CPU and 4x more efficient than the GPU. Accessing this hardware, however, requires Core ML, Apple's closed-source machine learning framework, which poses challenges for developers accustomed to open ecosystems like PyTorch.

The conversion journey begins with transforming the PyTorch-based dots.ocr model into a Core ML-compatible format. The process has two main steps: first, capturing the model's execution graph with torch.jit.trace or the newer torch.export; second, compiling that graph into an .mlpackage via coremltools.

To keep the effort tractable, the team adopted a "make it work, make it right, make it fast" approach. They began by simplifying the architecture: targeting single-image inference instead of batch or video processing, removing complex attention mechanisms, and standardizing on the scaled dot-product attention (sdpa) operator that Core ML supports. These simplifications were crucial, because the original model relied on features such as sliding-window attention and dynamic masking that Core ML does not fully support.

Several technical hurdles emerged during conversion. An early error stemmed from a type mismatch in a rotary positional embedding layer, where torch.arange defaulted to int32 despite the intended fp32 precision. A simple cast fixed this.
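The dtype pitfall can be illustrated with a minimal, hypothetical rotary-embedding sketch. The function below is illustrative, not the actual dots.ocr code; only the `torch.arange` dtype behavior is the point.

```python
import torch

# Illustrative sketch of the dtype pitfall described above (not the real
# dots.ocr code). torch.arange with integer arguments defaults to an
# integer dtype, which can break or silently change fp32 math downstream
# during conversion; requesting fp32 explicitly is the fix.
def rope_frequencies(dim: int, seq_len: int, base: float = 10000.0) -> torch.Tensor:
    # Inverse frequencies for each pair of embedding channels.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    # The fix: request fp32 explicitly instead of the integer default.
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, inv_freq)  # shape: (seq_len, dim // 2)

freqs = rope_frequencies(dim=64, seq_len=16)
```

Without the explicit `dtype=torch.float32`, the position tensor comes out integer-typed, and the graph the converter sees no longer matches the fp32 computation the model intends.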
Another failure occurred in repeat_interleave, which choked on unsupported constant tensors used for sequence masking. Since the target was single-image inference, the team removed the variable-length sequence logic entirely. A third error, caused by in-place tensor filling with dynamic indices, was resolved by replacing boolean masks with float masks and eliminating dynamic control flow, specifically a loop over grid dimensions that was unnecessary when processing a single image. After these adjustments, the model converted successfully, producing near-identical output to the original PyTorch version (maximum difference: ~0.006).

Successful conversion was not enough, though. At over 5 GB, the model was far too large for on-device use, and inference latency exceeded one second on the vision encoder alone, rendering it impractical. This highlights the gap between model capability and deployment feasibility.

The solution lies in the next phases of the series: integrating Core ML with MLX (Apple's more flexible, GPU-focused framework) to offload the language-model backbone, and applying optimizations such as quantization (e.g., INT8), dynamic shape support, and memory-efficient kernels. These steps are essential for shrinking the model, improving speed, and fully utilizing the Neural Engine.

Industry experts see this work as a blueprint for the future of on-device AI. As Apple continues to strengthen its AI stack through hardware and software integration, developers must adapt by embracing hybrid frameworks and model simplification. Companies like RedNote are demonstrating that state-of-the-art performance and privacy-preserving on-device inference are not mutually exclusive. With the right tools and strategies, even large models can be deployed efficiently, ushering in a new era of intelligent, private, and real-time mobile applications.
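As a concrete illustration of the float-mask workaround described earlier, the hypothetical sketch below replaces boolean masking and in-place writes with a plain additive float mask; the names and shapes are assumptions, not the actual dots.ocr code.

```python
import torch

# Hypothetical sketch of the masking workaround: an in-place
# masked_fill_ with a boolean mask and dynamic indices can trip up the
# converter, so the mask is rebuilt as an additive float tensor that
# traces to ordinary arithmetic ops.
def additive_mask(valid: torch.Tensor, neg: float = -1e9) -> torch.Tensor:
    # valid: 1.0 where a position may be attended to, 0.0 where it is padding.
    return (1.0 - valid) * neg

scores = torch.zeros(4, 4)                  # dummy attention scores
valid = torch.tensor([1.0, 1.0, 1.0, 0.0])  # last position masked out
probs = (scores + additive_mask(valid)).softmax(dim=-1)
```

Adding a large negative value before the softmax drives the masked positions' probabilities to zero without any boolean indexing or in-place mutation, which keeps the traced graph inside the operator set the converter handles.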

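To make the INT8 quantization mentioned above concrete, here is a minimal sketch of symmetric per-tensor weight quantization. It is a hand-rolled illustration of the numerics only; a real deployment would use coremltools' built-in optimization utilities rather than this code.

```python
import torch

# Minimal sketch of symmetric per-tensor INT8 weight quantization,
# one of the size optimizations mentioned above. Each fp32 weight is
# approximated as scale * q with q an int8 value, cutting storage 4x.
def quantize_int8(w: torch.Tensor):
    # Map the weight range onto [-127, 127] with a single scale factor.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), min=-127, max=127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

torch.manual_seed(0)
w = torch.randn(256, 256)  # stand-in for one weight matrix of the model
q, scale = quantize_int8(w)
max_err = (w - dequantize(q, scale)).abs().max().item()
```

The reconstruction error is bounded by half a quantization step (scale / 2), which is why INT8 usually preserves accuracy well while shrinking a multi-gigabyte model substantially.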