
Converting dots.OCR to On-Device: A Guide to Core ML and MLX Integration for SOTA OCR on Apple Devices

Every year, Apple’s hardware becomes more powerful and models grow smarter per parameter. By 2025, running highly competitive models directly on device is more realistic than ever. dots.ocr, a 3-billion-parameter OCR model developed by RedNote, has outperformed Gemini 2.5 Pro on OmniDocBench, proving that on-device OCR can deliver top-tier results without compromise.

Running models locally offers major advantages: no API keys to manage, zero per-request cost, and no reliance on network connectivity. These benefits come with constraints, however: limited compute and strict power budgets. Apple’s Neural Engine, present in every Apple device since 2017, is engineered for high performance at minimal power draw; testing shows it can be up to 12 times more power efficient than the CPU and four times more efficient than the GPU. The catch is that the Neural Engine is only accessible through Core ML, Apple’s closed-source machine learning framework, and converting models from PyTorch to Core ML can be difficult, especially without pre-converted models or deep knowledge of the framework’s quirks. Fortunately, Apple’s MLX framework offers a more flexible alternative designed for GPU acceleration, and the two can share a single pipeline.

This three-part series walks through adapting dots.ocr for on-device execution using a hybrid approach: Core ML for the vision encoder and MLX for the language model backbone. The techniques outlined here are broadly applicable and aim to guide developers through the complexities of on-device AI deployment. To follow along, clone the repository and run the setup script using uv and huggingface-cli; for those who just want the converted model, it is available for download.

The conversion process involves two main steps: first, capturing the PyTorch execution graph using torch.jit.trace or torch.export, then compiling it into an .mlpackage using coremltools.
Most of the control lies in the first step, where the structure of the traced graph determines success. dots.ocr consists of two parts: a 1.2-billion-parameter vision encoder based on NaViT, trained from scratch, and a Qwen2.5-1.5B language model. The plan is to run the vision encoder via Core ML and the language model via MLX.

Before conversion, it’s essential to understand and simplify the model. The original vision encoder supports videos and image batches, but for on-device use, processing a single image at a time is sufficient. This simplification reduces complexity and improves compatibility. Next comes removing unnecessary components: the model ships with multiple attention implementations, but Core ML works best with scaled_dot_product_attention (sdpa), and switching to the standard (non-memory-efficient) sdpa implementation eliminates unsupported ops. A warning about Sliding Window Attention appears, but since that feature isn’t required here, it can be safely ignored.

The first conversion attempt using torch.jit.trace hits a roadblock: a dtype mismatch in a matmul operation where one tensor is int32 and the other is float32. The issue stems from torch.arange not respecting dtype hints during tracing; adding an explicit cast fixes it.

The next error occurs in repeat_interleave: “Cannot add const [None]”. This arises from code handling variable-length sequences via grid_thw. Since only one image is processed, that logic is unnecessary, and removing the repeat_interleave call resolves the issue.

A third error involves _internal_op_tensor_inplace_fill, which doesn’t support dynamic indexing. This is again tied to sequence masking. With a single image, a fixed mask of all True (converted to float zeros for Neural Engine compatibility) suffices.

Finally, a reshape error appears: “the result shape is not compatible with the input shape.” This traces back to a loop iterating over grid_thw, which introduces dynamic control flow, something many ML compilers cannot handle.
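Two of these fixes can be sketched concretely. The names and shapes below are illustrative assumptions, not the actual dots.ocr internals, but they show the pattern of each fix.

```python
import torch

# Fix 1: torch.arange can lose its dtype hint under torch.jit.trace,
# producing an integer tensor that later feeds a float matmul.
# An explicit cast before the matmul makes the traced graph type-safe.
inv_freq = torch.rand(8)            # float32 rotary frequencies (assumed shape)
pos = torch.arange(16).float()      # explicit cast: int64 -> float32
freqs = torch.outer(pos, inv_freq)  # float @ float, trace-safe

# Fix 2: with one image there are no padded sequences, so the dynamically
# built attention mask can be a constant. The Neural Engine prefers
# additive float masks, so "all True" (attend everywhere) becomes
# "all zeros" — adding zero to the attention logits changes nothing.
seq_len = 64
attn_mask = torch.zeros(1, 1, seq_len, seq_len)  # replaces the dynamic bool mask
```

The same additive-zero mask can then be passed straight into scaled_dot_product_attention, so no masking code path needs to change.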
Since only one image is processed, the loop can be replaced with a direct assignment of the single H, W pair. After these changes, the model converts successfully and matches the PyTorch output with minimal error (max difference: ~0.006, mean: ~1.1e-5).

Benchmarking, however, reveals a major issue: the model weighs over 5GB and takes more than a second for a single forward pass of the vision encoder, far too slow for real-time use. In the next part, we’ll integrate Core ML and MLX to run the full pipeline on-device. The final part will focus on optimizations like quantization and dynamic shape support to enable efficient execution on the Neural Engine.
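The loop replacement looks roughly like this. The position-id computation below is a simplified stand-in for the real rotary/window logic, and the patch grid size is an assumption; the point is that the Python loop disappears from the traced graph.

```python
import torch

# grid_thw holds one (t, h, w) row per input; with a single image t is 1.
grid_thw = torch.tensor([[1, 32, 32]])  # assumed 32x32 patch grid

# Before (breaks tracing — dynamic control flow over tensor contents):
# for t, h, w in grid_thw:
#     pos_ids.append(make_pos_ids(h, w))

# After: direct assignment from the single row, static in the traced graph.
h, w = int(grid_thw[0, 1]), int(grid_thw[0, 2])
pos_ids = torch.stack(
    torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij"),
    dim=-1,
).reshape(-1, 2)  # one (row, col) pair per patch
```

Because h and w are now plain Python ints at trace time, the downstream reshape sees concrete dimensions and the “result shape is not compatible” error disappears.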