LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Said Taghadouini Adrien Cavaillès Baptiste Aubertin


Abstract

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9x smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.

One-sentence Summary

LightOnOCR-2-1B is a 1B-parameter end-to-end multilingual vision-language model that converts document images into clean, naturally ordered text without brittle OCR pipelines, achieves state-of-the-art results on OlmOCR-Bench while remaining 9x smaller and substantially faster than prior best-performing models, predicts normalized bounding boxes for embedded images using RLVR with IoU-based rewards, and incorporates checkpoint averaging and task-arithmetic merging for robustness.

Key Contributions

  • We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images into clean text without brittle OCR pipelines. It achieves state-of-the-art results on OlmOCR-Bench while being 9x smaller and substantially faster than prior best-performing models.
  • We extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards.
  • We improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.

Introduction

Optical Character Recognition is critical for digitizing documents, but real-world files often contain complex layouts and scientific notation that challenge standard engines. Prior methods typically rely on brittle multi-stage pipelines requiring intermediate annotations, making adaptation to new domains costly and difficult. The authors introduce LightOnOCR-2-1B, a compact 1B-parameter end-to-end vision-language model that bypasses these pipelines to achieve state-of-the-art results. Their work incorporates large-scale distillation and reinforcement learning to ensure robustness while adding image localization capabilities through predicted bounding boxes.

Dataset

Dataset overview
  • Dataset Composition and Sources

    • The authors train LightOnOCR models using large-scale document OCR corpora built primarily through distillation from strong vision-language teachers.
    • LightOnOCR-2-1B upgrades the teacher model from Qwen2-VL-72B-Instruct to Qwen3-VL-235B-A22B-Instruct to improve mathematical notation and reduce formatting artifacts.
    • The mixture combines teacher-annotated pages from PDFA sources, scanned material for robustness, and publicly available OCR datasets.
    • Additional diversity comes from document-region crops annotated with GPT-4o and TeX-derived supervision obtained by compiling raw arXiv sources with the nvpdftex pipeline.
    • Explicit blank-page examples are included to enforce consistent targets for empty inputs and mitigate looping behaviors.
  • Processing and Normalization

    • A unified normalization pipeline maps heterogeneous sources to a single canonical target format before mixing and training.
    • Text is sanitized to remove spurious Markdown ticks and watermark artifacts that could pollute deduplication procedures.
    • Special cases like full-page embedded images and blank pages are homogenized to fixed targets such as a standard image placeholder or an empty string.
    • A LaTeX conversion pass enforces formatting invariants by restricting commands to math spans and standardizing tables to HTML targets.
    • Structured metadata tracks conversion success and KaTeX compatibility to enable simple filtering rules for the final training mixture.
    • Coordinate traces observed during annotation are removed from main supervised training targets but retained as a separate signal for bounding-box procedures.
  • Data Releases and Benchmarks

    • The PDFA-derived annotated subset is released under matching license terms as lightonai/LightOnOCR-mix-0126.
    • The extracted and normalized bounding box subset is publicly released as lightonai/LightOnOCR-bbox-mix-0126.
    • arXiv bounding boxes generated by nvpdftex are used for RL supervision rather than included in the main pretraining mixture.
    • The team introduces LightOnOCR-bbox-bench to measure localization capabilities with 290 manually reviewed samples and 565 automatically annotated arXiv samples.
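The normalization rules listed above can be illustrated with a minimal sketch. It only covers three of the described steps (fixed targets for blank pages and full-page images, and removal of stray Markdown ticks). The function name, its flags, and the placeholder string are hypothetical; the source does not give the exact target strings or the pipeline's code.

```python
import re

# Hypothetical placeholder; the source only says "a standard image placeholder".
IMAGE_PLACEHOLDER = "![image](image_0.png)"


def normalize_target(text: str, is_blank_page: bool = False,
                     is_full_page_image: bool = False) -> str:
    """Map one raw annotation to a single canonical training target (sketch)."""
    if is_blank_page:
        return ""  # blank pages share one fixed empty-string target
    if is_full_page_image:
        return IMAGE_PLACEHOLDER  # full-page images share one fixed target
    # Drop lines that consist only of a stray Markdown code-fence marker.
    kept = [ln for ln in text.splitlines()
            if not re.fullmatch(r"\s*`{3,}\w*\s*", ln)]
    cleaned = "\n".join(kept)
    # Collapse runs of three or more newlines into one blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()
```

Keeping every special case in one function means that all sources pass through the same rules before mixing, which is the point of a unified pipeline.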

Method

The authors propose LightOnOCR, a compact 1B-parameter vision-language model engineered to perform OCR without relying on task prompts at inference time. The core architecture integrates a vision encoder, a multimodal projector, and a language model decoder. As illustrated in the framework diagram, the pipeline begins with the vision encoder, which utilizes a native-resolution Vision Transformer initialized from pretrained weights. This component is critical for handling variable image sizes while preserving the spatial structure necessary for documents with diverse aspect ratios and fine typographic details.

Model architecture diagram showing the flow from NaViT vision encoder through projection to the Qwen3 language model decoder.

To bridge the vision and language modalities, the model employs a multimodal projector. This module consists of a two-layer MLP with GELU activation that projects visual features into the language model's embedding space. A key design choice involves applying spatial merging with a factor of 2 before projection. This operation effectively groups 2×2 patches, reducing the number of visual tokens by 4×. This reduction keeps the overall token count tractable for high-resolution inputs while maintaining sufficient spatial granularity. The projected visual tokens are then combined with text embeddings and fed into the language model decoder.
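The effect of spatial merging on the token budget can be checked with a short calculation. The sketch below counts visual tokens before and after merging for a given image size. The patch size of 14 pixels is an assumption made for illustration, as the source does not state it, and the function name is hypothetical.

```python
import math


def visual_token_count(height: int, width: int,
                       patch_size: int = 14, merge: int = 2) -> tuple[int, int]:
    """Return (tokens before merging, tokens after merging) for one image.

    patch_size=14 is an assumed value, not taken from the source.
    """
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    before = rows * cols
    # Each merge x merge block of neighbouring patches becomes one token.
    after = math.ceil(rows / merge) * math.ceil(cols / merge)
    return before, after


before, after = visual_token_count(1540, 1092)
print(before, after, before / after)
```

Under these assumptions, a 1540 by 1092 pixel page yields a 110 by 78 patch grid, and merging with a factor of 2 divides the token count by exactly 4.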

The decoder is initialized from a pretrained Qwen3 model and is tasked with producing a single, linearized representation of the page. To simplify the interface between modalities, the architecture removes standard image-break and image-end tokens. Instead, the decoder conditions on a single contiguous block of visual tokens followed by text tokens. This design yields a compact end-to-end vision-language model with a consistent generation format.

The training process for LightOnOCR-2 significantly updates the data and recipe compared to its predecessor. The pretraining mixture is scaled to 43M pages with stronger coverage of scanned documents and European languages. The maximum resolution is increased to 1540 pixels to improve legibility for small text and dense mathematical notation. Beyond standard transcription, the authors train an image-localization variant that predicts bounding boxes by extending the output format with normalized coordinates.
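As a rough illustration of the resolution limit, the sketch below rescales a page so that its longest side does not exceed 1540 pixels. It assumes the limit applies to the longest side and that the aspect ratio is preserved. The source states only the 1540-pixel maximum, and the function name is hypothetical.

```python
def cap_longest_side(width: int, height: int,
                     max_side: int = 1540) -> tuple[int, int]:
    """Scale (width, height) down so the longest side is at most max_side.

    Assumes the cap applies to the longest side; smaller pages are untouched.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# An A4 page rendered at 300 dpi is 2480 x 3508 pixels.
print(cap_longest_side(2480, 3508))
```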

To optimize specific failure modes without extra annotation, the team applies Reinforcement Learning with Verifiable Rewards (RLVR). They utilize GRPO for training, employing automatic checks such as binary unit tests on synthetic documents for OCR and IoU-based objectives for localization. Finally, lightweight weight-space techniques are employed to combine complementary gains. The authors use checkpoint averaging by souping the last 5 checkpoints to establish a strong baseline. They further apply task-arithmetic merging to balance the trade-offs between OCR quality and bounding box accuracy without additional training.
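To make the IoU-based objective concrete, here is a minimal sketch of a verifiable localization reward and of the group-relative reward normalization used by GRPO. The greedy matching rule, the handling of empty predictions, and all function names are assumptions for illustration; the source does not specify how the authors match boxes or shape the reward.

```python
import statistics

Box = tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def localization_reward(predicted: list[Box], reference: list[Box]) -> float:
    """Hypothetical reward: greedy one-to-one matching, averaged over the
    larger of the two box counts so missing or extra boxes lower the score."""
    if not predicted and not reference:
        return 1.0
    remaining = list(reference)
    total = 0.0
    for box in predicted:
        if not remaining:
            break
        best = max(remaining, key=lambda ref: iou(box, ref))
        total += iou(box, best)
        remaining.remove(best)
    return total / max(len(predicted), len(reference))


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the reward is computed directly from predicted and reference coordinates, it needs no learned reward model, which is what makes it verifiable.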

Experiment

The evaluation assesses transcription quality on OlmOCR-Bench and localization capabilities on a dedicated bounding box benchmark using single-pass inference without test-time heuristics. Findings demonstrate that the compact model outperforms larger end-to-end systems across scientific and layout-heavy documents while retaining reliable image localization. Additionally, the system achieves substantially higher inference throughput for practical high-volume processing, although its performance is primarily optimized for printed Latin-script materials rather than handwritten or non-Latin text.

The table compares LightOnOCR variants against various large multimodal models on OCR benchmarks. Evaluation metrics cover text, formula, and table recognition across English and Chinese. LightOnOCR-2-1B, a compact model, achieves competitive performance against much larger end-to-end baselines, with strong results on math- and table-heavy documents. RLVR training is shown to improve overall performance and reduce generation failures.

LightOnOCR model performance on OCR benchmarks

The table compares the localization performance of LightOnOCR models against a larger baseline on the OlmOCR and arXiv subsets. The smaller LightOnOCR-2-1B-bbox model achieves higher detection scores and count accuracy than the larger baseline, while overlap metrics remain comparable. Task-arithmetic merging shifts these metrics slightly, offering a trade-off between OCR quality and localization precision; the bbox-soup variant shows this trade-off relative to the standard bounding box model.

Bounding box detection results on OlmOCR and arXiv
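The weight-space techniques behind these variants can be sketched in a few lines. The example below uses plain dictionaries of float lists as stand-ins for model state dicts. Checkpoint averaging takes the element-wise mean of several checkpoints, and task-arithmetic merging adds a scaled task vector (fine-tuned weights minus base weights) to a target model. The function names, the scaling coefficient `alpha`, and the choice of which models play each role are illustrative assumptions, not the authors' exact recipe.

```python
Weights = dict[str, list[float]]  # stand-in for a model state dict


def average_checkpoints(checkpoints: list[Weights]) -> Weights:
    """Element-wise mean of several checkpoints ("souping")."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


def task_arithmetic_merge(base: Weights, finetuned: Weights,
                          target: Weights, alpha: float) -> Weights:
    """Return target + alpha * (finetuned - base), parameter by parameter.

    alpha controls how much of the fine-tuned skill is transferred.
    """
    return {
        name: [t + alpha * (f - b)
               for t, f, b in zip(target[name], finetuned[name], base[name])]
        for name in target
    }


soup = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
merged = task_arithmetic_merge(
    base={"w": [1.0, 1.0]}, finetuned={"w": [2.0, 3.0]},
    target={"w": [10.0, 10.0]}, alpha=0.5,
)
print(soup, merged)
```

Since both operations act only on stored weights, sweeping `alpha` explores the trade-off between OCR quality and localization accuracy without any additional training.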

The table compares the percentage of loopy generations between the base checkpoint and the final LightOnOCR-2-1B model. The final model shows a notably lower rate of repetition loops than the base checkpoint, which supports the claim that RLVR training mitigates common generation failures and improves generation stability.

Comparison of generation loop rates across checkpoints
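A loop rate of this kind could be estimated with a simple repetition heuristic. The sketch below flags an output as loopy when any word n-gram occurs at least a given number of times, then reports the percentage of flagged outputs. The detection rule, the thresholds, and the function names are assumptions; the source does not describe how the authors identify loopy generations.

```python
def is_loopy(text: str, ngram: int = 8, min_repeats: int = 4) -> bool:
    """Heuristic: True if any word n-gram occurs min_repeats times or more."""
    tokens = text.split()
    counts: dict[tuple[str, ...], int] = {}
    for i in range(len(tokens) - ngram + 1):
        key = tuple(tokens[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= min_repeats:
            return True
    return False


def loop_rate(outputs: list[str], ngram: int = 8, min_repeats: int = 4) -> float:
    """Percentage of outputs flagged as loopy (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    flagged = sum(is_loopy(o, ngram, min_repeats) for o in outputs)
    return 100.0 * flagged / len(outputs)
```

Legitimately repetitive documents, such as tables with many identical cells, would need looser thresholds, so the values here are only a starting point.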

The authors evaluate inference efficiency by measuring pages processed per second on a single GPU. LightOnOCR-2 achieves the highest throughput among the listed models despite having a significantly smaller parameter count than the larger baselines. In this comparison, smaller model size goes along with higher inference speed.

Inference throughput comparison of OCR models

Measured in pages per second, LightOnOCR-2 achieves the highest throughput among all evaluated models, outperforming larger baselines by a significant margin and showing a substantial speedup over the slowest baseline. It reaches this speed with a smaller parameter count than most baselines, which makes it practical for high-volume document processing.

LightOnOCR-2 achieves highest throughput among models

The evaluation compares LightOnOCR variants against larger multimodal baselines across OCR benchmarks, localization tasks, and inference efficiency. Results indicate that the compact LightOnOCR-2-1B model achieves competitive accuracy in text and table recognition and superior detection scores, while RLVR training effectively reduces generation loops. Furthermore, the model delivers the highest inference throughput among tested systems, confirming its practicality for high-volume document processing while maintaining strong performance despite a smaller parameter count.

