LightOnOCR-mix-0126 Text Transcription Dataset

Date: 2 hours ago

Organization: LightOn

Paper URL: 2601.14251

License: Other

LightOnOCR-mix-0126 is a large-scale OCR text transcription dataset released by LightOn in 2026. It accompanies the paper "LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR" and is intended to provide supervision for end-to-end OCR and document understanding models that output full-page transcriptions in natural reading order.

The dataset is split into a training set and a validation set. Each sample is the text transcription of one document page, organized in natural reading order. Output formats include Markdown, LaTeX for mathematical formulas, and HTML for tables, together with the corresponding structural markup for page content such as paragraphs, headings, lists, and tables.
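To make the mixed output format concrete, here is a minimal Python sketch that counts a few kinds of markup in one page transcription. The sample string and the function name summarize_transcription are hypothetical and written for illustration only. They are not taken from the dataset, and the regular expressions are rough heuristics rather than a full Markdown, LaTeX, or HTML parser.

```python
import re

# Hypothetical page transcription, invented for illustration.
# It is NOT a real sample from LightOnOCR-mix-0126.
SAMPLE_TRANSCRIPTION = """# Quarterly Report

Revenue grew as described below.

- Item one
- Item two

The energy relation is $E = mc^2$.

<table>
<tr><th>Quarter</th><th>Revenue</th></tr>
<tr><td>Q1</td><td>10</td></tr>
</table>
"""


def summarize_transcription(text: str) -> dict:
    """Count a few kinds of markup in one page transcription (heuristic)."""
    return {
        # Markdown ATX headings: one to six '#' at line start, then a space.
        "headings": len(re.findall(r"^#{1,6} ", text, flags=re.MULTILINE)),
        # Markdown bullet items starting with '-' or '*'.
        "list_items": len(re.findall(r"^[ \t]*[-*] ", text, flags=re.MULTILINE)),
        # Inline LaTeX math delimited by single dollar signs.
        "inline_math": len(re.findall(r"(?<!\$)\$[^$\n]+\$(?!\$)", text)),
        # HTML tables, counted by their opening tag.
        "html_tables": len(re.findall(r"<table\b", text, flags=re.IGNORECASE)),
    }


summary = summarize_transcription(SAMPLE_TRANSCRIPTION)
print(summary)
```

A summary like this can serve as a quick sanity check when browsing samples, for example to see how many pages contain tables or formulas. The actual field names and delimiters used in the dataset should be confirmed against the dataset itself.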
