InkSight: Offline-to-Online Handwriting Conversion by Teaching Vision-Language Models to Read and Write

Abstract

Digital note-taking is gaining popularity, offering a durable, editable, and easily indexable way to store notes in vectorized form, known as digital ink. However, a substantial gap remains between this practice and traditional pen-and-paper note-taking, which is still preferred by the majority of users. Our work, InkSight, aims to bridge this gap by enabling users who take notes by hand to effortlessly convert their handwritten notes (offline handwriting) into digital ink (online handwriting), a process we call derendering. Prior research on this topic has focused on the geometric properties of images, which limits generalization beyond the training domains. Our approach combines reading and writing priors, making it possible to train a model without the large amounts of paired samples that are difficult to obtain. To our knowledge, this is the first work to effectively derender handwritten text from arbitrary photos with diverse visual characteristics and backgrounds. Moreover, it generalizes beyond its training domain to simple sketches. Human evaluation shows that 87% of the samples generated by our model on the challenging HierText dataset are considered a valid reproduction of the input image, and 67% appear to be pen trajectories traced by a human.

One-sentence Summary

The authors, from Google DeepMind and EPFL, propose InkSight, a vision-language model that derenders offline handwriting from arbitrary photos into realistic digital ink trajectories by integrating reading and writing priors, enabling generalization beyond training domains without large paired datasets, with applications in digital note-taking and sketch conversion.

Key Contributions

  • InkSight introduces the first system for derendering arbitrary photos of handwritten text into digital ink, enabling seamless conversion from offline handwriting to editable, vectorized online handwriting without requiring specialized hardware or large paired datasets.
  • The method leverages vision-language models to integrate learned reading and writing priors, allowing robust performance across diverse visual conditions and generalization to sketches, unlike prior geometric-based approaches that lack domain flexibility.
  • Human evaluation on the HierText dataset shows 87% of generated inks are valid tracings of input images, with 67% perceived as natural pen trajectories, and the model is released with publicly available data to support future research.

Introduction

The authors leverage vision-language models to address the challenge of converting offline handwriting—photos of handwritten notes—into digital ink, a process known as derendering. This capability bridges the gap between traditional pen-and-paper note-taking and modern digital workflows, enabling users to preserve the natural feel of handwriting while gaining editability, searchability, and integration with digital tools. Prior work relied heavily on geometric priors and handcrafted heuristics, limiting generalization to specific scripts, clean backgrounds, or controlled conditions, and suffered from a lack of paired training data. The main contribution is a novel, data-efficient approach that combines learned reading and writing priors through a multi-task training framework, allowing the model to infer stroke order and spatial dynamics without requiring large-scale paired datasets. The system uses a simple architecture based on ViT and mT5, processes input images via OCR-guided word segmentation, and generates high-fidelity digital ink sequences that are both semantically and geometrically accurate, as validated by human and automatic evaluations. It generalizes across diverse handwriting styles, lighting, and even simple sketches, and the authors release a public model and dataset to advance future research.
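To make the pipeline concrete, below is a minimal sketch of how OCR-guided word segmentation followed by per-word derendering could be orchestrated. The helper names (`detect_words`, `derender_word`) and the crop/assembly logic are illustrative assumptions, not the authors' released interface.

```python
from typing import Callable, Iterable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x, y, width, height) in page coordinates

def derender_page(
    page_image,
    detect_words: Callable[..., Iterable[Tuple[BBox, str]]],          # hypothetical OCR word detector
    derender_word: Callable[..., List[List[Tuple[float, float]]]],    # hypothetical word-level derenderer
) -> List[dict]:
    """Segment a page into words with OCR, derender each crop, and map the
    resulting strokes back into page coordinates."""
    inks = []
    for bbox, text in detect_words(page_image):
        x, y, w, h = bbox
        crop = page_image[y:y + h, x:x + w]      # word-level crop (NumPy-style slicing)
        strokes = derender_word(crop, text)      # "Derender with Text" uses the OCR label
        inks.append({
            "bbox": bbox,
            "text": text,
            "strokes": [[(px + x, py + y) for px, py in s] for s in strokes],
        })
    return inks
```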

Dataset

  • The dataset comprises two main components: public and in-house collections for both OCR (text image) and digital ink (pen trajectory) data.
  • Public OCR training data includes RIMES, HierText, IMGUR5K, ICDAR'15 historical documents, and IAM, with word-level crops yielding ~295,000 Latin-script samples. In-house OCR data contains ~500,000 images, 67% handwritten, 33% printed, mostly in English.
  • Public digital ink data comes from VNOnDB, SCUT-Couch, and DeepWriting; DeepWriting is split into character-, word-, and line-level crops, contributing to a total of ~2.7 million public ink samples. In-house digital ink data contains ~16 million samples, with Mandarin (37%) and Japanese (23%) as the dominant languages.
  • All ink data undergoes normalization: resampling at 20 ms intervals, Ramer-Douglas-Peucker simplification for sequence reduction, and scaling to fit a 224×224 canvas centered at the origin.
  • Ink is tokenized into discrete tokens using a dictionary of size 2N + 3 (N = 224): one token per possible x value and per possible y value (0 to 224), plus a start-of-stroke token. Each point's coordinates are rounded to the nearest integer, so every point is encoded as two tokens (see the sketch after this list).
  • For image-based tasks, input images are scaled, centered, and padded to 224×224 with black padding. Rendered inks are generated using the Cairo library with random stroke color, background color, stroke width, and augmentations like grids, noise, and box blur.
  • Data filtering excludes samples with aspect ratios outside (0.5, 4.0) or dimensions below 25 pixels per side.
  • The model uses a mixed training setup: public model (Small-p) is trained on a combination of OCR and digital ink data with custom mixture ratios, while in-house models use larger, proprietary datasets.
  • The vocabulary is extended with the 2N + 3 ink-specific tokens, while the model's embedding and softmax sizes are reduced by ~80% without sacrificing performance.
  • Evaluation uses test splits from IAM, IMGUR5K, and filtered HierText (handwritten-only, ~1.3k samples) for automated derendering assessment. A small human-annotated golden set of ~200 traced samples from HierText is used for reference and human evaluation.
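As referenced above, here is a minimal sketch of the coordinate tokenization, assuming a 224×224 canvas (N = 224). The specific token ids and helper names are illustrative; only the overall layout (one start-of-stroke token plus 2(N + 1) coordinate tokens, i.e. a dictionary of size 2N + 3) follows the description in the bullets.

```python
import numpy as np

N = 224                      # canvas size; coordinates are quantized to 0..N
TOKEN_START_OF_STROKE = 0    # illustrative id for the start-of-stroke token
X_OFFSET = 1                 # x-coordinate tokens occupy ids 1 .. N+1
Y_OFFSET = N + 2             # y-coordinate tokens occupy ids N+2 .. 2N+2
VOCAB_SIZE = 2 * N + 3       # matches the 2N + 3 dictionary described above

def normalize_ink(strokes):
    """Scale strokes to fit the 224x224 canvas (the paper additionally centers the ink)."""
    pts = np.concatenate(strokes)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    scale = N / max((hi - lo).max(), 1e-6)
    return [np.clip((s - lo) * scale, 0, N) for s in strokes]

def tokenize_ink(strokes):
    """Encode each stroke as [start-of-stroke, x0, y0, x1, y1, ...]."""
    tokens = []
    for stroke in normalize_ink(strokes):
        tokens.append(TOKEN_START_OF_STROKE)
        for x, y in np.rint(stroke).astype(int):
            tokens.append(X_OFFSET + x)   # one token for the x coordinate
            tokens.append(Y_OFFSET + y)   # one token for the y coordinate
    return tokens
```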

Method

The authors leverage a hybrid vision-language model architecture, named InkSight, designed for digital ink understanding and generation. The framework integrates a Vision Transformer (ViT) encoder with an mT5-based encoder-decoder Transformer model. The ViT encoder, initialized with pre-trained weights, processes input images to extract visual features. These features are then combined with textual inputs in the mT5 encoder-decoder component, which generates outputs in a unified token space. This space includes both standard character tokens from the mT5 vocabulary and specialized tokens for representing ink strokes, enabling the model to handle both text and ink generation tasks. During training, the ViT encoder weights are frozen, while the mT5 encoder-decoder is initialized randomly to accommodate the customized token dictionary. The overall architecture is designed to support multiple inference modes, including text-guided derendering and standalone ink generation.
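A rough sketch of this wiring in PyTorch is shown below; a generic `nn.Transformer` stands in for the mT5 encoder-decoder, and the projection layer, shapes, and vocabulary handling are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class InkSightSketch(nn.Module):
    """Frozen ViT features + prompt tokens -> encoder-decoder -> unified
    text/ink vocabulary. Illustrative only; mT5 is approximated by nn.Transformer."""

    def __init__(self, vit: nn.Module, vocab_size: int, d_vit: int = 768, d_model: int = 512):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():      # ViT encoder stays frozen during training
            p.requires_grad = False
        self.proj = nn.Linear(d_vit, d_model)            # map visual features to model width
        self.embed = nn.Embedding(vocab_size, d_model)   # shared text + ink token embeddings
        self.seq2seq = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)    # logits over text + 2N+3 ink tokens

    def forward(self, image, prompt_ids, target_ids):
        visual = self.proj(self.vit(image))                        # (B, patches, d_model), assumed ViT output
        src = torch.cat([self.embed(prompt_ids), visual], dim=1)   # task prompt followed by image features
        tgt = self.embed(target_ids)                               # teacher-forced targets (ink and/or text)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        out = self.seq2seq(src, tgt, tgt_mask=mask)
        return self.lm_head(out)
```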

The model employs a multi-task training mixture to address the scarcity of diverse paired image-ink data. This setup includes two derendering tasks, two recognition tasks, and one hybrid task, each defined by a specific input text prompt that guides the model's behavior. The tasks are designed to enable the model to generalize to real-world photos, learn priors for handling occlusions, and generate realistic ink outputs. During training, all tasks are shuffled and assigned equal probability, ensuring balanced learning across different objectives. The training mixture supports flexible inference configurations, such as Derender with Text, which uses OCR results to guide derendering, or Vanilla Derender, which operates without textual input for non-textual elements.
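A small sketch of how such an equal-probability, prompt-conditioned task mixture could be sampled during training follows; the task names paraphrase the roles described above and the prompt strings are placeholders, not the exact prompts used by the authors.

```python
import random

# Two derendering tasks, two recognition tasks, and one hybrid task,
# distinguished only by the input text prompt (wording here is illustrative).
TASKS = [
    {"name": "vanilla_derender",       "prompt": "Derender the ink."},
    {"name": "derender_with_text",     "prompt": "Derender the ink for: <text>"},
    {"name": "recognize_from_image",   "prompt": "Recognize the text in the image."},
    {"name": "recognize_from_ink",     "prompt": "Recognize the text in the rendered ink."},
    {"name": "recognize_and_derender", "prompt": "Read the text and write it as ink."},
]

def sample_training_example(draw_example):
    """Pick a task uniformly at random, then draw an (image, prompt, target) triple.
    `draw_example` is a hypothetical task-specific data loader."""
    task = random.choice(TASKS)                    # all five tasks get equal probability
    image, target_tokens = draw_example(task["name"])
    return image, task["prompt"], target_tokens
```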

To bridge the domain gap between synthetic rendered ink images and real photos, data augmentation is applied to tasks that use rendered ink as input. This augmentation randomizes ink angle, color, and stroke width, and adds Gaussian noise and cluttered backgrounds. The model represents ink on a fixed-size canvas whose integer coordinate values map to discrete tokens: strokes are encoded as token sequences, with a dedicated token marking the beginning of each stroke followed by tokens for the sampled points along the stroke path. Text labels are represented as tokens within the same tokenization scheme, allowing the model to jointly process and generate both ink and text outputs. This unified representation enables the model to perform tasks such as recognizing text and derendering ink simultaneously.
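Below is a minimal sketch of this rendered-ink augmentation, using pycairo for rendering as mentioned in the dataset section; the parameter ranges (rotation, colors, stroke width, noise level) are illustrative assumptions, not the paper's exact settings.

```python
import math
import random
import numpy as np
import cairo

def render_augmented_ink(strokes, size=224):
    """Render strokes with randomized angle, color, and width, then add Gaussian
    pixel noise. Ranges are illustrative, not the paper's exact settings."""
    surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, size, size)
    ctx = cairo.Context(surface)

    # Random background and stroke colors.
    ctx.set_source_rgb(*[random.uniform(0.7, 1.0) for _ in range(3)])
    ctx.paint()
    ctx.set_source_rgb(*[random.uniform(0.0, 0.4) for _ in range(3)])

    # Random rotation about the canvas center and random stroke width.
    ctx.translate(size / 2, size / 2)
    ctx.rotate(math.radians(random.uniform(-10, 10)))
    ctx.translate(-size / 2, -size / 2)
    ctx.set_line_width(random.uniform(1.0, 4.0))

    for stroke in strokes:                 # stroke: list of (x, y) points
        ctx.move_to(*stroke[0])
        for x, y in stroke[1:]:
            ctx.line_to(x, y)
        ctx.stroke()

    surface.flush()
    img = np.frombuffer(surface.get_data(), dtype=np.uint8).reshape(size, size, 4)
    rgb = img[..., :3].astype(np.float32)
    noisy = rgb + np.random.normal(0, 8, rgb.shape)   # Gaussian pixel noise
    return np.clip(noisy, 0, 255).astype(np.uint8)
```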

Experiment

  • Large-i model outperforms GVS and smaller variants on HierText, achieving higher similarity to real digital inks and better handling of diverse styles, occlusions, and complex backgrounds; on IAM, models perform similarly to GVS but show improved robustness to background noise.
  • Human evaluation on 200 HierText samples shows Large-i achieves the highest proportion of "good tracing" ratings (68%) and is perceived as more human-like than Small-p and Small-i, with performance improving as model and data scale.
  • Automated evaluation confirms model ranking aligns with human judgment: Large-i achieves the highest Character Level F1 score (0.62) on HierText and superior Exact Match Accuracy (78.4%) in online handwriting recognition compared to GVS (21.3%), demonstrating better semantic and geometric fidelity.
  • Training with recognition tasks and data augmentation significantly improves derendering quality; removing recognition tasks reduces accuracy and increases sensitivity to background noise, while unfreezing ViT leads to instability and overfitting to noise.
  • Models generalize to out-of-domain inputs such as sketches and multilingual text (e.g., Korean, French), though performance degrades on unseen scripts, with Large-i showing better cross-lingual robustness than Small-p.
  • Derendered inks from IAM train set, when used to augment real IAMOnDB data, reduce CER by 12% compared to real data alone, demonstrating effective use of offline handwriting data for training online recognizers.
  • Inference with text input (Derender with Text) improves semantic consistency, especially on ambiguous inputs, while Vanilla Derender is more robust to noise but less semantically accurate.
  • Models fail to generate strokes for text not present in the input image, preventing open-ended generation and acting as a safeguard against misuse.

The authors evaluate the performance of their models on the IAMOnDB test set using Character Error Rate (CER) as a metric. Results show that models trained on derendered inks achieve higher CER than those trained on real inks, but combining derendered and real inks significantly reduces CER, demonstrating the effectiveness of derendered inks in augmenting online handwriting recognition training.

The authors use different dropout and learning rate values for their models across datasets, with Small-i and Large-i models using a dropout of 0.25 and a learning rate of 0.001 on IAMOnDB, while Small-p models use a dropout of 0.3 and a learning rate of 0.005 on the same dataset. These settings are consistent across the models when trained on the combined IAMOnDB and IAM derendered datasets.

Table 3 presents an ablation study on the Small-i model, examining the impact of different inference tasks and design choices. Results show that the Derender with Text task outperforms the other inference modes, and that removing data augmentation or the recognition tasks significantly degrades results across all datasets.

Results show that the Large-i model achieves the highest Character Level F1 score among the proposed models on the IAM and HierText datasets, while all models perform similarly on IMGUR5K. The GVS baseline attains a higher F1 than the proposed models on IAM but fails to produce recognizable text on the other two datasets, highlighting the importance of text-conditioned derendering for complex backgrounds.

The authors compare the performance of an online handwriting recognizer trained on real digital inks from IAMOnDB, on derendered inks from the IAM dataset, and on a combination of both. Results show that training the recognizer on a mix of real and derendered inks achieves a lower Character Error Rate (4.6%) compared to using only real inks (6.1%) or only derendered inks (7.8%), indicating that derendered inks can effectively augment real data for training.
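For reference, here is a minimal sketch of how Character Error Rate is computed (Levenshtein edit distance between the recognized and reference transcriptions, normalized by reference length); this is the standard definition rather than code from the paper.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + insertions + deletions) / len(reference)."""
    m, n = len(reference), len(hypothesis)
    # Dynamic-programming table for Levenshtein distance.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[n] / max(m, 1)

# Example: averaging CER over a small set of (reference, recognized) pairs.
pairs = [("handwriting", "handwritting"), ("digital ink", "digital ink")]
cer = sum(character_error_rate(r, h) for r, h in pairs) / len(pairs)
```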

