HyperAI

LaTeX OCR Mathematical Formula Recognition Dataset

The LaTeX OCR dataset targets the recognition of complex mathematical formulas, a problem within optical character recognition (OCR). It ships in several configurations, each with its own characteristics and data splits. For example, the "full" configuration contains about 100k printed formulas, while the "synthetic_handwrite" configuration contains about 100k handwritten-style formulas, synthesized by rendering the printed formulas in handwriting fonts.

The repository contains five datasets:

  1. small: a small dataset of 110 samples, intended for testing.
  2. full: the complete printed dataset of about 100k formulas. The actual count is slightly below 100k, because LaTeX that could not be rendered was filtered out using a LaTeX abstract syntax tree.
  3. synthetic_handwrite: the complete handwritten dataset of about 100k formulas, synthesized from the formulas in full using handwriting fonts. It can be regarded as approximating human handwriting on paper. The actual count is slightly below 100k, for the same reason as above.
  4. human_handwrite: a smaller handwritten dataset that more closely matches human handwriting on electronic screens. It comes mainly from CROHME and was verified using a LaTeX abstract syntax tree.
  5. human_handwrite_print: the printed counterpart of human_handwrite. The formulas are the same as in human_handwrite, and the images are rendered from those formulas using LaTeX.
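The five configurations above can be captured in a small lookup table, which may help when choosing which ones to combine for training or evaluation. The following is a minimal sketch only: the `Config` class, the `CONFIGS` list, and the `configs_by_style` helper are hypothetical names introduced for illustration, not part of the dataset or any library. Fields are left as `None` wherever the description above gives no value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Config:
    """One dataset configuration, as described in the list above (hypothetical helper)."""
    name: str
    style: Optional[str]           # "printed" or "handwritten"; None if not stated
    approx_samples: Optional[int]  # rough sample count; None if not stated


# Values are taken from the description above; "about 100k" is stored as 100_000,
# although the real counts are stated to be slightly lower.
CONFIGS = [
    Config("small", None, 110),
    Config("full", "printed", 100_000),
    Config("synthetic_handwrite", "handwritten", 100_000),
    Config("human_handwrite", "handwritten", None),
    Config("human_handwrite_print", "printed", None),
]


def configs_by_style(style: str) -> List[str]:
    """Return the names of all configurations with the given style."""
    return [c.name for c in CONFIGS if c.style == style]
```

For instance, `configs_by_style("handwritten")` selects the two handwritten configurations, which could then be paired with their printed counterparts for comparison.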

The LaTeX OCR dataset draws on multiple sources: data collected from https://zenodo.org/record/56198#.V2p0KTXT6eA and https://www.isical.ac.in/~crohme/, together with self-constructed data. It can be used to train and evaluate OCR models, particularly on complex mathematical symbols and formulas, and has applications in academic document digitization, online education, scientific research assistants, and personal learning.

LaTeX_OCR.torrent
  • LaTeX_OCR/
    • README.md (2.29 KB)
    • README.txt (4.59 KB)
    • data/
      • LaTeX_OCR.zip (905.81 MB)
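Once the download completes, it can be useful to check what the archive contains before extracting roughly 900 MB of data. The sketch below uses only Python's standard `zipfile` module to count files per top-level folder. The `summarize_archive` function name is hypothetical, and the internal layout of LaTeX_OCR.zip is not documented here, so the folder names it reports are whatever the archive happens to contain.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(zip_path):
    """Count files per top-level folder inside a zip archive.

    Files stored at the archive root are counted under ".".
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            parts = PurePosixPath(info.filename).parts
            top = parts[0] if len(parts) > 1 else "."
            counts[top] += 1
    return dict(counts)
```

Assuming the torrent was saved to the current directory, a call such as `summarize_archive("LaTeX_OCR/data/LaTeX_OCR.zip")` returns a dictionary mapping each top-level folder to its file count, without extracting anything.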