LaTeX OCR Mathematical Formula Recognition Dataset
The LaTeX OCR dataset targets the recognition of complex mathematical formulas, a hard problem in optical character recognition (OCR). It provides multiple configurations, each with different characteristics and data splits. For example, the "full" configuration contains about 100k printed samples, while the "synthetic_handwrite" configuration contains about 100k handwriting-style samples synthesized by rendering the printed formulas with handwriting fonts.
This repository has 5 datasets:
small
A small dataset with 110 samples, used for testing.
full
The complete dataset of about 100k printed formulas. The actual number of samples is slightly less than 100k, because formulas whose LaTeX cannot be rendered were removed using the LaTeX abstract syntax tree.
synthetic_handwrite
A complete dataset of about 100k handwriting-style formulas, based on full. The formulas are synthesized using handwriting fonts and can be regarded as human handwriting on paper. The actual number of samples is slightly less than 100k, for the same reason as above.
human_handwrite
A smaller handwriting dataset that is closer to human handwriting on electronic screens. It mainly comes from CROHME and has been verified using the LaTeX abstract syntax tree.
human_handwrite_print
The printed counterpart of human_handwrite. The formulas are the same as in human_handwrite, and the images are rendered from those formulas using LaTeX.
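As a rough illustration of how the five configurations might be selected in code, the sketch below assumes the dataset is published in a form that the Hugging Face `datasets` library can load. The helper `load_config` and the `CONFIGS` table are hypothetical names introduced here, and `repo_id` is a placeholder that must be replaced with the actual repository identifier, which this description does not state.

```python
# Configuration names as listed in the dataset description above.
# The short summaries are paraphrased from that description.
CONFIGS = {
    "small": "110 samples, intended for testing",
    "full": "about 100k printed formulas",
    "synthetic_handwrite": "about 100k formulas rendered with handwriting fonts",
    "human_handwrite": "smaller set of human handwriting, mainly from CROHME",
    "human_handwrite_print": "printed renderings of the human_handwrite formulas",
}


def load_config(repo_id, name="small", split=None):
    """Load one configuration (hypothetical helper, not part of the dataset).

    repo_id: assumed repository identifier; supply the real one yourself.
    name:    one of the keys in CONFIGS.
    split:   optional split name; available splits are not stated in the
             description, so None returns whatever the repository defines.
    """
    if name not in CONFIGS:
        raise ValueError(
            f"unknown configuration {name!r}; expected one of {sorted(CONFIGS)}"
        )
    # Imported lazily so the name check above works without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, name=name, split=split)
```

Starting with the small configuration is a cheap way to check that a pipeline runs end to end before downloading one of the configurations with about 100k samples.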
The LaTeX OCR dataset comes from multiple sources, including data collected from https://zenodo.org/record/56198#.V2p0KTXT6eA and https://www.isical.ac.in/~crohme/ as well as self-constructed data. It can be used to train and evaluate OCR models, especially on complex mathematical symbols and formulas. It has a wide range of applications in academic document digitization, online education, scientific research assistants, and personal learning.