Muharaf Handwritten Arabic Dataset
Date
Size
Publish URL
License
CC BY-NC-SA 3.0
Categories
* This dataset supports online use.Click here to jump.
The Muharaf dataset is a machine learning dataset focusing on handwritten Arabic recognition, created by Mehreen Saeed et al. in 2024. The related paper results are "Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition", accepted by NeurIPS 24. This dataset contains more than 1.6k images of historical handwritten pages transcribed by archival Arabic experts. Each document image is accompanied by the spatial polygon coordinates of its text lines as well as information on basic page elements. The Muharaf dataset was built to advance the state of the art in the field of handwritten text recognition (HTR), not only for Arabic manuscripts but also for connected texts.
The dataset contains a variety of writing styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondence. In the research paper, the authors describe the data acquisition process, the salient features and statistics of the dataset, and provide preliminary baseline results obtained by training convolutional neural networks using this data.
The Muharaf dataset is divided into two parts: the public part contains 1,216 images and is distributed under the CC BY-NC-SA 4.0 license; the restricted part contains 428 images, distributed under a proprietary license, and can only be downloaded by contacting Carlos Younes at the Phoenix Center for Lebanese Studies. This part of the data is only for research purposes and redistribution is not allowed. In addition, the Muharaf dataset was created using the ScribeArabic annotation software, and the manual of the software can help users understand how it works. The image files in the dataset and the corresponding annotations, transcriptions, and tags can be viewed using the PAGE-XML viewer.
