HyperAI

Contains 140,000 Images! Huazhong University of Science and Technology Released a High-quality Oracle Bone Inscriptions Dataset, Helping the Team Win the ACL Best Paper Award


People have always sought to understand the past, and writing is one of the clearest records a civilization leaves behind, as well as a way to trace its development. Oracle bone script (OBS) is one of the earliest known systematic writing systems in China, dating back about 3,000 years and carrying the culture of the Chinese nation.

In recent years, oracle bone inscriptions have been unearthed one after another, and their contents are rich, covering astronomy, meteorology, animal husbandry, religion, and rituals. As with other ancient scripts, the meanings of many oracle bone characters have been lost over time. Among the 160,000 oracle bones unearthed, more than 4,600 distinct oracle bone characters have been found, but only about 1,500 of them have confirmed meanings and corresponding modern Chinese characters.

The task of deciphering oracle bone characters is complicated by several factors. Inadequate preservation and excavation methods in the past damaged many oracle bones, which often leaves the inscriptions partially blurred or illegible and makes them harder for researchers to decipher. As a result, most images currently used in oracle bone research are denoised, processed scans or manually transcribed images. In addition, as an early writing system, oracle bone script evolved significantly, and the forms of its characters vary greatly. Many characters that look different correspond to the same Chinese character, and this variability increases the complexity of the deciphering process.

Many factors clearly make it challenging to fully understand oracle bone inscriptions, yet deciphering even a single character is of great significance to historical research. The road ahead is long and arduous, which has also aroused great interest among scholars and historians of ancient Chinese studies.

3,000-year-old oracle bone inscriptions discovered by archaeologists

The emergence of artificial intelligence has given researchers a new way to approach this ancient script, and AI-assisted deciphering of oracle bones has become possible. However, as with AI applications in other fields, a comprehensive, high-quality dataset is essential. Datasets such as OBI-100, OBI-125, Oracle-20k, and HWOBC already exist in the oracle bone field, but they have certain limitations: some draw on a single data source with limited categories and samples; some contain only deciphered characters and therefore cannot support deciphering tasks; and some suffer from poor quality, high noise, or a single form of image.

In response, Wang Pengjie and colleagues from Professor Bai Xiang's research team at Huazhong University of Science and Technology proposed HUST-OBC, a high-quality dataset collected from three different sources: books, websites, and existing datasets. The dataset contains two types of oracle bone sample images. The first type consists of images obtained from processed scans of original oracle bone rubbings. The second type consists of handwritten images based on the original oracle bones, which are further subdivided into handwritten images based on rubbings and handwritten images based on glyphs.

Comparison of HUST-OBC with other datasets

The research, titled “An open dataset for oracle bone script recognition and decipherment”, was accepted by Scientific Data.

Paper address:

https://arxiv.org/abs/2401.15365

Download the dataset directly:

https://go.hyper.ai/46AiA

The open source project "awesome-ai4s" brings together more than 100 AI4S paper interpretations and provides massive data sets and tools:

https://github.com/hyperai/awesome-ai4s

Collect data from multiple sources and build a semi-automated production line

To build a diverse dataset, the researchers collected oracle bone images from three different sources: books, websites, and existing datasets. To organize and merge data from these sources, they used a semi-automated pipeline, shown in the figure below, that performs four key steps: data acquisition, automatic labeling, data integration, and data validation.

Flowchart of constructing HUST-OBC dataset

Data Acquisition

The oracle bones were carved on tortoise shells and animal bones and lay buried underground for more than 3,000 years. These precious artifacts are scattered among museums and private collections around the world and are carefully preserved, so it is quite challenging to obtain the text directly from the original oracle bones. To overcome this difficulty, the researchers used oracle bone images transcribed by experts, and obtained rich and diverse data by scanning authoritative books, crawling academic websites, and incorporating existing datasets.

Data acquisition and processing

Automatic Labeling

The collected raw data needs further processing, such as cropping, annotation, and screening. For data from books, existing OCR tools struggle to recognize the Chinese characters that label the oracle bone inscriptions, because these characters are relatively rare. The researchers therefore trained an OCR model covering nearly 90,000 Chinese characters to identify the Chinese character labels automatically. The images from websites and existing datasets had already been pre-processed and only required filtering and code matching.

Automatic Chinese Character OCR Method

Data Integration

Annotation standards may differ between sources, so the same oracle bone character can end up in different categories; for example, annotating with variant forms of a Chinese character creates redundant categories. To reduce this redundancy, the researchers trained MoCo, an unsupervised visual contrastive learning model, and merged similar samples into the same category.

Contrastive Learning in Data Integration
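
The merging idea can be illustrated with a minimal sketch. It assumes each image already has an embedding vector from a contrastively trained encoder such as MoCo, averages the embeddings per category, and merges categories whose centroids have a high cosine similarity. The function names, the toy 2-D vectors, and the 0.9 threshold are illustrative assumptions; the article does not describe the researchers' actual merging criterion.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def class_centroids(embeddings, labels):
    """Average the embedding vectors of each category."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label in sums:
            sums[label] = [s + x for s, x in zip(sums[label], vec)]
        else:
            sums[label] = list(vec)
        counts[label] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def merge_similar_categories(embeddings, labels, threshold=0.9):
    """Map each category to a representative; similar categories share one.

    The threshold is a hypothetical value chosen for this toy example.
    """
    centroids = class_centroids(embeddings, labels)
    names = sorted(centroids)
    parent = {name: name for name in names}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(centroids[a], centroids[b]) >= threshold:
                ra, rb = find(a), find(b)
                if ra != rb:
                    # Keep the alphabetically smaller name as representative.
                    parent[max(ra, rb)] = min(ra, rb)
    return {name: find(name) for name in names}


# Toy data: the same character labelled separately by two sources,
# plus one clearly different character.
embeddings = [[1.0, 0.0], [0.98, 0.05], [0.99, 0.1], [0.0, 1.0]]
labels = ["A_src1", "A_src1", "A_src2", "B_src1"]
print(merge_similar_categories(embeddings, labels))
# {'A_src1': 'A_src1', 'A_src2': 'A_src1', 'B_src1': 'B_src1'}
```

In practice the embeddings would be high-dimensional, and a merge proposed this way would still need expert review, which is the role of the validation step.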

Data Validation

Errors may occur during automatic data acquisition and annotation. The researchers therefore invited oracle bone scholars to review the data manually and provide guidance to ensure its accuracy, which resulted in the final HUST-OBC dataset.

The final HUST-OBC dataset contains 77,064 images of 1,588 deciphered characters and 62,989 images of undeciphered characters, for a total of 140,053 images. Some examples of deciphered and undeciphered data are shown below.

Example images of deciphered and undeciphered oracle bone inscriptions

To evaluate the quality of the dataset, the researchers trained a model on it. The deciphered portion was divided into training, validation, and test sets in an 8:1:1 ratio, and ResNet was used for the image classification task. The final classification accuracy was 94.6%, and the macro-average F1 score was 0.914. Some of the results are as follows:

Classification Metrics for Oracle Example
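
As a rough illustration of the evaluation setup, the sketch below performs a per-class 8:1:1 split and computes accuracy and macro-average F1 in plain Python. The function names and toy data are hypothetical, and the article does not say whether the researchers split per class or how they handled characters with very few images, so this is one possible reading rather than their procedure.

```python
import random
from collections import defaultdict


def split_8_1_1(samples, seed=0):
    """Split (image_id, label) pairs into train/val/test, class by class."""
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_val = len(items) // 10
        n_test = len(items) // 10
        n_train = len(items) - n_val - n_test
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 20 images of character "a" and 10 images of character "b".
samples = [(f"a{i}", "a") for i in range(20)] + [(f"b{i}", "b") for i in range(10)]
train, val, test = split_8_1_1(samples)
print(len(train), len(val), len(test))  # 24 3 3

y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
print(accuracy(y_true, y_pred))            # 0.75
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```

Because macro-average F1 weights every class equally, mistakes on rare characters pull it down more than they pull down overall accuracy, which is consistent with the reported F1 (0.914) being lower than the accuracy (94.6%). Note that with this splitting rule, a class with fewer than 10 images contributes no validation or test samples.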

The team's sustained work on oracle bone script won the ACL Best Paper Award

Huazhong University of Science and Technology has long been at the forefront of oracle bone script research and is one of the earliest universities in China to build an independent oracle bone script database. As the AI wave reshapes traditional scientific research, researchers represented by Professor Bai Xiang have once again become pioneers in AI-enabled oracle bone script research.

Professor Bai Xiang is a National Outstanding Young Scientist and IAPR Fellow. He currently serves as the Dean of the School of Software at Huazhong University of Science and Technology and the Director of the Hubei Engineering Research Center for Machine Vision and Intelligent Systems. Recently, the paper "Deciphering Oracle Bone Language with Diffusion Models" by Professor Bai Xiang and his team won the ACL 2024 Best Paper Award.

Based on the HUST-OBS dataset and the EVOBC dataset, this study takes an image generation approach and trains a conditional diffusion model, Oracle Bone Script Decipher (OBSD), which is optimized for oracle bone script deciphering. The model takes oracle bone characters from unseen categories as conditional input and generates images of the corresponding modern Chinese characters, offering a novel approach to ancient script decipherment, a task that is difficult to solve with natural language processing.

Conditional Diffusion Model for Oracle Decoding

The evaluation results show that, given oracle bone inscriptions as input, OBSD produces the most accurate modern Chinese character decipherments and can capture the complex details of the inscriptions. These results highlight not only the effectiveness of OBSD but also its potential as an expert tool for oracle bone decipherment.

Book draw

HyperAI and Electronics Industry Press are running a book giveaway! We have prepared 5 copies of the popular science book "AI for Science: Artificial Intelligence Drives Scientific Innovation". Come and join the lucky draw.

How to participate

Follow the HyperAI WeChat official account, send the message "AI4S free book" to the account, and open the lucky draw page to enter. The 5 books will be sent to the winners by express delivery.

Book Introduction

From predicting protein structure to inferring the pathogenicity of gene mutations, the new paradigm led by AI has allowed us to see new opportunities in various scientific fields, including life sciences.

The book "AI for Science: Artificial Intelligence Drives Scientific Innovation" focuses on the cross-integration of artificial intelligence with five major fields: materials science, life science, electronic science, energy science, and environmental science. It uses simple language to comprehensively introduce basic concepts, technical principles, and application scenarios, allowing readers to quickly master the basic knowledge of AI for Science. In addition, for each cross-field, the book provides a detailed introduction through cases, sorts out the industry map, and provides relevant policy inspiration.