HyperAI超神经

Artificial intelligence is intervening in the literary world again, but this time it’s used to “find authors”

For some literary works, authorship is uncertain. When a work is very old and no detailed historical records survive, the question of who wrote it often becomes a mystery with many competing answers.

To get at the truth, later researchers must spend a great deal of effort gathering, studying, and comparing sources. Even so, practical limitations often keep them from the most critical evidence.

However, with the intervention of artificial intelligence, there seems to be another way to clear the fog.

Using data science to verify the author of Dream of the Red Chamber

It is generally believed that Cao Xueqin wrote the first eighty chapters of "Dream of the Red Chamber", and that Gao E compiled the text and wrote the last forty chapters as a continuation. Literary scholars such as Hu Shi, Yu Pingbo, and Zhou Ruchang agree with this view.

But there are also many different voices in the literary world. Many masters, including Lu Xun, Lin Yutang, Wang Guowei, and Pai Hsien-yung, all believe that all 120 chapters were completed by Cao Xueqin alone.

1. Statistical study published in 1980

As early as the first international "Dream of the Red Chamber" symposium in 1980, researchers used computer statistical methods to try to find out its actual author.

Mr. Chen Bingzao, a Chinese scholar at the University of Wisconsin, published a paper entitled "On the Authorship of A Dream of Red Mansions from the Perspective of Lexical Statistics", which attracted the attention of the international Redology community.

Chen Bingzao divided the 120 chapters of Dream of the Red Chamber into three groups of 40 chapters each, and added another novel, The Heroes of the Children (Ernü Yingxiong Zhuan), as a fourth group for comparison.

Research on the author of Dream of the Red Chamber has been going on for hundreds of years

An arbitrary 80,000 words were taken from each group, and five types of words were picked out: nouns, verbs, adjectives, adverbs, and function words. The computer programs of the time sorted, counted, and compared these words to find the degree of correlation between the groups.

The statistical results show that the positive correlation between the words used in the first eighty chapters and the last forty chapters of Dream of the Red Chamber is 78.57%, while the positive correlation between the words used in Dream of the Red Chamber and The Heroes of the Children is 32.14%.
From this, Professor Chen Bingzao inferred that the first eighty chapters and the last forty chapters were all written by Cao Xueqin alone.
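The article does not say how these correlation figures were computed. Purely as an illustration of the general idea, one way to compare two groups of text is to measure how often a fixed list of words occurs in each and take the Pearson correlation of the two frequency profiles. The word list, the sample texts, and the function names below are all hypothetical, and this is not claimed to be Chen's procedure.

```python
from collections import Counter
from math import sqrt


def profile(tokens, vocabulary):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: three groups of already-segmented text.
vocabulary = ["the", "of", "and", "but"]
group_a = "the of the and the of but the".split()
group_b = "the and the of the of the but".split()
group_c = "but and but of but and the but".split()

# Groups A and B use the words in the same proportions; group C does not.
similar = pearson(profile(group_a, vocabulary), profile(group_b, vocabulary))
different = pearson(profile(group_a, vocabulary), profile(group_c, vocabulary))
```

With this toy data, `similar` comes out higher than `different`, which is the kind of contrast the reported percentages are meant to express.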

2. A modern study using the SVM algorithm

But what conclusions can we draw if we use machine learning to make judgments?

In recent years, an engineer used a simple algorithmic analysis to study the authorship of Dream of the Red Chamber. Using Python tools, he trained a model on the frequency of words in the novel to tell apart the styles of its different parts.

He segmented the entire book into words and performed word frequency statistics. After finding the high-frequency words, he counted the number of times they appeared in each chapter, thereby obtaining the differences in word usage habits in different chapters.
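The counting step described above can be sketched in a few lines. This is a minimal illustration, assuming each chapter has already been segmented into a list of words (Chinese text needs a word segmenter first, and the article does not say which one was used). The toy chapters and the names `top_words` and `chapter_features` are hypothetical.

```python
from collections import Counter


def top_words(chapters, k):
    """Return the k most frequent words across all chapters."""
    totals = Counter()
    for tokens in chapters:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(k)]


def chapter_features(chapters, vocabulary):
    """Count how often each vocabulary word appears in each chapter."""
    rows = []
    for tokens in chapters:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return rows


# Hypothetical toy "chapters", already split into words.
chapters = [
    "the and the so the and".split(),
    "and the and the yet the".split(),
    "so and and and the and".split(),
]

vocabulary = top_words(chapters, k=2)               # high-frequency words
features = chapter_features(chapters, vocabulary)   # one row per chapter
```

Each row of `features` is the word-usage profile of one chapter, which is the kind of input a classifier can then be trained on.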

He then built a model using the SVM algorithm. A portion of the chapters from the first 80 and from the last 40 was fed to the model as training data so it could learn their writing characteristics, and the remaining chapters were used as input for the model to decide which part each belonged to.

The final model predicts with an accuracy of 95%. This indirectly suggests that, as far as the model is concerned, the first 80 chapters and the last 40 chapters differ markedly in writing style, which the engineer took as evidence that they come from different authors.
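The engineer's code is not shown in the article, and a real project would most likely rely on an existing SVM library. As a dependency-free sketch of the train-then-predict idea, the block below fits a linear SVM by sub-gradient descent on the hinge loss. The feature rows, labels, hyperparameters, and function names are all hypothetical toy values.

```python
def train_linear_svm(rows, labels, epochs=50, lr=0.1, reg=0.01):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss.

    Labels must be +1 or -1. This is a teaching sketch, not a tuned solver.
    """
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            # L2 regularisation shrinks the weights on every step.
            w = [wi * (1 - lr * reg) for wi in w]
            if y * score < 1:
                # The point is inside the margin: move the boundary away from it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical per-chapter counts of two marker words.
# +1 stands for "first 80 chapters", -1 for "last 40 chapters".
train_rows = [[2, 0], [3, 1], [0, 2], [1, 3]]
train_labels = [1, 1, -1, -1]
held_out_rows = [[4, 1], [1, 4]]
held_out_labels = [1, -1]

w, b = train_linear_svm(train_rows, train_labels)
predictions = [predict(w, b, x) for x in held_out_rows]
accuracy = sum(p == y for p, y in zip(predictions, held_out_labels)) / len(held_out_labels)
```

The held-out rows play the role of the "remaining chapters": the model never sees them during training, so the accuracy measured on them reflects how well the learned style boundary generalizes.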

Statistics of word usage in the first 80 chapters (red) and the last 40 chapters (blue)

This project also has weaknesses. For example, too few features were used (only 278 words were ultimately selected as indicators), and the training material was limited to a single book, so it cannot settle the question rigorously.

If the analysis of the authorship of "Dream of the Red Chamber" was just a trial run, then a scientist's recent study of the authorship of the famous play "Henry VIII" is considerably more accurate and rigorous.

The author of Henry VIII remains a mystery, AI takes action

The famous English play "Henry VIII" faces the same problem as "Dream of the Red Chamber". It is regarded as Shakespeare's last work, but it may in fact have more than one author.

Henry VIII was one of history's most tyrannical monarchs, comparable to a darker Qin Shi Huang. Between 1513 and 1547 alone, he is said to have ordered the execution of about 72,000 political prisoners, and even two of his six wives were sent to the scaffold.

Because of the topicality and legendary nature of the character himself, there has been an endless stream of literary and film and television works about him, such as the novel and its adapted film of the same name "The Other Boleyn Girl", and the TV series "The Tudors".

**The Other Boleyn Girl tells the story of Henry VIII's cruelty**
**Starring Scarlett Johansson (of Black Widow fame) and Natalie Portman (of Black Swan fame)**

The play "Henry VIII" was written in 1612. It adapts and interprets events from Henry VIII's life, has been staged many times, and was well received. But after studying the text, many people found that its writing style was very different from that of Shakespeare's other works.

Some questioned whether it was written by someone else or was the product of a collaboration. It was not until 1850 that a researcher specifically proposed that another playwright, Fletcher, may have been a collaborator on Henry VIII.

His reasoning: a great deal of Fletcher's distinctive writing style is found in Henry VIII.

Fletcher (left) became the chief playwright of the King's Men after Shakespeare (right) retired

Over the next century, debate over the authorship continued, with some even suggesting that a third playwright, Massinger, was involved.

A recent study has made this mystery much clearer. Using AI algorithms, a data scientist identified the authors of the play "Henry VIII" in fine detail, down to individual passages of the text.

Machine learning helps determine who the real author is

Petr Plecháč, a researcher at the Czech Academy of Sciences in Prague, recently used machine learning techniques to tackle the authorship problem in Henry VIII and achieved convincing results. His results were written up in a paper and uploaded to arXiv.

**Address: https://arxiv.org/pdf/1911.05652.pdf**

In this work, Plecháč approached the question from a data science angle, determining who wrote each part of "Henry VIII" and providing specific arguments.

He analyzed the content of textual works and identified certain characteristics of the writing styles of different authors, thereby distinguishing the works and making detailed divisions and classifications.

The algorithm ultimately attributed some scenes of Henry VIII to Shakespeare and others to Fletcher, with both contributing almost equally to the work. Not only that, the algorithm also refined the authorship of each specific section.

The first page of Henry VIII, first published in 1623

In the end, the author division given by machine learning was consistent with the views of a previous mainstream study and also achieved some breakthroughs.

Identify the source of the text by looking at its vocabulary and rhythm

How did he do this? Once an author's style, common words, and habitual patterns are known, they can be used to check the textual habits in a new work and judge whether it comes from the same author.

In this study, the model was trained on the common words and common sentence rhythm patterns in the texts so that it could learn to recognize these features.

Comprehensive analysis of sentence rhythm (rhythmic types) and common words
The model accuracy verified by other works is close to 1

Specifically, the script is first broken down into individual scenes, and a support vector machine is then used to attribute and classify each scene of Henry VIII.

Among them, the frequencies of the 500 most common rhythm types and the frequencies of the 500 most common words are used as the feature sets of the classifier.
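A rough sketch of how such a combined feature set could be assembled is shown below. It assumes each scene is available as a list of words plus a list of per-line rhythm types, here encoded as hypothetical stress strings in which `0` marks an unstressed syllable and `1` a stressed one. The encoding, the toy scenes, and all function names are assumptions for illustration, and `k` is set to 2 rather than 500 to keep the example small.

```python
from collections import Counter


def most_common_items(samples, k):
    """Return the k most frequent items over a list of item lists."""
    totals = Counter()
    for items in samples:
        totals.update(items)
    return [item for item, _ in totals.most_common(k)]


def relative_frequencies(items, selected):
    """Relative frequency of each selected item within one sample."""
    counts = Counter(items)
    total = len(items)
    return [counts[item] / total for item in selected]


def scene_vector(words, rhythms, top_words, top_rhythms):
    """Concatenate word features and rhythm features for one scene."""
    return (relative_frequencies(words, top_words)
            + relative_frequencies(rhythms, top_rhythms))


# Hypothetical toy scenes: words, and one stress pattern per verse line.
scene_words = [
    ["the", "king", "the", "and", "the"],
    ["and", "the", "queen", "and"],
]
scene_rhythms = [
    ["0101010101", "0101010101", "1001010101", "0101010101"],
    ["1001010101", "0101010101", "1001010101", "01010101010"],
]

top_w = most_common_items(scene_words, k=2)
top_r = most_common_items(scene_rhythms, k=2)
vectors = [scene_vector(w, r, top_w, top_r)
           for w, r in zip(scene_words, scene_rhythms)]
```

Using relative rather than absolute frequencies keeps scenes of different lengths comparable, which matters when the samples range from short exchanges to long scenes.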

Given the possible differences in the styles of authors in different periods, the researchers used scenes from other plays of the same period (such as The Tempest and Coriolanus) as training samples. Training samples were also collected for possible authors.

In the end, 53 Shakespeare training samples, 90 Fletcher training samples, and 46 Massinger training samples were collected. Cross-validation was also used to estimate the model's accuracy.

After training, the model was run on the text of Henry VIII, combining the analysis of vocabulary and rhythm to determine which authors were involved in writing the play and what each contributed.

The final results show that this is a very reliable criterion for distinguishing the styles of the two authors. In particular, the combined model using common words and common rhythms identifies the styles of the three authors with an accuracy above 96%.
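The article does not describe the cross-validation setup, so the following is only a generic sketch of the idea: hold out part of the labeled samples, train on the rest, and repeat until every sample has been tested once. To stay self-contained it uses a nearest-centroid classifier as a stand-in for the SVM. The feature rows, the fold count, and all function names are hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]  # every k-th sample, offset by the fold number
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def fit_centroids(rows, labels):
    """Stand-in classifier: the average feature vector of each author."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in grouped.items()}


def nearest_centroid(centroids, row):
    """Predict the author whose centroid is closest to the row."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def cross_validated_accuracy(rows, labels, k):
    """Fraction of held-out samples predicted correctly across all folds."""
    correct = 0
    for train, test in k_fold_splits(len(rows), k):
        model = fit_centroids([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(nearest_centroid(model, rows[i]) == labels[i] for i in test)
    return correct / len(rows)


# Hypothetical two-feature samples with known authors.
rows = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.7, 0.3], [0.3, 0.7]]
labels = ["Shakespeare", "Fletcher"] * 3
accuracy = cross_validated_accuracy(rows, labels, k=3)
```

Because every sample is scored by a model that never saw it during training, the resulting accuracy is a fairer estimate than one measured on the training data itself.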

The classifier's results for 30 samples from different scenes are more fine-grained than the most authoritative existing attribution (the last column)

When applied to the analysis of Henry VIII, the results clearly showed that both authors were involved. Another purported playwright, Massinger, showed no involvement at the algorithmic level.

The new method refines the author of each section

To get a more reliable picture of each author's share, beyond simply attributing whole scenes, Plecháč used an analytical method called rolling attribution, which determines the probability that a specific piece of text belongs to a certain author.

Rolling attribution is a technique for cases involving mixed authorship. In rolling attribution, instead of classifying the entire text or its logical parts (chapters, scenes, etc.), the classification task is performed on fixed-length overlapping sections of it.
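The windowing idea can be sketched as follows. The window size, the step, and the `toy_classifier` rule are hypothetical stand-ins: in the study the label for each window would come from the trained model, not from a hand-written rule.

```python
def rolling_windows(lines, size, step):
    """Return (start, window) pairs of `size` consecutive lines, `step` apart."""
    last_start = max(len(lines) - size, 0)
    return [(start, lines[start:start + size])
            for start in range(0, last_start + 1, step)]


def rolling_attribution(lines, size, step, classify):
    """Label every overlapping window with the author returned by `classify`."""
    return [(start, classify(window))
            for start, window in rolling_windows(lines, size, step)]


def toy_classifier(window):
    """Hypothetical rule for illustration only: compare counts of two pronouns."""
    tokens = " ".join(window).split()
    return "Fletcher" if tokens.count("ye") > tokens.count("you") else "Shakespeare"


# Hypothetical six-line text whose style changes halfway through.
lines = [
    "you are the king",
    "you speak of peace",
    "I hear you well",
    "ye know it well",
    "I tell ye true",
    "ye shall not go",
]

result = rolling_attribution(lines, size=3, step=1, classify=toy_classifier)
```

Because neighbouring windows share most of their lines, the predicted label changes gradually as the window slides across a boundary, which is what makes it possible to locate a change of author inside a scene rather than only between scenes.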

Rolling attribution determines the composition of the authors' other works
Highly consistent with the actual situation

This method uses the concept of moving windows and combines it with standard supervised classification techniques. It aims to evaluate the style differences between discrete text samples to test the consistency of their text styles.

The results show that the rolling attribution method combined with lexical features is very reliable: the estimated accuracy of rolling attribution is as high as 0.9977 when distinguishing Shakespeare from Fletcher.

Specific to the author division and credibility of each chapter

Using this method, the study could determine how likely each passage was to belong to a given author. The figure above clearly shows which parts Shakespeare and Fletcher each completed. The conclusion: Shakespeare and Fletcher each wrote nearly half of the play.

AI is gearing up for success in literature

Using AI algorithms to solve the mystery of who wrote a famous work is very valuable for literary researchers and enthusiasts. It also offers a data-driven perspective on such questions.

Of course, beyond author identification and the detection of ghostwriting or plagiarism, similar AI methods can also be combined with technologies such as GPT-2 to generate works in a certain style, which may help restore works that have been lost in the course of history.

Applied to fields such as music and painting, the same approach could be used not only to determine who created a work, but also to create new works in the style of known artists.

With this in mind, it seems that the day when AI becomes a great writer may be just around the corner.