Microsoft Mahjong AI Paper Released, Revealing Technical Details for the First Time

Remember Suphx, the "Mahjong God" AI that Microsoft released in August last year? The research team recently published an updated paper on arXiv that goes further into the technology behind it.
On August 29, 2019, Microsoft released a mahjong AI called Suphx (Super Phoenix). On a professional mahjong competition platform, Suphx's strength surpassed the average level of top human players.
Once released, Suphx attracted widespread attention, not only from the artificial intelligence community but also from many mahjong enthusiasts, who came to watch and discuss it.

Some say the system is more complex than AlphaGo, which defeated professional Go players, and it has been hailed as the "strongest Japanese Mahjong artificial intelligence."
Today, the system's development team published a paper on arXiv, "Suphx: Mastering Mahjong with Deep Reinforcement Learning," which explains the technology behind Suphx in more depth.

Paper address: https://arxiv.org/pdf/2003.13590.pdf
Suphx keeps getting stronger: it now outranks 99.99% of players
As we reported previously, Suphx uses deep reinforcement learning. It learned from 5,000 games and went on to defeat many players on the Japanese professional mahjong platform Tenhou, reaching 10 dan, the highest rank attainable in the platform's expert room.

How was such a powerful Mahjong AI created? A research team from Microsoft Research Asia, Kyoto University, University of Science and Technology of China, Tsinghua University, and Nankai University gave an in-depth introduction in the latest version of the paper.
The paper also shows that Suphx has improved with further learning. On the Tenhou platform, which has more than 350,000 players, it is officially rated above 99.99% of players. This is the first time a computer program has surpassed most of the top human players in mahjong.
Five models plus reinforcement learning make the "Mahjong God" AI
Suphx contains a series of convolutional neural networks. It learns five models to handle different scenarios: the discard model, the Riichi model, the Chow model, the Pong model, and the Kong model.

In addition, Suphx uses a rule-based winning model to decide whether to declare a win and move on to the next round. It checks whether a winning hand can be completed with a tile discarded by another player or with a tile drawn from the wall.
Suphx's training reportedly proceeds in three steps.
First, its five models are trained on logs of top human players collected from the Tenhou platform.
Next, the system is fine-tuned through self-play reinforcement learning, using a CPU-based mahjong simulator and GPU-based inference engines to generate trajectories.
Finally, during online games, run-time policy adaptation uses the new observations from the current round to make the system perform better.
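To make the win check described above concrete, here is a minimal sketch in Python. It is not Suphx's actual rule set: it assumes a fully concealed hand encoded as 34 tile counts (indices 0-26 for the three suits, 27-33 for honor tiles), recognizes only the standard "four melds plus one pair" shape, and ignores special hands and scoring requirements. The function names are hypothetical.

```python
from typing import List


def _can_form_melds(counts: List[int], start: int = 0) -> bool:
    """Return True if the remaining tiles split entirely into triplets and runs."""
    i = start
    while i < 34 and counts[i] == 0:
        i += 1
    if i == 34:
        return True  # nothing left to group
    # Option 1: the lowest remaining tile is part of a triplet.
    if counts[i] >= 3:
        counts[i] -= 3
        ok = _can_form_melds(counts, i)
        counts[i] += 3
        if ok:
            return True
    # Option 2: the lowest remaining tile starts a run (suited tiles only,
    # and the run must not cross a suit boundary).
    if i < 27 and i % 9 <= 6 and counts[i + 1] > 0 and counts[i + 2] > 0:
        for k in range(3):
            counts[i + k] -= 1
        ok = _can_form_melds(counts, i)
        for k in range(3):
            counts[i + k] += 1
        if ok:
            return True
    return False


def is_standard_win(counts: List[int]) -> bool:
    """Check a 14-tile concealed hand for the four-melds-plus-pair shape."""
    if len(counts) != 34 or sum(counts) != 14:
        return False
    for i in range(34):
        if counts[i] >= 2:  # try this tile as the pair
            counts[i] -= 2
            ok = _can_form_melds(counts)
            counts[i] += 2
            if ok:
                return True
    return False


def can_win_with(hand13: List[int], tile: int) -> bool:
    """Would adding `tile` (a discard or a wall draw) complete the hand?"""
    counts = list(hand13)
    counts[tile] += 1
    return is_standard_win(counts)
```

The same `can_win_with` call covers both cases in the text, since a tile discarded by an opponent and a tile drawn from the wall are checked the same way in this simplified version.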

Since opponents' information is hidden in mahjong, Suphx uses a technique called oracle guiding to make reinforcement learning more effective. During self-play training, hidden information is used to guide the direction of model training, which strengthens the model's understanding of the visible information and helps it find an effective basis for its decisions.
Evaluation: 5,760 games, a 10 dan record
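One way to picture this idea is as a mask over the hidden features that is gradually closed as training goes on, so the model starts with full information and ends up relying only on what a real player can see. The sketch below is an illustration under assumptions, not the team's implementation: the linear decay schedule, the function names, and the flat feature lists are all hypothetical.

```python
import random
from typing import List, Sequence


def oracle_keep_probability(step: int, decay_steps: int) -> float:
    """Probability of keeping each hidden feature, decayed linearly from 1 to 0.

    The linear schedule is an assumption made for this sketch.
    """
    if decay_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / decay_steps)


def guided_features(
    visible: Sequence[float],
    hidden: Sequence[float],
    keep_prob: float,
    rng: random.Random,
) -> List[float]:
    """Concatenate visible features with randomly masked hidden features.

    Each hidden feature is kept with probability `keep_prob` and zeroed
    otherwise, so at keep_prob == 0 the input carries visible information only.
    """
    masked_hidden = [h if rng.random() < keep_prob else 0.0 for h in hidden]
    return list(visible) + masked_hidden


rng = random.Random(0)
visible = [1.0, 0.0, 1.0]   # e.g. the player's own tiles (toy values)
hidden = [0.0, 1.0]         # e.g. an opponent's tiles (toy values)
for step in (0, 500, 1000):
    p = oracle_keep_probability(step, decay_steps=1000)
    features = guided_features(visible, hidden, p, rng)
```

At step 0 the model sees everything; by the final step the hidden part of the input is all zeros, which matches what the agent has available in a real game.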
Before the experiments, the team trained each model for two days on 1.5 million hands using 44 GPUs (four Nvidia Titan XP cards for the parameter server and 40 Tesla K80 cards for the self-play workers).
The team evaluated Suphx on 20 Nvidia Tesla K80 GPUs. To reduce the variance of the stable rank, they drew 1,000 random samples of 800,000 games each from a dataset of more than 1 million mahjong games.
The results show that after playing more than 5,760 games against human players on the Tenhou platform, Suphx reached a record of 10 dan, a rank that only about 180 players have ever attained. Its stable rank is 8.74, compared with 7.4 for top human players.
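The repeated sampling described above can be sketched in a few lines of Python. This is only an illustration of the resampling step: the stable-rank formula itself is not reproduced here, so the example uses the average finishing place as a stand-in metric, and the function name, toy data, and sample sizes are assumptions.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def subsample_metric(
    values: Sequence[float],
    sample_size: int,
    n_repeats: int,
    metric: Callable[[Sequence[float]], float],
    seed: int = 0,
) -> Tuple[float, float]:
    """Evaluate `metric` on repeated random subsamples (without replacement).

    Returns the mean of the estimates and their standard deviation.
    Requires n_repeats >= 2 so the standard deviation is defined.
    """
    rng = random.Random(seed)
    estimates = [
        metric(rng.sample(list(values), sample_size)) for _ in range(n_repeats)
    ]
    return statistics.mean(estimates), statistics.stdev(estimates)


# Toy data: finishing places (1st to 4th) from 1,000 games.
placements = [1, 2, 3, 4] * 250
average, spread = subsample_metric(
    placements, sample_size=800, n_repeats=100, metric=statistics.mean
)
```

Averaging over many subsamples gives a steadier estimate than a single pass over the data, and the spread of the estimates indicates how much the metric varies from sample to sample.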

Through continuous optimization, RL-2 finally achieved better performance
Interestingly, the researchers wrote that Suphx is "very strong" at defense, with a low deal-in rate of 10.06%, and that it has developed its own playing style: it is good at keeping safe tiles in hand and prefers to win with a half-flush.

Give up the six-pole in the basket because it is already on the table
In addition, the paper's co-authors wrote that most real-world problems, such as financial market forecasting and logistics optimization, share characteristics with mahjong, including complex operation and reward rules and imperfect information.
The authors believe that the techniques designed for mahjong in Suphx, including global reward prediction, oracle guiding, and run-time policy adaptation, have great potential and could be widely applied in the future to help solve complex real-world problems.
After reading this, are you eager to try it yourself? The Tenhou mahjong platform is at https://tenhou.net/. Let's play a game together!