HyperAI

Microsoft Mahjong AI Paper Released, Revealing Technical Details for the First Time

5 years ago
神经小兮

Remember Suphx, the "Que Shen AI" that Microsoft released in August last year? The research team recently published an updated paper on arXiv that describes the technology behind Suphx in more detail.

On August 29, 2019, Microsoft released a mahjong AI called Suphx (Super Phoenix). On a professional mahjong competition platform, Suphx's strength surpassed the average level of top human players.

Once released, Suphx attracted widespread attention, both in the field of artificial intelligence and among the many mahjong enthusiasts who came to watch and discuss it. (For a recap, see our earlier article, "The Artificial Intelligence of the Hu Family Is Coming.")

The number and average size of information sets of Mahjong exceed those of Bridge, Texas Hold'em, and Go.

Some say the system is more complex than AlphaGo, which defeated professional Go players, and it has been hailed as the "strongest Japanese Mahjong artificial intelligence."

Today, the system's development team published a paper on arXiv, "Suphx: Mastering Mahjong with Deep Reinforcement Learning," which explains the technology behind Suphx in more depth.

Suphx: Mastering Mahjong with Deep Reinforcement Learning
Paper address: https://arxiv.org/pdf/2003.13590.pdf

Suphx keeps getting stronger: it has surpassed 99.99% of players

As we reported previously, Suphx used deep reinforcement learning to learn from 5,000 games and gain experience, then defeated many mahjong players on the Japanese professional mahjong platform Tenhou. It reached 10 dan, the highest rank attainable in the platform's "Te Shang Fang" room.

Suphx's rank on the Tenhou platform is much higher than that of other Mahjong AIs

How was such a powerful Mahjong AI created? A research team from Microsoft Research Asia, Kyoto University, University of Science and Technology of China, Tsinghua University, and Nankai University gave an in-depth introduction in the latest version of the paper.

The paper also shows that Suphx has improved with further training. On the Tenhou platform, which has more than 350,000 players, its official rating is above that of 99.99% of players. This is the first time a computer program has surpassed most of the top human players in mahjong.

Five major models and reinforcement learning create the Que Shen AI

Suphx is built from a series of convolutional neural networks. It learns five models to handle different scenarios: a discard model, a Riichi model, a Chow model, a Pong model, and a Kong model.

The discard model (top) and the architecture of the other four models (bottom)

In addition to these learned models, Suphx uses a rule-based model to decide whether to declare a winning hand and win the round. It checks whether a winning hand can be completed with a tile discarded by another player or with a tile drawn from the wall.
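To make the division of labor concrete, here is a minimal sketch of how one decision point might be routed between the rule-based win check and the learned models. It is only an illustration: the function name, the state dictionary, the score threshold, and the order of the checks are all assumptions for this example and are not taken from the article.

```python
from typing import Callable, Dict, List

# Hypothetical types: a "model" here is any function that maps a game
# state to a score between 0 and 1. The real models are neural networks.
Model = Callable[[dict], float]


def choose_action(state: dict,
                  can_win: Callable[[dict], bool],
                  models: Dict[str, Model],
                  legal_calls: List[str],
                  threshold: float = 0.5) -> str:
    """Pick an action for one decision point (illustrative only).

    can_win     -- rule-based check, no learning involved
    models      -- learned scorers keyed by "riichi", "chow", "pong", "kong"
    legal_calls -- the declarations the rules allow in this state
    threshold   -- assumed cut-off for accepting a declaration
    """
    # 1. Assumption: the rule-based win check is consulted first.
    if can_win(state):
        return "win"
    # 2. Score every legal declaration with its own learned model.
    best = max(legal_calls, key=lambda call: models[call](state), default=None)
    if best is not None and models[best](state) >= threshold:
        return best
    # 3. Otherwise the discard model would choose which tile to throw.
    return "discard"
```

With stub models such as `{"pong": lambda s: 0.9, "chow": lambda s: 0.4}`, a state where both calls are legal yields "pong", while a state where only Chow is legal falls through to "discard".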

It is reported that the training process of Suphx is divided into three steps.

First, its five models are trained with supervised learning on logs of top human players collected from the Tenhou platform.

The system is then fine-tuned through self-play reinforcement learning, using CPU-based mahjong simulators and GPU-based inference engines to generate trajectories.

Finally, during online games, run-time policy adaptation uses the new observations from the current round to make the system perform even better.

Distributed reinforcement learning system in Suphx

Because opponents' information is hidden in Mahjong, Suphx uses a technique called oracle guiding (also translated as "prophet guidance") to make reinforcement learning more effective. During self-play training, hidden information is used to guide the direction of model training, which strengthens the model's understanding of the visible information and helps it find an effective basis for decisions.

Evaluation: 5,760 games, 10 dan
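One way to picture this idea is as a feature mask: early in training the model sees the hidden features in full, and they are then dropped out more and more often until only visible information remains. The sketch below illustrates that schedule. The linear decay, the function names, and the flat feature lists are assumptions made for this example; the article does not specify them.

```python
import random
from typing import List, Sequence


def keep_probability(step: int, total_steps: int) -> float:
    """Assumed schedule: decay linearly from 1 (all hidden info) to 0 (none)."""
    return max(0.0, 1.0 - step / total_steps)


def mask_hidden(hidden: Sequence[float], keep_prob: float,
                rng: random.Random) -> List[float]:
    """Zero out each hidden-information feature with probability 1 - keep_prob."""
    return [x if rng.random() < keep_prob else 0.0 for x in hidden]


def build_input(visible: Sequence[float], hidden: Sequence[float],
                step: int, total_steps: int, rng: random.Random) -> List[float]:
    """Concatenate visible features with the (partly masked) hidden features."""
    keep_prob = keep_probability(step, total_steps)
    return list(visible) + mask_hidden(hidden, keep_prob, rng)
```

At step 0 the model input contains every hidden feature; at the final step the hidden part is all zeros, so the trained model no longer depends on information it will not have in a real game.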

Before the experiments, the team trained each model for two days using 1.5 million hands on 44 GPUs (four Nvidia Titan XP GPUs for the parameter server and 40 Tesla K80 GPUs for the self-play workers).

The team evaluated Suphx on 20 Nvidia Tesla K80 GPUs. To reduce the variance of the stable rank, they randomly sampled 800,000 games from a dataset of more than 1 million games, and repeated this sampling 1,000 times.

The evaluation shows that after playing more than 5,760 games against human players on the Tenhou platform, Suphx reached a record 10 dan, a level that only about 180 players have ever reached. Its stable rank is 8.74 (the highest among human players is 7.4).
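The repeated-sampling procedure can be sketched in a few lines. Note two assumptions in this example: the stable-rank formula, (5 × first places + 2 × second places) / fourth places − 2, is the one commonly quoted for Tenhou's expert room and is not stated in the article, and the function names are made up for illustration.

```python
import random
import statistics
from typing import List, Tuple


def stable_rank(placements: List[int]) -> float:
    """Assumed expert-room formula: (5 * n1 + 2 * n2) / n4 - 2.

    placements holds the finishing position (1 to 4) of each game.
    """
    n1 = placements.count(1)
    n2 = placements.count(2)
    n4 = placements.count(4)
    if n4 == 0:
        raise ValueError("stable rank is undefined without any 4th places")
    return (5 * n1 + 2 * n2) / n4 - 2


def resampled_stable_rank(placements: List[int], sample_size: int,
                          repeats: int, seed: int = 0) -> Tuple[float, float]:
    """Draw `repeats` random subsets (without replacement) and return the
    mean and standard deviation of their stable ranks. Needs repeats >= 2."""
    rng = random.Random(seed)
    ranks = [stable_rank(rng.sample(placements, sample_size))
             for _ in range(repeats)]
    return statistics.mean(ranks), statistics.stdev(ranks)
```

In the setting described above, `sample_size` would be 800,000 and `repeats` would be 1,000. Averaging over many subsets gives a steadier estimate than a single run of games, and the standard deviation shows how much the figure moves between subsets.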

Reinforcement learning agent final stable ranking statistics
Through continuous optimization, RL-2 finally achieved better performance 

Interestingly, the researchers wrote that Suphx's defense was "very strong," with a very low deal-in rate of 10.06%. It also developed its own playing style: it keeps safe tiles in its hand and favors winning with a half flush.

AI players (South) will choose to play conservatively
Give up the six-pole in the basket because it is already on the table

In addition, the paper's co-authors wrote that most real-world problems, such as financial market forecasting and logistics optimization, share characteristics with Mahjong, including complex operation and reward rules and imperfect information.

The authors believe that the techniques designed for Suphx, including global reward prediction, oracle guiding ("prophet guidance"), and run-time policy adaptation, have great potential and could be widely applied in the future to help solve complex real-world problems.

After reading this, are you eager to give it a try? The Tenhou mahjong platform is at https://tenhou.net/. Let's play a game together!
