Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity

The surge of Mixture of Experts (MoE) in Large Language Models promises a small execution cost for a much larger model parameter count and learning capacity, because only a small fraction of parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and inherently balances the expert workload better than conventional MoE. It constrains tokens to activate an equal number of experts within each predefined expert group. When model execution is distributed across multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly in the inference phase. Further, we build Pangu Pro MoE on Ascend NPUs, a sparse model based on MoGE with 72 billion total parameters, 16 billion of which are activated for each token. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. The inference performance of Pangu Pro MoE achieves 1148 tokens/s per card and can be further improved to 1528 tokens/s per card with speculative acceleration, outperforming comparable 32B and 72B dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization to make it a leading model within the sub-100B total parameter class, outperforming prominent open-source models like GLM-Z1-32B and Qwen3-32B.
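The abstract does not spell out the routing rule, but the core idea of MoGE, selecting an equal number of experts from each predefined expert group, can be illustrated with a minimal sketch. The snippet below is an illustrative PyTorch implementation, not the paper's code: the function name grouped_topk_routing, the group and expert counts in the example, and the choice to normalize the selected scores with a softmax are all assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

def grouped_topk_routing(router_logits: torch.Tensor,
                         num_groups: int,
                         k_per_group: int) -> torch.Tensor:
    """Hypothetical grouped routing: pick k_per_group experts in every group.

    router_logits: (num_tokens, num_experts) raw gating scores.
    Returns routing weights of shape (num_tokens, num_experts) that are zero
    everywhere except the k_per_group experts chosen inside each group, so
    every token activates exactly num_groups * k_per_group experts.
    """
    num_tokens, num_experts = router_logits.shape
    assert num_experts % num_groups == 0, "experts must split evenly into groups"
    group_size = num_experts // num_groups

    # View experts as (groups, experts-per-group) and take top-k inside each group.
    grouped = router_logits.view(num_tokens, num_groups, group_size)
    _, topk_idx = grouped.topk(k_per_group, dim=-1)

    # Mask out the unselected experts and normalize the remaining scores.
    mask = torch.zeros_like(grouped).scatter_(-1, topk_idx, 1.0)
    masked = grouped.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(masked.reshape(num_tokens, num_experts), dim=-1)
    return weights


# Example (assumed sizes): 64 routed experts in 8 groups, 1 pick per group.
logits = torch.randn(4, 64)
w = grouped_topk_routing(logits, num_groups=8, k_per_group=1)
print(w.shape, (w > 0).sum(dim=-1))  # every token activates exactly 8 experts
```

Because each group can be mapped to a different device, this per-group quota guarantees that every device processes the same number of activated experts per token, which is the load-balancing property the abstract attributes to MoGE.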