
Pangu Pro MoE: New Sparse Model Architecture Enhances Efficiency and Throughput on Ascend NPUs

14 days ago

A team of researchers led by Yehui Tang, Xiaosong Li, and colleagues has introduced Mixture of Grouped Experts (MoGE), a new approach to improving the efficiency of sparse large language models (LLMs). Traditional Mixture of Experts (MoE) architectures increase model capacity with little additional compute, since only a small fraction of the parameters is activated for each input token. In practice, however, some experts are activated far more often than others, and this imbalance makes parallel execution across multiple devices inefficient.

MoGE addresses the problem by grouping experts during the selection step: each token is constrained to activate an equal number of experts within every predefined group, which yields a much more uniform distribution of computational load. This balanced routing substantially boosts throughput, especially during inference, by keeping the parallel execution of experts across devices evenly loaded (a routing sketch appears at the end of this article).

The team implemented Pangu Pro MoE, a 72-billion-parameter sparse model built on MoGE, on Huawei's Ascend NPUs (Neural Processing Units). Of the total parameters, 16 billion are activated for each token, a configuration chosen to suit both training and inference. Extensive system-simulation studies were used to tune Pangu Pro MoE for the Ascend 300I Duo and 800I A2 devices, and these simulations showed that MoGE improves expert load balancing and overall execution efficiency.

In terms of performance, Pangu Pro MoE reached an inference speed of 1148 tokens per second per card; with additional speculative acceleration techniques, this rose to 1528 tokens per second per card, surpassing comparable dense models with 32 billion and 72 billion parameters. The cost-to-performance ratio for inference on the Ascend 300I Duo was also found to be excellent. The study further demonstrated that Ascend NPUs can train Pangu Pro MoE with a high degree of parallelism, making it a competitive model in the sub-100-billion-parameter class. Compared against prominent open-source models such as GLM-Z1-32B and Qwen3-32B, Pangu Pro MoE outperformed these baselines across a range of metrics.

The work highlights the potential of MoGE to significantly improve the efficiency of sparse LLMs, particularly when deployed on specialized hardware such as Ascend NPUs. By removing the imbalance in expert activation, MoGE offers a more robust foundation for distributed computing, which is crucial for scaling AI models in both training and inference. The findings suggest that MoGE can play a pivotal role in advancing AI capabilities while remaining economically viable.

Industry observers see MoGE as a significant advance, particularly for organizations looking to deploy large, efficient models on custom hardware. The balanced workload distribution and strong performance of Pangu Pro MoE on Ascend NPUs underline the growing importance of hardware-software co-optimization in the rapidly evolving AI landscape.
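To make the grouped routing idea concrete, here is a minimal NumPy sketch of group-wise top-k expert selection. It assumes experts are partitioned contiguously into equal-sized groups and uses illustrative settings (16 experts, 4 groups, 2 experts per group); the function name and configuration are hypothetical, and the actual router scoring and grouping in Pangu Pro MoE are described in the paper and may differ.

```python
import numpy as np

def moge_route(router_logits, num_groups, k_per_group):
    """Grouped top-k routing sketch: pick an equal number of experts per group.

    router_logits: (num_tokens, num_experts) raw router scores.
    Experts are assumed (for this sketch) to be partitioned contiguously
    into `num_groups` equal-sized groups, e.g. one group per device.
    Returns (indices, weights): selected global expert ids and gate weights.
    """
    num_tokens, num_experts = router_logits.shape
    group_size = num_experts // num_groups
    # Reshape to (tokens, groups, experts_per_group) to score within groups.
    grouped = router_logits.reshape(num_tokens, num_groups, group_size)

    # Top-k within every group, so each token activates exactly
    # num_groups * k_per_group experts -- the same number per group,
    # which keeps the load balanced across groups/devices.
    top_in_group = np.argpartition(-grouped, k_per_group - 1, axis=-1)[..., :k_per_group]

    # Convert group-local indices back to global expert ids.
    offsets = (np.arange(num_groups) * group_size)[None, :, None]
    indices = (top_in_group + offsets).reshape(num_tokens, -1)

    # Softmax over the selected experts' logits to get gating weights.
    picked = np.take_along_axis(router_logits, indices, axis=-1)
    weights = np.exp(picked - picked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return indices, weights

# Illustrative (hypothetical) configuration: 16 experts in 4 groups,
# top-2 per group => 8 experts activated per token, exactly 2 per group.
logits = np.random.randn(5, 16)
idx, w = moge_route(logits, num_groups=4, k_per_group=2)
print(idx.shape, w.shape)  # (5, 8) (5, 8)
```

In contrast to a global top-k over all experts, the per-group top-k above guarantees that every group receives the same number of activated experts per token, which is the load-balancing property the article attributes to MoGE.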
