Mixture of Experts (MoE)
Mixture of Experts (MoE) is a machine learning technique in which multiple expert networks (learners) are used to partition the problem space into homogeneous regions.
A significant advantage of mixture of experts (MoE) models is that they can be pre-trained effectively with far fewer computational resources than dense models, which means the model or dataset size can be scaled up significantly under the same compute budget. In particular, during pre-training an MoE model can usually reach the same quality level as a dense model much faster.
In the context of the Transformer model, MoE consists of two main parts:
- Sparse MoE layer: Replaces the traditional dense feedforward network (FFN) layer. The MoE layer contains several "experts" (e.g. 8), each of which is an independent neural network. These experts are usually FFNs, but they can also be more complex networks, or even MoE layers themselves, forming a hierarchical MoE.
- Gating network (router): Decides which tokens are assigned to which experts. For example, in the figure below, the token "More" is assigned to the second expert, while the token "Parameters" is assigned to the first expert. Note that a token can be assigned to more than one expert. How to route tokens to the appropriate experts efficiently is one of the key questions when working with MoE. The router consists of learnable parameters and is pre-trained together with the rest of the model.

Image source: MoE layer example from the Switch Transformers paper
The design idea of MoE is that, in a Transformer model, each FFN (feedforward network) layer is replaced by an MoE layer consisting of a gating network and several "experts", as sketched below.
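As a rough illustration of this design, here is a minimal sketch of a sparse MoE layer in PyTorch, assuming 8 FFN experts and top-2 routing. The names (SparseMoELayer, d_model, d_ff, n_experts, top_k) and the simple per-expert loop are illustrative choices, not the implementation of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal sketch of a sparse MoE layer: a router plus several FFN experts."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating network (router): one linear layer producing a score per expert.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is an independent feedforward network (FFN).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (n_tokens, d_model) -- tokens already flattened across batch and sequence.
        scores = F.softmax(self.router(x), dim=-1)                # (n_tokens, n_experts)
        top_w, top_idx = torch.topk(scores, self.top_k, dim=-1)   # per-token expert choices
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)           # renormalize chosen weights

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_mask = (top_idx == e)                           # which tokens picked expert e
            token_ids, slot = token_mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            # Run expert e only on its assigned tokens and add the weighted result back.
            out[token_ids] += top_w[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

tokens = torch.randn(10, 512)      # 10 tokens with d_model = 512
layer = SparseMoELayer()
print(layer(tokens).shape)         # torch.Size([10, 512])
```

Production implementations replace the per-expert Python loop with batched dispatch and add mechanisms such as expert capacity limits and load-balancing losses, but the routing idea is the same: a softmax over expert scores followed by a top-k selection for each token.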
Challenges of Mixture of Experts (MoE)
Although mixture of experts (MoE) models offer significant advantages, such as more efficient pre-training and faster inference than dense models with the same total parameter count, they also come with some challenges:
- Training challenges: Although MoEs enable more compute-efficient pre-training, they have historically struggled to generalize during fine-tuning and are prone to overfitting.
- Inference challenges: Although an MoE model may have a very large number of parameters, only a fraction of them are used during inference, which makes it faster than a dense model with the same total parameter count. However, all parameters still have to be loaded into memory, so the memory requirements are very high. Taking an MoE such as Mixtral 8x7B as an example, enough VRAM is needed to hold a dense model with 47B parameters. The total is 47B rather than 8 x 7B = 56B because in the MoE model only the FFN layers are treated as independent experts, while the rest of the model's parameters are shared. In addition, since only two experts are used per token, the inference speed (measured in FLOPs) is comparable to that of a 12B model (rather than a 14B one): although it performs the matrix multiplications of two 7B experts, the shared layers are computed only once.
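As a back-of-the-envelope check of these numbers, the sketch below derives an approximate shared/expert parameter split from the 7B-per-branch and 47B-total figures quoted above; the derived split is an estimate for illustration, not the exact Mixtral 8x7B configuration.

```python
# Back-of-the-envelope parameter accounting for an 8-expert, top-2 MoE
# in the spirit of Mixtral 8x7B. The 7B and 47B figures come from the text above;
# the split between shared and expert parameters is derived from them, so it is approximate.

n_experts = 8
top_k = 2
params_per_branch = 7e9      # "7B": shared layers + one expert's FFN
total_params = 47e9          # stated total for Mixtral 8x7B

# total = shared + n_experts * ffn,  branch = shared + ffn
# => solve the two equations for ffn and shared:
ffn_per_expert = (total_params - params_per_branch) / (n_experts - 1)
shared_params = params_per_branch - ffn_per_expert

# Parameters actually used per token (memory still needs all ~47B).
active_params = shared_params + top_k * ffn_per_expert

print(f"expert FFN params : {ffn_per_expert / 1e9:.1f}B")   # ~5.7B
print(f"shared params     : {shared_params / 1e9:.1f}B")    # ~1.3B
print(f"active per token  : {active_params / 1e9:.1f}B")    # ~12.7B
```

This matches the statement above: memory must hold all ~47B parameters, while per-token compute corresponds to roughly a 12-13B dense model.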