HyperAIHyperAI

Command Palette

Search for a command to run...

Mixture-of-Experts에서 보조 손실을 통한 전문가와 라우터의 결합

Ang Lv Jin Ma Yiyuan Ma Siyuan Qiao

초록

Mixture-of-Experts (MoE) 모델은 라우터의 결정이 전문가(expert)의 능력과 잘 일치하도록 보장하는 명시적 제약이 부족하여, 최종적으로 모델 성능에 한계를 겪는다. 이를 해결하기 위해 우리는 라우터-전문가 결합(Expert-Router Coupling, ERC) 손실을 제안한다. 이는 라우터의 결정과 전문가의 능력 간을 강하게 결합하는 가벼운 보조 손실이다. 본 방법은 각 전문가의 라우터 임베딩을 그 전문가에게 할당된 토큰들에 대한 대표 토큰(proxy token)으로 간주하고, 변형된 라우터 임베딩을 전문가를 통해 전파하여 내부 활성화를 얻는다. ERC 손실은 이러한 활성화에 대해 두 가지 제약을 강제한다: (1) 각 전문가는 자신에 해당하는 대표 토큰에 대해 다른 전문가의 대표 토큰보다 더 높은 활성화를 보여야 한다. (2) 각 대표 토큰은 해당 전문가에 대해 다른 전문가보다 더 강한 활성화를 유도해야 한다. 이러한 제약은 라우터 임베딩이 각각의 전문가 능력을 정확히 대표하도록 보장하며, 동시에 각 전문가가 실제로 라우팅된 토큰들만을 전문적으로 처리하도록 한다. ERC 손실은 계산적으로 효율적이며, 전문가 수 n에 대해 n²개의 활성화만을 대상으로 작동하므로, 배치 크기와 무관한 고정된 비용을 갖는다. 이는 이전의 결합 방법과 달리 토큰 수(일반적으로 배치당 수백만 개)에 따라 증가하는 비용을 가지지 않는다. 3B에서 15B 파라미터 규모의 사전 훈련된 MoE-LLM을 대상으로 수조 개의 토큰에 대한 광범위한 분석을 수행한 결과, ERC 손실의 효과를 입증하였다. 더불어 ERC 손실은 훈련 중 전문가의 특화 수준에 대해 유연한 제어와 정량적 추적을 가능하게 하며, MoE 모델의 동작에 대한 귀중한 통찰을 제공한다.

One-sentence Summary

The authors from Renmin University of China and ByteDance Seed propose a lightweight expert-router coupling (ERC) loss that enforces alignment between router decisions and expert capabilities by using perturbed router embeddings as proxy tokens, ensuring each expert specializes in its assigned tokens through dual activation constraints—outperforming prior methods in computational efficiency and enabling fine-grained tracking of expert specialization in MoE-LLMs up to 15B parameters.

Key Contributions

  • Mixture-of-Experts (MoE) models suffer from weak alignment between router decisions and expert capabilities, leading to suboptimal token routing and hindered specialization, which limits overall performance despite their efficiency advantages.
  • The proposed expert-router coupling (ERC) loss introduces a lightweight, n2n^2n2-cost auxiliary loss that enforces two key constraints: each expert must activate more strongly on its own proxy token (derived from perturbed router embeddings) than on others, and each proxy token must activate its corresponding expert most strongly, thereby tightly coupling router representations with expert capabilities.
  • Extensive pre-training on 3B to 15B parameter MoE-LLMs using trillions of tokens demonstrates that ERC loss improves downstream performance while maintaining low training overhead, and enables quantitative tracking and flexible control of expert specialization levels during training.

Introduction

The authors address a key limitation in Mixture-of-Experts (MoE) language models: the weak coupling between router decisions and expert capabilities, which can lead to suboptimal expert utilization and hinder model performance. This decoupling often results in poor specialization and inefficient resource allocation during inference. To overcome this, the authors introduce an expert-router coupling (ERC) loss that tightly aligns router parameters with their corresponding experts during training. The ERC loss enhances downstream task performance with minimal additional training cost and provides deeper insights into expert specialization, offering a valuable tool for future MoE model research.

Method

The authors leverage a novel auxiliary loss, termed expert-router coupling (ERC) loss, to address the lack of explicit constraints ensuring alignment between router decisions and expert capabilities in Mixture-of-Experts (MoE) models. The core of the ERC loss is a three-step process that operates on the router's parameter matrix, treating each row as a cluster center representing a token cluster routed to a specific expert. This framework is illustrated in the accompanying diagram.

The first step involves generating a perturbed proxy token for each expert. Specifically, each router parameter vector R[i]R[i]R[i] is augmented with bounded random noise δi\delta_iδi to produce a proxy token R~[i]=R[i]δi\tilde{R}[i] = R[i] \odot \delta_iR~[i]=R[i]δi. This noise is modeled as a multiplicative uniform distribution, ensuring the proxy token generalizes to the tokens assigned to its corresponding expert while remaining within the same cluster. The second step processes each of these nnn proxy tokens through all nnn experts. The intermediate activation norm from each expert jjj given input R~[i]\tilde{R}[i]R~[i] is computed, forming an n×nn \times nn×n matrix MMM, where M[i,j]=R~[i]WgjM[i,j] = \| \tilde{R}[i] \cdot W_g^j \|M[i,j]=R~[i]Wgj. This step is designed to be computationally efficient, operating on n2n^2n2 activations, which is independent of the batch size.

The third and final step enforces expert-router coupling by applying two constraints to the matrix MMM. For all iji \neq ji=j, the loss penalizes cases where the activation norm from expert jjj to proxy iii exceeds a scaled version of the activation norm from expert iii to its own proxy, and vice versa. This is formalized as M[i,j]<αM[i,i]M[i,j] < \alpha M[i,i]M[i,j]<αM[i,i] and M[j,i]<αM[i,i]M[j,i] < \alpha M[i,i]M[j,i]<αM[i,i], where α\alphaα is a scalar hyperparameter. The overall ERC loss is the mean of the positive parts of these violations, defined as:

LERC=1n2i=1njin(max(M[i,j]αM[i,i],0)+max(M[j,i]αM[i,i],0)).\mathcal{L}_{\mathrm{ERC}} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j \neq i}^{n} ( \max(M[i,j] - \alpha M[i,i], 0) + \max(M[j,i] - \alpha M[i,i], 0) ).LERC=n21i=1nj=in(max(M[i,j]αM[i,i],0)+max(M[j,i]αM[i,i],0)).

Minimizing this loss ensures that each expert exhibits its highest activation for its own proxy token (promoting expert specialization) and that each proxy token elicits its strongest activation from its corresponding expert (ensuring precise token routing). The ERC loss is designed to be lightweight, with a fixed computational cost of 2n2Dd2n^2Dd2n2Dd FLOPs, and does not introduce activation density beyond that of a vanilla MoE, making it a practical and efficient enhancement. The authors also demonstrate that the ERC loss provides a quantitative measure of expert specialization, as the hyperparameter α\alphaα directly controls the degree of specialization.

Experiment

  • ERC-loss-augmented MoE outperforms vanilla MoE and narrows the gap with AoE on multiple benchmarks, achieving significant and stable gains across tasks including ARC-Challenge, CommonsenseQA, MMLU, and others, with consistent improvements on both 3B and 15B parameter models.
  • On the 3B model, ERC loss achieves comparable load balancing to vanilla MoE (difference ~10⁻⁵) and maintains near-identical training throughput and memory usage, while AoE incurs 1.6× higher training time and 1.3× higher memory usage, making it impractical for scaling.
  • The ERC loss introduces negligible overhead—0.2–0.8% in real-world distributed training—due to its low FLOP cost (0.18–0.72% of base forward pass), confirmed by both theoretical analysis and empirical throughput measurements.
  • ERC loss enables effective expert specialization, as shown by t-SNE visualizations and quantitative metrics: increased clustering in expert parameters and a measurable correlation between the noise level ε and specialization degree controlled by α.
  • Ablation studies confirm that the random noise δ in the ERC loss is critical for generalization, and the loss cannot be replaced by separate constraints on routers or experts (e.g., router orthogonality), which yield limited gains even when router embeddings are already nearly orthogonal.
  • The optimal specialization level is not extreme; performance degrades with overly strict α, indicating a trade-off between specialization and collaboration, with optimal α depending on model scale (e.g., α=1 for n=64, α=0.5 for n=256).
  • ERC loss is effective at scale: on 15B models with n=256 and K=8, it improves performance across challenging benchmarks including MMLU-Pro, AGI-Eval, MATH, and GSM8K, despite AoE failing to train due to excessive cost.

The authors use the ERC loss to strengthen the coupling between routers and experts in a MoE model, and the table shows that this results in a significant reduction of the ERC loss across all layers, with values dropping to 0.00 when the loss is applied. This indicates that the model learns to align router and expert parameters effectively, as evidenced by the near-zero ERC loss in the +LERC column, while the baseline values remain non-zero.

The authors use the ERC loss to investigate expert specialization by varying the coupling strength parameter α, and the table shows that as α increases, the ERC loss decreases across all layers, indicating reduced specialization. This trend is consistent with the analysis that higher α values weaken the coupling constraint, leading to more homogeneous experts and lower performance gains.

The authors use the ERC loss to enhance expert-router coupling in MoE models, resulting in consistent performance improvements across multiple benchmarks. Results show that the MoE model augmented with ERC loss achieves higher accuracy than the vanilla MoE baseline, with gains observed in both 3B and 15B parameter models, while maintaining low computational overhead and effective load balancing.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
Mixture-of-Experts에서 보조 손실을 통한 전문가와 라우터의 결합 | 문서 | HyperAI초신경