HyperAIHyperAI

Command Palette

Search for a command to run...

ReMix: LLM 미세조정에서 LoRA 혼합물을 위한 강화 기반 라우팅

초록

저랭크 어댑터 (LoRA) 는 사전 학습된 모델에 학습 가능한 저랭크 행렬을 주입하여 새로운 작업에 적응시키는 파라미터 효율적 파인튜닝 기법입니다. LoRA 의 혼합 (Mixture-of-LoRAs) 모델은 각 층의 입력을 해당 층의 소수의 전문화된 LoRA 하위 집합으로 라우팅함으로써 신경망을 효율적으로 확장합니다. 기존 LoRA 혼합 모델의 라우터는 각 LoRA 에 학습 가능한 라우팅 가중치를 할당하여 라우터의 엔드 - 투 - 엔드 학습을 가능하게 합니다. 그러나 이러한 경험적 잠재력에도 불구하고, 실제 적용 시 라우팅 가중치가 LoRA 간에 극도로 불균형하게 분포하는 현상이 관찰됩니다. 구체적으로, 하나 또는 두 개의 LoRA 만이 라우팅 가중치를 지배하는 경우가 빈번합니다. 이는 본질적으로 유효한 LoRA 의 수를 제한하여 기존 LoRA 혼합 모델의 표현력을 심각하게 저해합니다.본 연구에서는 이러한 한계를 학습 가능한 라우팅 가중치의 본질적 특성에 기인한 것으로 분석하고, 라우터의 근본적 설계를 재검토합니다. 이러한 핵심 문제를 해결하기 위해 우리는 'LoRA 혼합을 위한 강화 라우팅 (Reinforcement Routing for Mixture-of-LoRAs, ReMix)'이라 명명한 새로운 라우터를 제안합니다. 우리의 핵심 아이디어는 학습 불가능한 라우팅 가중치를 사용하여 모든 활성 LoRA 가 동등하게 효과적이도록 보장하고, 어느 특정 LoRA 가 라우팅 가중치를 지배하지 않도록 하는 것입니다. 다만, 학습 불가능한 라우팅 가중치로 인해 기존 경사 하강법을 통해 라우터를 직접 학습할 수 없습니다. 이에 우리는 강화 학습에서 감독 손실 (supervision loss) 을 보상 (reward) 으로, 라우터를 정책 (policy) 으로 간주하는 '리인포스 레이크 - 원 - 아웃 (Reinforce Leave-One-Out, RLOO)' 기법을 활용하여 라우터를 위한 편향되지 않은 경사 추정기를 제안합니다. 본 경사 추정기는 학습 연산량을 확장하여 ReMix 의 예측 성능을 향상시키는 것도 가능하게 합니다.광범위한 실험 결과, 제안된 ReMix 는 활성화된 파라미터 수를 동등하게 유지하는 조건에서 기존 최첨단 파라미터 효율적 파인튜닝 기법들을 유의미하게 능가하는 성능을 입증하였습니다.

One-sentence Summary

Researchers from the University of Illinois Urbana-Champaign and Meta AI propose ReMix, a novel Mixture-of-LoRAs framework that replaces learnable routing weights with non-learnable ones to prevent router collapse. By employing an unbiased gradient estimator and RLOO technique, ReMix ensures all active adapters contribute equally, significantly outperforming existing parameter-efficient finetuning methods.

Key Contributions

  • Existing Mixture-of-LoRAs models suffer from routing weight collapse where learned weights often concentrate on a single adapter, effectively wasting the computation of other activated LoRAs and limiting the model's expressive power.
  • The authors propose ReMix, a novel router design that enforces constant non-learnable weights across all active LoRAs to ensure equal contribution and prevent any single adapter from dominating the routing process.
  • To enable training with these non-differentiable weights, the paper introduces an unbiased gradient estimator using the reinforce leave-one-out technique, which allows ReMix to significantly outperform state-of-the-art methods across diverse benchmarks under strict parameter budgets.

Introduction

Low-rank adapters (LoRAs) enable efficient fine-tuning of large language models by injecting trainable matrices into frozen weights, while Mixture-of-LoRAs architectures aim to further boost capacity by routing inputs to specialized subsets of these adapters. However, existing approaches that rely on learned routing weights suffer from a critical flaw where weights collapse to a single dominant LoRA, effectively wasting the computational resources of the other active adapters and limiting the model's expressive power. To resolve this, the authors introduce ReMix, a reinforcement routing framework that enforces equal contribution from all active LoRAs by using non-learnable constant weights and training the router via an unbiased gradient estimator based on the REINFORCE leave-one-out technique.

Method

The authors propose ReMix, a reinforcement routing method for Mixture-of-LoRAs designed to mitigate routing weight collapse. The method fundamentally alters the adapter architecture and training procedure to ensure diverse LoRA utilization.

In the adapter architecture, the router computes a categorical distribution q(l)q^{(l)}q(l) over the available LoRAs for a given layer input. Instead of using these probabilities as continuous weights, the model selects a subset of kkk LoRAs. Crucially, the routing weights for these activated LoRAs are set to a constant value ω\omegaω, while non-activated LoRAs receive zero weight. This design guarantees that the effective support size remains fixed at kkk, preventing the router from concentrating probability mass on a single LoRA. The layer output is then computed as the sum of the frozen model output and the weighted contributions of the selected LoRAs.

To train the router parameters, the authors address the non-differentiability of the discrete selection process by framing it as a reinforcement learning problem. The SFT loss serves as the negative reward signal. During finetuning, the model samples MMM distinct selections of LoRA subsets. For each selection, the SFT loss is calculated. These losses are then used to estimate the gradient for the router parameters via the RLOO gradient estimator. This estimator leverages the variance reduction technique of using the average loss across samples as a baseline. The gradient estimator is defined as: G^P(l):=1M1m=1M(L(Jm)L)P(l)logQ(Jm)\widehat{G}_{P^{(l)}} := \frac{1}{M-1} \sum_{m=1}^{M} (\mathcal{L}(\mathfrak{J}_m) - \overline{\mathcal{L}}) \nabla_{P^{(l)}} \log Q(\mathfrak{J}_m)GP(l):=M11m=1M(L(Jm)L)P(l)logQ(Jm) where L\overline{\mathcal{L}}L represents the average SFT loss across the MMM selections.

As shown in the figure below, the framework visualizes the ReMix architecture where the router generates selection probabilities that guide the activation of specific LoRA pools across multiple layers. The process involves generating multiple selections, computing the SFT loss for each, and aggregating these signals through the RLOO gradient estimator to update the router.

During inference, the authors employ a top-kkk selection strategy. Theoretical analysis shows that if the router is sufficiently trained, selecting the kkk LoRAs with the highest probabilities guarantees the optimal subset. This deterministic approach improves upon random sampling used during training.

Experiment

  • Analysis of existing Mixture-of-LoRAs methods reveals a critical routing weight collapse where only one LoRA dominates per layer, severely limiting model expressivity and rendering other LoRAs ineffective.
  • The proposed ReMix method consistently outperforms various baselines across mathematical reasoning, code generation, and knowledge recall tasks while maintaining superior parameter efficiency.
  • Comparisons with single rank-krkrkr LoRA demonstrate that ReMix successfully activates diverse LoRA subsets rather than relying on a fixed subset, validating its ability to leverage mixture capacity.
  • Ablation studies confirm that both the RLOO training algorithm and top-kkk selection mechanism are essential components for achieving peak performance.
  • Experiments show that ReMix benefits from scaling the number of activated LoRAs and increasing training compute via sampled selections, unlike deterministic baselines which cannot utilize additional compute resources.
  • The method exhibits robustness to different routing weight initialization schemes, maintaining stable performance regardless of the specific weight configuration used.

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp