What Do Near-Optimal Learning Rate Schedules Look Like?

Hiroki Naganuma Atish Agarwala Priya Kasimbeg George E. Dahl

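Abstract

A basic unanswered question in neural network training is: what is the best learning rate schedule shape for a given workload? The choice of learning rate schedule is a key factor in the success or failure of training, yet beyond including some amount of warmup and decay, there is no consensus on what makes a good schedule shape. To answer this question, we designed a search procedure that finds the best shapes within a parameterized schedule family. Our approach separates the schedule shape from the base learning rate, removing a factor that would otherwise dominate comparisons between schedules. We applied this search procedure to a variety of schedule families on three workloads: linear regression, image classification on CIFAR-10, and small-scale language modeling on Wikitext103. We showed that our search procedure generally finds near-optimal schedules. We found that warmup and decay are robust features of good schedules, and that commonly used schedule families are not optimal on these workloads. Finally, we explored how the output of the shape search depends on other optimization hyperparameters, and found that weight decay can have a strong effect on the optimal schedule shape. To our knowledge, these are the most comprehensive results to date on near-optimal schedule shapes for deep neural network training.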

One-sentence Summary

By proposing a shape search procedure that decouples learning rate schedule shape from base learning rate, researchers from Mila, Université de Montréal, and Google DeepMind demonstrate on linear regression, CIFAR-10 image classification, and WikiText-103 language modeling that warmup and decay are robust features of near-optimal schedules, that commonly used schedule families are suboptimal, and that weight decay strongly influences the optimal schedule shape.

Key Contributions

  • A search procedure is proposed that factors out the base learning rate to isolate schedule shape, enabling fair comparisons across schedule families.
  • Applied to linear regression, CIFAR-10, and WikiText-103, the search finds near-optimal schedules, revealing that warmup and decay are robust features while common schedule families are suboptimal.
  • Optimal schedule shape strongly depends on weight decay, indicating that learning rate schedule optimization cannot be decoupled from regularization hyperparameter choices.

Introduction

The learning rate schedule profoundly influences neural network training speed and final performance, yet practitioners usually pick a fixed functional form, such as linear or cosine decay, and tune only a few parameters like peak value and phase durations. There is little systematic understanding of how the schedule’s shape should adapt to a given workload. The authors address this gap by defining several parameterized schedule families, including flexible spline-based curves that can mimic and extend standard shapes, and by developing a search methodology on computationally lightweight proxy tasks to discover near-optimal schedules. Their experiments show that the best schedules naturally incorporate warmup and gradual decay even when the family does not enforce them, and that hyperparameters like weight decay significantly alter the ideal shape, offering a concrete step toward workload-aware schedule design.

Method

The authors define a learning rate schedule as a function $s(t) = \alpha \cdot \phi(t/T)$, where $\alpha$ is the base learning rate, $T$ is the training horizon, and $\phi$ is the schedule shape. To constrain the search space, they parameterize various schedule shape families, including CONSTANT, COSINE, GENERALIZED COSINE, SQUARE-ROOT DECAY, GENERALIZED REX, TWO-POINT SPLINE, TWO-POINT LINEAR, and SMOOTH NON-MONOTONIC. Most of these families incorporate linear warmup, while the SMOOTH NON-MONOTONIC family allows for completely general two-control-point splines without guaranteed monotonic decay.
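The decomposition into a base learning rate and a shape can be sketched in Python. The warmup-then-cosine-power shape below is only an illustrative guess at what a generalized cosine member might look like: the summary says only that the family has a tunable exponent, so the function names, the `warmup_frac` and `exponent` parameters, and the exact formula are assumptions, not the authors' definitions.

```python
import math


def generalized_cosine_shape(x, warmup_frac=0.1, exponent=1.0):
    """Assumed shape phi(x) on x = t/T in [0, 1].

    Linear warmup from 0 to 1 over the first `warmup_frac` of training,
    then a cosine decay from 1 to 0 raised to `exponent`.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if not 0.0 <= warmup_frac < 1.0:
        raise ValueError("warmup_frac must lie in [0, 1)")
    if warmup_frac > 0.0 and x < warmup_frac:
        return x / warmup_frac
    # Fraction of the decay phase completed so far, in [0, 1].
    u = (x - warmup_frac) / (1.0 - warmup_frac)
    cosine = 0.5 * (1.0 + math.cos(math.pi * u))
    return cosine ** exponent


def learning_rate(t, total_steps, base_lr, warmup_frac=0.1, exponent=1.0):
    """s(t) = alpha * phi(t / T): the base learning rate only rescales the shape."""
    return base_lr * generalized_cosine_shape(t / total_steps, warmup_frac, exponent)
```

Because `base_lr` enters only as a multiplier, two schedules with the same shape parameters differ by a constant factor at every step, which is what lets the shape be searched separately from the base learning rate.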

To find near-optimal schedules within these families, the authors employ a two-step search procedure. They define the optimal training loss for a parameterized shape $\phi_{\theta}$ and base learning rate $\alpha$ as:

$$\mathcal{J}(\theta, \alpha) := \underset{r \sim \mathcal{R}}{\text{median}} \left[ \min_{0 \leq t \leq T} L_{\text{train}}^{(r)}(\theta, \alpha, t) \right]$$

where the median is taken over the distribution of randomness $\mathcal{R}$ (e.g., weight initializations). The optimal shape parameters $\theta^{\star}$ are found by minimizing this objective over both $\theta$ and $\alpha$.
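As a minimal sketch of how this objective could be computed from logged runs, the helper below takes one training-loss curve per random seed, keeps the lowest loss reached along each curve, and returns the median of those values. The function name and the list-of-lists input format are assumptions made for illustration.

```python
from statistics import median


def objective(loss_curves):
    """Median over seeds of the minimum training loss reached during each run.

    `loss_curves` holds one sequence of training losses per seed, all produced
    with the same shape parameters theta and base learning rate alpha.
    """
    if not loss_curves:
        raise ValueError("need at least one loss curve")
    # Inner min over steps 0 <= t <= T for each seed r.
    best_per_seed = [min(curve) for curve in loss_curves]
    # Outer median over the seeds.
    return median(best_per_seed)
```

Taking the minimum along each run rewards the best point a schedule reaches rather than only its final step, and the median makes the score less sensitive to a single unlucky seed than a mean would be.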

The search step decouples the schedule parameters from the base learning rate. Schedule parameters are randomly sampled, and for each setting, the authors sweep over 16 base learning rates on a logarithmically spaced grid. Thousands of shapes are generated and scored using multiple PRNG seeds. Following the initial search, an evaluation step retrains the top $k$ schedules with 100 seeds to compute robust median scores and confidence intervals.
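The two steps can be sketched end to end on a toy problem. Everything here is a stand-in chosen for illustration: `toy_train` is a hypothetical noisy gradient descent on a one-dimensional quadratic rather than a real workload, the shape family and sampling ranges are assumed, the grid bounds are arbitrary, and confidence intervals are omitted. Only the overall structure (random shapes, a 16-point logarithmic sweep of base learning rates, re-evaluation of the top k with 100 seeds) follows the description above.

```python
import math
import random
from statistics import median


def shape(x, warmup_frac, exponent):
    # Assumed family: linear warmup, then cosine decay raised to a power.
    if warmup_frac > 0.0 and x < warmup_frac:
        return x / warmup_frac
    u = (x - warmup_frac) / (1.0 - warmup_frac)
    return (0.5 * (1.0 + math.cos(math.pi * u))) ** exponent


def toy_train(params, base_lr, seed, steps=50):
    """Hypothetical workload: noisy gradient descent on f(w) = 0.5 * w**2.

    Returns the lowest loss seen during the run.
    """
    rng = random.Random(seed)
    w = 1.0 + rng.gauss(0.0, 0.1)
    best = 0.5 * w * w
    for t in range(steps):
        lr = base_lr * shape(t / steps, **params)
        grad = w + rng.gauss(0.0, 0.5)  # true gradient plus noise
        w -= lr * grad
        best = min(best, 0.5 * w * w)
    return best


def log_grid(lo, hi, n):
    """n points from lo to hi, evenly spaced on a logarithmic scale."""
    return [lo * (hi / lo) ** (i / (n - 1)) for i in range(n)]


def score(params, base_lr, seeds):
    """Median over seeds of the best loss reached in each run."""
    return median([toy_train(params, base_lr, s) for s in seeds])


def shape_search(num_shapes=20, num_lrs=16, search_seeds=(0, 1, 2),
                 top_k=3, eval_seeds=range(100, 200), rng_seed=0):
    rng = random.Random(rng_seed)
    base_lrs = log_grid(1e-3, 1e1, num_lrs)
    candidates = []
    # Search step: sample shapes at random, sweep the base learning rate for each.
    for _ in range(num_shapes):
        params = {"warmup_frac": rng.uniform(0.0, 0.5),
                  "exponent": rng.uniform(0.25, 4.0)}
        best_score, best_lr = min((score(params, lr, search_seeds), lr)
                                  for lr in base_lrs)
        candidates.append((best_score, best_lr, params))
    candidates.sort(key=lambda c: c[0])
    # Evaluation step: re-score the top-k shapes on a larger, fresh set of seeds.
    results = [(score(params, lr, eval_seeds), lr, params)
               for _, lr, params in candidates[:top_k]]
    results.sort(key=lambda r: r[0])
    return results
```

Calling `shape_search()` returns a list of `(median_score, base_lr, shape_params)` tuples sorted from best to worst. Re-scoring on separate seeds matters because the search step picks the minimum over many noisy scores, which tends to be optimistic.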

The experimental results demonstrate that the base learning rate is the most critical factor for achieving good performance. Once a schedule incorporates both warmup and decay, tuning the base learning rate yields significantly larger gains than refining the specific schedule shape hyperparameters. Furthermore, the search consistently reveals that warmup and monotonic decay are fundamental features of effective learning rate schedules in deep learning, even when using flexible families like SMOOTH NON-MONOTONIC that do not enforce these properties by design.

Experiment

The evaluation compares learning rate schedule families across three small, optimization-limited workloads: synthetic linear regression, CIFAR-10 image classification, and WikiText-103 language modeling, using random search to find near-optimal shapes. The linear regression test validates the search methodology against a ground-truth optimum that has no warmup and a flat profile with a sharp late decay, while the deep learning workloads consistently require warmup followed by monotonic decay, with the base learning rate being the most critical hyperparameter. More flexible schedule families yield modest but meaningful improvements over standard cosine decay, though the Smooth Non-Monotonic family proves hard to optimize. Workload variation experiments show that weight decay strength meaningfully shifts optimal decay timing, while varying training horizon leads to gentler decay. Overall, the study confirms warmup and decay as fundamentally useful, demonstrates that task-optimal schedules differ sharply between convex and non-convex settings, and provides guidance for practical schedule tuning and search.

The study evaluated families of learning rate schedules on each workload. On the neural network workloads (CIFAR-10 and WikiText-103), warmup was beneficial across schedule families and workload variations, whereas on linear regression it was not useful. The optimal schedule shape therefore varied with the workload, and principles from convex optimization did not transfer to deep learning. Schedules with warmup and a flexible decay shape yielded small but significant improvements over constant or standard cosine schedules, in both train and test metrics. Generalized cosine decay, with a tunable exponent, achieved significant gains over standard cosine on CIFAR-10 and outperformed a cosine variant with a non-zero final learning rate. The two-point linear and two-point spline families were flexible enough to capture schedules very close to optimal, and differences between the top members of even more flexible families were small, suggesting diminishing returns from further complexity. These results encourage considering schedules beyond the popular cosine decay when tuning resources are available.

When searching for learning rate schedules with AdamW on CIFAR-10, a higher momentum (β₁) during the search phase generally leads to better schedules: larger β₁ in the schedule selection phase consistently reduces training error when evaluated at the same β₁. For a fixed selection β₁, however, lowering β₁ at evaluation time typically reduces training error further. The best combination is the schedule found at β₁=0.95 and evaluated at β₁=0.8.

The choice of the momentum parameter β₁ during both schedule search and evaluation significantly impacts WikiText-103 perplexity. Schedules selected with higher β₁ generally yield lower perplexity when evaluated at β₁ of 0.9, 0.95, or 0.975, but this trend reverses at evaluation β₁=0.8, where higher selection β₁ leads to sharply worse performance. The lowest perplexity is achieved when the schedule is searched and evaluated at β₁=0.95. Evaluating schedules at β₁=0.8 also produces the widest range of perplexities and the largest standard deviations, pointing to high uncertainty in low-momentum training runs.

The value of Adam's second-moment decay rate β₂ used during schedule selection also strongly influences final performance on CIFAR-10. For every fixed evaluation β₂, increasing the selection β₂ reduces the median training error, with the largest drop occurring when moving from β₂=0.9 to β₂=0.9995. The best combination selects schedules at β₂=0.9995 and evaluates them at β₂=0.9, achieving lower median training error than any setting where selection and evaluation β₂ match. When schedules are selected with a low β₂, error rises sharply as evaluation β₂ increases, while selection with a high β₂ keeps error relatively low across all evaluation conditions. Schedules tuned with a larger β₂ can therefore still perform well when training later uses a smaller β₂.

The choice of Adam's β₂ during schedule selection likewise affects both the quality and the stability of the resulting perplexity. Selecting schedules with a higher β₂ lowers the median perplexity, while selection at β₂=0.9 produces the widest spread of perplexity values across evaluation β₂ and large standard deviations, indicating that a low β₂ during tuning adds noise. The best perplexity is obtained when a schedule selected with β₂=0.9995 is evaluated with β₂=0.9, highlighting the benefit of mismatched selection and evaluation settings.

Using AdamW on CIFAR-10 and WikiText-103, the study evaluates learning rate schedule families and the impact of the β₁ and β₂ settings used during schedule search. Flexible schedules incorporating warmup and tunable decay shapes (e.g., generalized cosine, two-point spline) yield small but significant improvements over constant or standard cosine, with the optimal shape varying by workload. Selecting schedules at higher β₁ and β₂ generally produces better schedules. Evaluating at a lower value than was used for selection often gives the best result (β₁=0.8 on CIFAR-10, and β₂=0.9 for both training error and perplexity). The exception is β₁ on WikiText-103, where matched selection and evaluation at β₁=0.95 is best and evaluation at β₁=0.8 degrades sharply. Selection at high β₂ also keeps results comparatively stable across evaluation settings. Overall, tuning the schedule shape and exploiting this selection/evaluation mismatch offers practical gains, encouraging exploration beyond standard cosine schedules when resources allow.

