
Learning to Reason in 13 Parameters

John X. Morris, Niloofar Mireshghallah, Mark Ibrahim, Saeed Mahloujifar

Abstract

Recent work has shown that language models can learn to reason, often through reinforcement learning. Some studies even train dedicated low-rank parameterizations for reasoning, but standard LoRA cannot be scaled below the model dimension. We ask whether even a rank-1 LoRA is necessary for learning to reason, and propose TinyLoRA, a method that shrinks low-rank adapters down to as little as a single parameter. Under our new parameterization, we train an 8-billion-parameter Qwen2.5 model to 91% accuracy on GSM8K while training only 13 parameters (26 bytes total in bf16). We find that this trend generalizes: we recover 90% of the performance gains while training far fewer parameters across a suite of more demanding reasoning benchmarks such as AIME, AMC, and MATH500. Notably, we can only achieve such results with reinforcement learning: models trained with plain supervised fine-tuning (SFT) require much larger updates to reach the same performance.

One-sentence Summary

Researchers from FAIR at Meta, Cornell, and CMU propose TinyLoRA, enabling reasoning in 8B-parameter Qwen2.5 with just 13 trained parameters via RL—achieving 91% GSM8K accuracy—by exploiting RL’s information-dense updates, unlike SFT, and scaling low-rank adaptation to near-zero parameter regimes.

Key Contributions

  • TinyLoRA enables effective reasoning in large language models using as few as 13 trained parameters by scaling low-rank adapters below rank=1, achieving 91% accuracy on GSM8K with Qwen2.5-8B via reinforcement learning.
  • The method demonstrates consistent efficiency across challenging benchmarks like AIME and MATH500, recovering 90% of performance gains while training 1000x fewer parameters than conventional approaches, but only when using RL—not supervised finetuning.
  • Empirical results show that large models trained with RL require dramatically smaller parameter updates to reach high performance, revealing that reasoning capabilities can be unlocked with updates under 1KB, a scale previously considered insufficient.

Introduction

The authors leverage reinforcement learning to show that large language models can learn complex reasoning tasks with astonishingly few parameters—down to just 13 trainable parameters in some cases. Prior low-rank adaptation methods like LoRA typically operate at scales of 10K to 10M parameters and struggle to scale below model dimension, limiting their efficiency for extreme parameter constraints. TinyLoRA, their proposed method, enables effective adaptation at sub-kilobyte scales by exploiting the inherent low intrinsic dimensionality of overparameterized models under RL, outperforming supervised fine-tuning which requires 100–1000x more parameters to match performance. Their work demonstrates that RL, not SFT, unlocks this extreme efficiency—especially when applied to large backbones—challenging assumptions about how much parameter update is actually needed to teach reasoning.


Method

The authors leverage a parameter-efficient fine-tuning framework built upon low-rank adaptation techniques, introducing TinyLoRA as a method to drastically reduce the number of trainable parameters while preserving model performance. The core idea stems from the observation that even minimal-rank adaptations like LoRA-XS still require at least one parameter per module, which becomes prohibitive when scaling across many layers and attention/MLP components in large transformer architectures.

TinyLoRA redefines the low-rank update by replacing the trainable matrix $R \in \mathbb{R}^{r \times r}$ in LoRA-XS with a low-dimensional trainable vector $\mathbf{v} \in \mathbb{R}^{u}$, projected through a fixed random tensor $P \in \mathbb{R}^{u \times r \times r}$. The updated weight matrix becomes:

$$W' = W + U \Sigma \left( \sum_{i=1}^{u} v_i P_i \right) V^\top$$

where $U$, $\Sigma$, and $V$ are derived from the truncated SVD of the original frozen weight matrix $W$. This formulation allows each module to be adapted with only $u$ trainable parameters, independent of the model width $d$ or the rank $r$.
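A minimal PyTorch sketch of this parameterization, assuming a frozen `nn.Linear` base module and a zero-initialized $\mathbf{v}$; the class name `TinyLoRALinear` and the scaling of the random tensor $P$ are illustrative choices, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class TinyLoRALinear(nn.Module):
    """Sketch: adapt a frozen linear layer as W' = W + U @ Sigma @ (sum_i v_i * P_i) @ V^T."""

    def __init__(self, base_linear, r=4, u=1, shared_v=None):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():          # the backbone weight stays frozen
            p.requires_grad = False

        # Fixed rank-r truncated SVD of the frozen weight W (out_features x in_features).
        W = self.base.weight.data
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("U", U[:, :r])                  # (out, r)
        self.register_buffer("S", torch.diag(S[:r]))         # (r, r)
        self.register_buffer("Vh", Vh[:r, :])                 # (r, in)

        # Fixed random projection P and the only trainable parameters, v (u scalars).
        self.register_buffer("P", torch.randn(u, r, r) / (u * r) ** 0.5)  # scaling is a guess
        self.v = shared_v if shared_v is not None else nn.Parameter(torch.zeros(u))

    def forward(self, x):
        R = torch.einsum("u,urs->rs", self.v, self.P)        # r x r matrix from u scalars
        delta = self.U @ self.S @ R @ self.Vh                # low-rank additive update to W
        return self.base(x) + x @ delta.T
```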

To further minimize parameter count, the authors tie weights across modules. In standard transformer architectures such as LLaMA-3, LoRA is typically applied to seven distinct modules per layer (the query, key, value, and output projections in attention; the up, down, and gate projections in the MLP). Without sharing, even $u=1$ yields 560 parameters for an 80-layer model. By tying the vector $\mathbf{v}$ across modules, either within a layer or across the entire model, the total number of trainable parameters scales as $\mathcal{O}(nmu / n_{\text{tie}})$, where $n$ is the number of layers, $m$ the number of adapted modules per layer, and $n_{\text{tie}}$ the number of modules sharing a single $\mathbf{v}$. With full weight tying ($n_{\text{tie}} = nm$), the entire model can be fine-tuned with just $u$ parameters, potentially as few as one.
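Building on the `TinyLoRALinear` sketch above, full weight tying can be illustrated by creating the vector once and handing it to every adapted module; the toy four-layer stack and LLaMA-style module names below are stand-ins, not the paper's setup:

```python
# Toy stand-in for a transformer stack: 4 layers, 7 adapted linear modules each.
d, u = 64, 13
layers = nn.ModuleList([
    nn.ModuleDict({name: nn.Linear(d, d, bias=False)
                   for name in ["q_proj", "k_proj", "v_proj", "o_proj",
                                "up_proj", "down_proj", "gate_proj"]})
    for _ in range(4)
])

# Full weight tying: a single trainable vector shared by all 4 * 7 = 28 modules.
shared_v = nn.Parameter(torch.zeros(u))
for layer in layers:
    for name in list(layer.keys()):
        layer[name] = TinyLoRALinear(layer[name], r=4, u=u, shared_v=shared_v)

# Everything else is frozen or a fixed buffer, so the stack trains exactly u = 13 parameters.
print(sum(p.numel() for p in layers.parameters() if p.requires_grad))   # -> 13
```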

Refer to the parameter usage comparison per layer, which illustrates how TinyLoRA reduces trainable parameters relative to LoRA and LoRA-XS under varying configurations of rank, projection dimension, and weight tying.
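The rough per-module accounting behind that comparison can be reproduced in a few lines, treating every adapted module as a square $d \times d$ matrix for simplicity; the defaults below ($d = 8192$, 80 layers, 7 modules per layer) loosely match a 70B-class LLaMA, and the exact figures in the paper's comparison may differ:

```python
def trainable_params(method, d=8192, r=4, u=1, n_layers=80, n_modules=7, n_tie=1):
    """Back-of-the-envelope trainable-parameter counts (all modules treated as d x d)."""
    per_module = {
        "lora": 2 * d * r,      # A (d x r) and B (r x d) trained per module
        "lora-xs": r * r,       # only the small r x r matrix R is trained
        "tinylora": u,          # u scalars per module, shareable across n_tie modules
    }[method]
    total = n_layers * n_modules * per_module
    return total // n_tie if method == "tinylora" else total

print(trainable_params("lora"))                         # 36700160
print(trainable_params("lora-xs"))                      # 8960
print(trainable_params("tinylora"))                     # 560  (u=1, no sharing)
print(trainable_params("tinylora", u=13, n_tie=560))    # 13   (one vector for all modules)
```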

Experiment

  • Reinforcement learning (RL) enables dramatically smaller model updates than supervised finetuning (SFT), achieving strong math reasoning performance with as few as 13 parameters.
  • TinyLoRA, an ultra-low-rank variant, scales smoothly down to a single trained parameter and recovers 95% of full finetuning performance on GSM8K with under 100 parameters.
  • RL-based training (using GRPO) is uniquely effective in low-parameter regimes; SFT fails to match performance at comparable update sizes, indicating RL produces more information-dense updates.
  • Performance improves with model scale: larger models like Qwen-2.5-7B achieve near-full performance with fewer absolute parameters, suggesting trillion-scale models may be trainable with minimal updates.
  • Qwen models outperform LLaMA at small update sizes, possibly due to architectural or pretraining differences, requiring roughly 10x fewer parameters for equivalent gains.
  • Parameter-sharing strategies matter: tiled sharing (by depth) outperforms structured sharing (by module type), and fp32 precision yields better results than bf16/float16 despite larger size.
  • Ablations show diminishing returns with higher frozen rank; optimal TinyLoRA design favors maximizing per-module expressivity (higher u) before increasing parameter sharing (n_tie).
  • Findings are currently limited to math reasoning tasks; generalization to other domains like science or creative writing remains unverified.

The authors use reinforcement learning with TinyLoRA to finetune Qwen models on math reasoning tasks, achieving near-full-finetuning performance with as few as 13 to 196 parameters. Results show that smaller parameter updates are far more effective under RL than supervised finetuning, especially for larger models, which can reach high accuracy with minimal parameter changes. Performance scales smoothly with update size, and Qwen models consistently outperform others at low parameter counts, suggesting pretraining differences may contribute to their efficiency.
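Since only the tiny vector is ever handed to the optimizer, the serialized update is a few bytes. A minimal sketch of that accounting follows; the GRPO objective, rollouts, and learning rate are omitted or illustrative:

```python
import torch
import torch.nn as nn

# During RL, only the shared TinyLoRA vector is optimized; the backbone, SVD factors,
# and random projections all stay frozen. (Sketch only; the GRPO loss is not shown.)
shared_v = nn.Parameter(torch.zeros(13))               # the entire trainable state
optimizer = torch.optim.AdamW([shared_v], lr=1e-3)     # illustrative hyperparameter

# 13 parameters x 2 bytes each in bf16 = 26 bytes, the figure quoted in the abstract.
update_bytes = shared_v.numel() * torch.finfo(torch.bfloat16).bits // 8
print(shared_v.numel(), "parameters,", update_bytes, "bytes in bf16")
```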

