
Learning to Reason in 13 Parameters

John X. Morris Niloofar Mireshghallah Mark Ibrahim Saeed Mahloujifar

Abstract

Recent research has shown that language models can learn to reason, often via reinforcement learning. Some work has trained low-rank adapters for reasoning, but the standard LoRA parameterization cannot scale below the minimum model dimension. We ask whether even rank-1 LoRA adapters are necessary for learning to reason, and introduce a new method, TinyLoRA, which scales low-rank adapters down to sizes as small as a single parameter. Under our new parameterization, we train an 8-billion-parameter Qwen2.5 model to 91% accuracy on GSM8K using only 13 trained parameters in bf16 (26 bytes in total). We find that this trend holds in general: we can recover 90% of the performance gains while training far fewer parameters across a range of harder reasoning benchmarks such as AIME, AMC, and MATH500. Notably, we achieve this high performance only with reinforcement learning (RL): models trained with supervised finetuning (SFT) require much larger updates to reach the same level of performance.

One-sentence Summary

Researchers from FAIR at Meta, Cornell, and CMU propose TinyLoRA, enabling reasoning in 8B-parameter Qwen2.5 with just 13 trained parameters via RL—achieving 91% GSM8K accuracy—by exploiting RL’s information-dense updates, unlike SFT, and scaling low-rank adaptation to near-zero parameter regimes.

Key Contributions

  • TinyLoRA enables effective reasoning in large language models using as few as 13 trained parameters by scaling low-rank adapters below rank=1, achieving 91% accuracy on GSM8K with Qwen2.5-8B via reinforcement learning.
  • The method demonstrates consistent efficiency across challenging benchmarks like AIME and MATH500, recovering 90% of performance gains while training 1000x fewer parameters than conventional approaches, but only when using RL—not supervised finetuning.
  • Empirical results show that large models trained with RL require dramatically smaller parameter updates to reach high performance, revealing that reasoning capabilities can be unlocked with updates under 1KB, a scale previously considered insufficient.

Introduction

The authors leverage reinforcement learning to show that large language models can learn complex reasoning tasks with astonishingly few parameters—down to just 13 trainable parameters in some cases. Prior low-rank adaptation methods like LoRA typically operate at scales of 10K to 10M parameters and struggle to scale below model dimension, limiting their efficiency for extreme parameter constraints. TinyLoRA, their proposed method, enables effective adaptation at sub-kilobyte scales by exploiting the inherent low intrinsic dimensionality of overparameterized models under RL, outperforming supervised fine-tuning which requires 100–1000x more parameters to match performance. Their work demonstrates that RL, not SFT, unlocks this extreme efficiency—especially when applied to large backbones—challenging assumptions about how much parameter update is actually needed to teach reasoning.


Method

The authors leverage a parameter-efficient fine-tuning framework built upon low-rank adaptation techniques, introducing TinyLoRA as a method to drastically reduce the number of trainable parameters while preserving model performance. The core idea stems from the observation that even minimal-rank adaptations like LoRA-XS still require at least one parameter per module, which becomes prohibitive when scaling across many layers and attention/MLP components in large transformer architectures.

TinyLoRA redefines the low-rank update by replacing the trainable matrix $R \in \mathbb{R}^{r \times r}$ in LoRA-XS with a low-dimensional trainable vector $\mathbf{v} \in \mathbb{R}^{u}$, projected through a fixed random tensor $P \in \mathbb{R}^{u \times r \times r}$. The updated weight matrix becomes:

$$W' = W + U \Sigma \left( \sum_{i=1}^{u} v_i P_i \right) V^\top$$

where $U$, $\Sigma$, and $V$ are derived from the truncated SVD of the original frozen weight matrix $W$. This formulation allows each module to be adapted with only $u$ trainable parameters, independent of the model width $d$ or the rank $r$.
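To make the construction concrete, here is a minimal PyTorch sketch of a TinyLoRA-adapted linear layer, assuming a frozen weight, a truncated rank-$r$ SVD, and a fixed random projection tensor; the class name, initialization choices, and scaling are our assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class TinyLoRALinear(nn.Module):
    """Sketch of a TinyLoRA layer: W' = W + U Sigma (sum_i v_i P_i) V^T.

    Only the u-dimensional vector v is trainable; everything else is frozen.
    Details (init scale, buffer layout) are assumptions, not the paper's code.
    """

    def __init__(self, weight: torch.Tensor, r: int = 4, u: int = 1):
        super().__init__()
        # Frozen base weight of shape (d_out, d_in).
        self.register_buffer("W", weight)
        # Truncated rank-r SVD of the frozen weight, computed once and frozen.
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U[:, :r])                # (d_out, r)
        self.register_buffer("Sigma", torch.diag(S[:r]))   # (r, r)
        self.register_buffer("Vh", Vh[:r, :])              # (r, d_in)
        # Fixed random projection tensor P in R^{u x r x r}.
        self.register_buffer("P", torch.randn(u, r, r) / r)
        # The only trainable parameters: v in R^u, zero-initialized
        # so that W' = W at the start of training.
        self.v = nn.Parameter(torch.zeros(u))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # R = sum_i v_i P_i: an r x r matrix built from u scalars.
        R = torch.einsum("u,urs->rs", self.v, self.P)
        delta = self.U @ self.Sigma @ R @ self.Vh          # rank-r update
        return x @ (self.W + delta).T
```

Because $U$, $\Sigma$, $V$, and $P$ are all frozen, the optimizer state and the checkpoint delta both scale with $u$ alone, which is what lets the update shrink to a handful of bytes.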

To further minimize parameter count, the authors implement weight tying across modules. In standard transformer architectures such as LLaMA-3, LoRA is typically applied to seven distinct modules per layer (query, key, value, and output in attention; up, down, and gate in the MLP). Without sharing, even $u=1$ yields 560 parameters for an 80-layer model. By tying the vector $\mathbf{v}$ across modules, either within a layer or across the entire model, the total trainable parameter count scales as $\mathcal{O}(nmu / n_{\text{tie}})$, where $n$ is the number of layers, $m$ the number of adapted modules per layer, and $n_{\text{tie}}$ the number of modules sharing a single $\mathbf{v}$. With full weight tying ($n_{\text{tie}} = nm$), the entire model can be fine-tuned with just $u$ parameters, potentially as few as one.

The paper's per-layer parameter usage comparison illustrates how TinyLoRA reduces trainable parameters relative to LoRA and LoRA-XS under varying configurations of rank, projection dimension, and weight tying; a back-of-the-envelope version of that comparison is sketched below.
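As a rough illustration of the scaling just described, the following calculation compares trainable-parameter counts for LoRA, LoRA-XS, and TinyLoRA at the 80-layer, 7-modules-per-layer configuration mentioned above; the model width and ranks here are assumptions chosen for illustration, not figures from the paper.

```python
# Hypothetical configuration: an 80-layer model, width d = 8192,
# 7 adapted modules per layer (q, k, v, o, up, down, gate).
n_layers, n_modules, d = 80, 7, 8192

# LoRA at rank r trains two factors, A (r x d) and B (d x r), per module.
r = 8
lora = n_layers * n_modules * (2 * d * r)

# LoRA-XS trains one r x r matrix per module (U, Sigma, V stay frozen).
lora_xs = n_layers * n_modules * (r * r)

# TinyLoRA trains a u-dimensional vector per group of tied modules:
# total = n * m * u / n_tie.
u = 13
n_tie = n_layers * n_modules   # full weight tying: one shared v
tiny_lora = n_layers * n_modules * u // n_tie

print(lora)       # 73,400,320
print(lora_xs)    # 35,840
print(tiny_lora)  # 13
```

With full tying, the update reduces to the $u$ scalars in $\mathbf{v}$: at $u = 13$ in bf16 (2 bytes per parameter), that is the 26-byte update quoted in the abstract.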

Experiment

  • Reinforcement learning (RL) enables dramatically smaller model updates than supervised finetuning (SFT), achieving strong math reasoning performance with as few as 13 parameters.
  • TinyLoRA, an ultra-low-rank variant, scales smoothly down to a single trained parameter and recovers 95% of full finetuning performance on GSM8K with under 100 parameters.
  • RL-based training (using GRPO) is uniquely effective in low-parameter regimes; SFT fails to match performance at comparable update sizes, indicating RL produces more information-dense updates.
  • Performance improves with model scale: larger models like Qwen-2.5-7B achieve near-full performance with fewer absolute parameters, suggesting trillion-scale models may be trainable with minimal updates.
  • Qwen models outperform LLaMA at small update sizes, possibly due to architectural or pretraining differences, requiring roughly 10x fewer parameters for equivalent gains.
  • Parameter-sharing strategies matter: tiled sharing (by depth) outperforms structured sharing (by module type), and fp32 precision yields better results than bf16/float16 despite larger size.
  • Ablations show diminishing returns with higher frozen rank; optimal TinyLoRA design favors maximizing per-module expressivity (higher u) before increasing parameter sharing (n_tie).
  • Findings are currently limited to math reasoning tasks; generalization to other domains like science or creative writing remains unverified.

The authors use reinforcement learning with TinyLoRA to finetune Qwen models on math reasoning tasks, achieving near-full-finetuning performance with as few as 13 to 196 parameters. Results show that smaller parameter updates are far more effective under RL than supervised finetuning, especially for larger models, which can reach high accuracy with minimal parameter changes. Performance scales smoothly with update size, and Qwen models consistently outperform others at low parameter counts, suggesting pretraining differences may contribute to their efficiency.

