Learning to Reason with 13 Parameters
John X. Morris Niloofar Mireshghallah Mark Ibrahim Saeed Mahloujifar
Abstract
Recent research has shown that language models are capable of reasoning, often learned via reinforcement learning. Some approaches even train low-rank parameterizations for reasoning, but conventional LoRA cannot be scaled below the model dimension. We ask whether even rank-1 LoRA is necessary for learning to reason, and introduce TinyLoRA, a method for scaling low-rank adapters down to sizes as small as a single parameter. Under our new parameterization, we are able to train the 8-billion-parameter Qwen2.5 model to 91% accuracy on GSM8K with only 13 trainable parameters in bf16 format (26 bytes in total). We find that this trend holds broadly: across a range of challenging reasoning benchmarks such as AIME, AMC, and MATH500, we recover 90% of the performance improvement while training significantly fewer parameters. Notably, we can only achieve such strong performance within a reinforcement learning setup: models trained with SFT (supervised fine-tuning) require substantially larger updates to reach the same performance.
One-sentence Summary
Researchers from FAIR at Meta, Cornell, and CMU propose TinyLoRA, enabling reasoning in 8B-parameter Qwen2.5 with just 13 trained parameters via RL—achieving 91% GSM8K accuracy—by exploiting RL’s information-dense updates, unlike SFT, and scaling low-rank adaptation to near-zero parameter regimes.
Key Contributions
- TinyLoRA enables effective reasoning in large language models using as few as 13 trained parameters by scaling low-rank adapters below rank=1, achieving 91% accuracy on GSM8K with Qwen2.5-8B via reinforcement learning.
- The method demonstrates consistent efficiency across challenging benchmarks like AIME and MATH500, recovering 90% of performance gains while training 1000x fewer parameters than conventional approaches, but only when using RL—not supervised finetuning.
- Empirical results show that large models trained with RL require dramatically smaller parameter updates to reach high performance, revealing that reasoning capabilities can be unlocked with updates under 1KB, a scale previously considered insufficient.
Introduction
The authors leverage reinforcement learning to show that large language models can learn complex reasoning tasks with astonishingly few parameters—down to just 13 trainable parameters in some cases. Prior low-rank adaptation methods like LoRA typically operate at scales of 10K to 10M parameters and struggle to scale below model dimension, limiting their efficiency for extreme parameter constraints. TinyLoRA, their proposed method, enables effective adaptation at sub-kilobyte scales by exploiting the inherent low intrinsic dimensionality of overparameterized models under RL, outperforming supervised fine-tuning which requires 100–1000x more parameters to match performance. Their work demonstrates that RL, not SFT, unlocks this extreme efficiency—especially when applied to large backbones—challenging assumptions about how much parameter update is actually needed to teach reasoning.

Method
The authors leverage a parameter-efficient fine-tuning framework built upon low-rank adaptation techniques, introducing TinyLoRA as a method to drastically reduce the number of trainable parameters while preserving model performance. The core idea stems from the observation that even minimal-rank adaptations like LoRA-XS still require at least one parameter per module, which becomes prohibitive when scaling across many layers and attention/MLP components in large transformer architectures.
TinyLoRA redefines the low-rank update by replacing the trainable matrix $R \in \mathbb{R}^{r \times r}$ in LoRA-XS with a low-dimensional trainable vector $v \in \mathbb{R}^{u}$, projected through a fixed random tensor $P \in \mathbb{R}^{u \times r \times r}$. The updated weight matrix becomes
$$W' = W + U \Sigma \left( \sum_{i=1}^{u} v_i P_i \right) V^\top,$$
where $U$, $\Sigma$, and $V$ are derived from the truncated SVD of the original frozen weight matrix $W$. This formulation allows each module to be adapted with only $u$ trainable parameters, independent of the model width $d$ or the rank $r$.
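The following is a minimal PyTorch sketch of this parameterization for a single linear module; the class and variable names (e.g. `TinyLoRALinear`) are ours for illustration and are not the authors' implementation.

```python
import torch

class TinyLoRALinear(torch.nn.Module):
    """Sketch of a TinyLoRA-adapted linear layer: only the u-dim vector v is trained."""
    def __init__(self, weight: torch.Tensor, r: int = 4, u: int = 1):
        super().__init__()
        # Truncated SVD of the frozen weight: W ~ U diag(S) Vh, keeping the top-r components.
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("W", weight)                 # frozen original weight
        self.register_buffer("U", U[:, :r])               # top-r left singular vectors
        self.register_buffer("S", S[:r])                  # top-r singular values
        self.register_buffer("Vh", Vh[:r, :])             # top-r right singular vectors
        self.register_buffer("P", torch.randn(u, r, r))   # fixed random projection tensor
        self.v = torch.nn.Parameter(torch.zeros(u))       # the only trained parameters

    def effective_weight(self) -> torch.Tensor:
        # R = sum_i v_i * P_i is an r x r mixing matrix built from u scalars.
        R = torch.einsum("u,urs->rs", self.v, self.P)
        # W' = W + U diag(S) R V^T, matching the update above.
        return self.W + self.U @ torch.diag(self.S) @ R @ self.Vh

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.effective_weight().T
```

Initializing `v` to zero means the adapted layer starts out identical to the frozen model, as in standard LoRA-style methods.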
To further minimize the parameter count, the authors implement weight tying across modules. In standard transformer architectures such as LLaMA-3, LoRA is typically applied to seven distinct modules per layer (query, key, value, and output in attention; up, down, and gate in the MLP). Without sharing, even $u=1$ yields 560 parameters for an 80-layer model. By tying the vector $v$ across modules—either within a layer or across the entire model—the total number of trainable parameters scales as $O(n_m u / n_{\text{tie}})$, where $n_m$ is the number of adapted modules and $n_{\text{tie}}$ is the number of modules sharing a single $v$. With full weight tying ($n_{\text{tie}} = n_m$), the entire model can be fine-tuned with just $u$ parameters—potentially as few as one.
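A hypothetical sketch of the tying scheme: one trained vector `v` drives every adapted module, while each module keeps its own fixed random tensor `P`, so the resulting $r \times r$ mixing matrices still differ per module. The module counts and names below are assumptions for illustration.

```python
import torch

u, r = 1, 4
n_layers, n_modules_per_layer = 80, 7          # e.g. q, k, v, o, up, down, gate

shared_v = torch.nn.Parameter(torch.zeros(u))  # the only trained parameters under full tying

# Each module keeps its own frozen random projection tensor.
fixed_P = {
    (layer, module): torch.randn(u, r, r)
    for layer in range(n_layers)
    for module in range(n_modules_per_layer)
}

# Per-module mixing matrix R = sum_i v_i * P_i, all driven by the same shared vector.
R = {key: torch.einsum("u,urs->rs", shared_v, P) for key, P in fixed_P.items()}

# Untied (one vector per module) vs. fully tied (one vector for the whole model).
print(n_layers * n_modules_per_layer * u, shared_v.numel())   # 560 1
```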
Refer to the parameter usage comparison per layer, which illustrates how TinyLoRA reduces trainable parameters relative to LoRA and LoRA-XS under varying configurations of rank, projection dimension, and weight tying.
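For a rough sense of that gap, a back-of-the-envelope comparison of trainable parameters per adapted module is sketched below; the width and ranks are assumed values for illustration, not the paper's exact configuration.

```python
d, r, u = 4096, 4, 1                 # assumed model width, adapter rank, TinyLoRA vector size

lora_per_module     = 2 * d * r      # LoRA: A (d x r) and B (r x d)      -> 32768
lora_xs_per_module  = r * r          # LoRA-XS: trainable R (r x r)       ->    16
tinylora_per_module = u              # TinyLoRA: vector v, before tying   ->     1

print(lora_per_module, lora_xs_per_module, tinylora_per_module)
```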
Experiment
- Reinforcement learning (RL) enables dramatically smaller model updates than supervised finetuning (SFT), achieving strong math reasoning performance with as few as 13 parameters.
- TinyLoRA, an ultra-low-rank variant, scales smoothly down to a single trained parameter and recovers 95% of full finetuning performance on GSM8K with under 100 parameters.
- RL-based training (using GRPO) is uniquely effective in low-parameter regimes; SFT fails to match performance at comparable update sizes, indicating RL produces more information-dense updates.
- Performance improves with model scale: larger models like Qwen-2.5-7B achieve near-full performance with fewer absolute parameters, suggesting trillion-scale models may be trainable with minimal updates.
- Qwen models outperform LLaMA at small update sizes, possibly due to architectural or pretraining differences, requiring roughly 10x fewer parameters for equivalent gains.
- Parameter-sharing strategies matter: tiled sharing (by depth) outperforms structured sharing (by module type), and fp32 precision yields better results than bf16/float16 despite larger size.
- Ablations show diminishing returns with higher frozen rank; the optimal TinyLoRA design favors maximizing per-module expressivity (higher $u$) before increasing parameter sharing (larger $n_{\text{tie}}$).
- Findings are currently limited to math reasoning tasks; generalization to other domains like science or creative writing remains unverified.
The authors use reinforcement learning with TinyLoRA to finetune Qwen models on math reasoning tasks, achieving near-full-finetuning performance with as few as 13 to 196 parameters. Results show that smaller parameter updates are far more effective under RL than supervised finetuning, especially for larger models, which can reach high accuracy with minimal parameter changes. Performance scales smoothly with update size, and Qwen models consistently outperform others at low parameter counts, suggesting pretraining differences may contribute to their efficiency.
