Brothers Develop Open-Source AI Tool to Significantly Reduce Training Costs
In the autumn of 2023, while the world was captivated by the rise of ChatGPT and large language models, two brothers in Sydney, Australia, were grappling with a different problem: why did fine-tuning an open-source model take so long and require such expensive GPUs? Daniel Han, a graduate of the University of New South Wales and a former NVIDIA engineer specializing in algorithm optimization, stared at a sluggish training progress bar on his screen. A single Google Colab T4 GPU couldn’t handle a 13-billion-parameter model; memory overflow was inevitable. Commercial solutions demanded thousands of dollars in high-end hardware. Determined to find a better way, Daniel discussed the problem with his younger brother, Michael. The result was a project that would soon challenge the assumptions of AI training infrastructure: Unsloth.

Daniel’s career at NVIDIA was defined by performance optimization. He made the t-SNE algorithm as much as 2,000 times faster, optimized randomized singular value decomposition (SVD), and maintained Hyperlearn, a machine learning toolkit used by engineers at NASA and Microsoft. Those experiences revealed a key truth: much of the performance bottleneck in AI software comes not from hardware limitations but from the inefficiencies of general-purpose frameworks like PyTorch and TensorFlow. Built for broad compatibility, these tools sacrifice speed and memory efficiency; when code is tailored to a specific use case, massive gains are possible.

For Daniel, the motivation went beyond performance; it was about democratizing AI. “OpenAI and Anthropic are betting on bigger models, more data, and more compute to reach AGI,” he said. “We believe we can get there faster, with smarter algorithms, less energy, and fewer resources, so that AGI can be accessible to everyone.”

In October 2023, the brothers entered the LLM Efficiency Challenge, a competition to train the most accurate language model possible on a single GPU within 24 hours. Instead of chasing accuracy, they focused on speed. Using only free Colab and Kaggle GPUs, they optimized the training pipeline, achieving a 2x speedup and cutting memory usage by 50% without any loss in model quality. That side project was released in December 2023 as Unsloth, a name promising that AI training would no longer be slow and sluggish.

There was no marketing budget and no big team, just code on GitHub and a post on Reddit. Within the first week, thousands of developers tried it. Skepticism was immediate: “How can you be 2x faster with no accuracy loss?” Daniel’s answer was simple: show the math. He published detailed blog posts explaining the hand-derived backpropagation, shared the Triton kernel source code, and released full performance logs. Developers read, tested, and verified. The results held.

Unsloth’s reputation soared in March 2024, when the team uncovered a series of critical bugs in Google’s Gemma model. After release, Gemma trained unstably: losses wouldn’t converge and fine-tuning failed. The community speculated, but no one could pinpoint the cause. Daniel found not one bug but eight, spanning flawed tokenizers, incorrect position encoding, and subtle numerical precision errors. He spent three days documenting each flaw, complete with mathematical derivations, test results, and fixes, then published everything openly. Within hours, the post spread across forums.

Andrej Karpathy shared it, commenting: “This is the value of deeply understanding every layer of the deep learning stack.” Google confirmed the bugs, adopted the fixes, and credited Unsloth in its update notes.

The pattern repeated over the next year. Unsloth quickly analyzed Meta’s Llama 3, Microsoft’s Phi-4, Alibaba’s Qwen 2.5, and more, identifying and resolving issues before they spread. In October 2024, the team discovered a fundamental bug in how gradient accumulation was implemented across training frameworks; the fix was later merged into Hugging Face Transformers, benefiting millions of users. “When we integrate a new model and find our implementation outperforms the official version, we know something’s wrong,” Daniel explained.
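The gradient accumulation bug is easy to state: trainers averaged each micro-batch’s mean cross-entropy as if every micro-batch contained the same number of real (non-padding) tokens, which no longer matches true full-batch training once sequence lengths vary. A toy calculation (illustrative numbers, not taken from Unsloth’s write-up) shows the mismatch:

```python
# Two micro-batches with different numbers of real (non-padding) tokens.
tokens = [10, 90]        # valid tokens per micro-batch
sum_ce = [25.0, 180.0]   # summed cross-entropy per micro-batch

# Full-batch objective: a single mean over all 100 tokens.
full_batch = sum(sum_ce) / sum(tokens)                            # 2.05

# Buggy accumulation: average the per-micro-batch means, implicitly
# weighting the 10-token batch as heavily as the 90-token one.
buggy = sum(s / t for s, t in zip(sum_ce, tokens)) / len(tokens)  # 2.25

# Fixed accumulation: defer the division until all tokens are counted.
fixed = sum(sum_ce) / sum(tokens)                                 # 2.05

print(full_batch, buggy, fixed)
```

The corrected behavior divides once by the total token count of the accumulated batch, as in the last variant, so accumulated and full-batch training again produce the same loss.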
This relentless attention to detail and commitment to open collaboration earned Unsloth deep respect in the community. Hugging Face now officially recommends Unsloth for performance and memory efficiency, and AWS, Intel, and other major companies have reached out to port the framework to their hardware.

At the heart of Unsloth’s success is a complete rethinking of the training pipeline. While most developers rely on PyTorch’s autograd, Daniel took a different path: manually deriving matrix derivatives for performance-critical operations. Combining attention with LoRA (Low-Rank Adaptation) is a good example. The standard approach performs three separate matrix multiplications and stores the intermediates; Unsloth rewrites the computation as output = X × (W + A × B) and, crucially, evaluates the low-rank term as (X × A) × B, so the full-size product A × B is never materialized. This algebraic trick alone delivers 4–6% speed gains and drastically cuts GPU memory. Because LoRA ranks are small (8–128) while hidden dimensions are large (e.g., 4096), choosing the right order of operations avoids redundant work, saving orders of magnitude in floating-point operations.
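A minimal sketch of the bracketing idea in plain PyTorch, with toy shapes chosen to echo the numbers above (Unsloth’s real implementation fuses this reordering into hand-written kernels):

```python
import torch

# Toy shapes: 8 tokens, hidden size 4096, LoRA rank 16.
n, d, r = 8, 4096, 16
X = torch.randn(n, d, dtype=torch.float64)  # activations
W = torch.randn(d, d, dtype=torch.float64)  # frozen base weight
A = torch.randn(d, r, dtype=torch.float64)  # LoRA down-projection
B = torch.randn(r, d, dtype=torch.float64)  # LoRA up-projection

# Naive bracketing: materialize the full d x d update A @ B first.
# A @ B alone costs d * r * d ~ 268M multiply-adds plus a d x d buffer.
out_naive = X @ (W + A @ B)

# Reordered bracketing: every intermediate stays low-rank.
# (X @ A) @ B costs 2 * n * d * r ~ 1M multiply-adds and tiny buffers.
out_fast = X @ W + (X @ A) @ B

assert torch.allclose(out_naive, out_fast)  # identical up to rounding
```

Both orderings compute the same output; the entire saving comes from never forming the d × d update matrix.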
The team also rewrote core training kernels in Triton, OpenAI’s low-level GPU programming language, covering RoPE position encoding, RMS layer normalization, and cross-entropy loss. These handcrafted kernels are faster, and often more readable, than the standard implementations.

Another breakthrough is dynamic quantization: instead of uniformly compressing all layers to 4-bit, Unsloth identifies sensitive layers and preserves them at higher precision, minimizing accuracy loss while maximizing memory savings.
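Unsloth’s own kernels live in its repository; to give a flavor of what such a rewrite looks like, here is a generic single-pass RMSNorm kernel in Triton (a sketch in the same spirit, not Unsloth’s code):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(X, W, Y, N, eps, BLOCK: tl.constexpr):
    # One program instance normalizes one row of X.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < N
    x = tl.load(X + row * N + cols, mask=mask, other=0.0).to(tl.float32)
    rms = tl.sqrt(tl.sum(x * x, axis=0) / N + eps)  # root mean square
    w = tl.load(W + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(Y + row * N + cols, x / rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # x: (rows, N), contiguous, on the GPU; weight: (N,).
    rows, N = x.shape
    y = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(N)  # process a whole row per block
    rmsnorm_kernel[(rows,)](x, weight, y, N, eps, BLOCK=BLOCK)
    return y
```

Fusing the squaring, the reduction, and the scaling into one kernel saves the extra round trips to GPU memory that a chain of separate PyTorch operations would incur; that, more than clever arithmetic, is where such kernels win.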
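The article does not spell out how Unsloth decides which layers are sensitive. One plausible way to implement the selection step, using 4-bit round-trip error as the signal (an assumption for illustration, not Unsloth’s published criterion):

```python
import torch

def roundtrip_error(w: torch.Tensor, bits: int = 4) -> float:
    # Crude symmetric uniform quantization as a stand-in for NF4,
    # used only to estimate how much information a layer would lose.
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4-bit
    scale = w.abs().max() / levels
    q = torch.clamp(torch.round(w / scale), -levels - 1, levels)
    return ((q * scale - w).norm() / w.norm()).item()

def build_precision_plan(weights: dict, threshold: float = 0.05) -> dict:
    # Layers that quantize badly stay in 16-bit; the rest drop to 4-bit.
    return {name: ("bf16" if roundtrip_error(w) > threshold else "4bit")
            for name, w in weights.items()}
```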
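All of this sits behind a small user-facing API. Following Unsloth’s published quickstart (exact argument names can drift between versions), a typical 4-bit LoRA fine-tune starts roughly like this:

```python
from unsloth import FastLanguageModel

# Load a pre-quantized 4-bit checkpoint; this fits on a 16GB T4.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank, in the 8-128 range discussed above
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# From here, training proceeds with a standard trainer such as trl's SFTTrainer.
```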
But the real game-changer is memory. “The 70–80% memory reduction is what matters most,” Daniel emphasizes. As models grow larger, memory becomes the primary bottleneck: a 16GB T4 cannot even load a 13B model for standard training, yet with Unsloth a single 48GB GPU can fine-tune a 70-billion-parameter Llama 3.

Benchmarks show how dramatic the improvements are. Training on the Alpaca dataset with a single T4 takes 23 hours 15 minutes with Hugging Face’s standard setup, but just 2 hours 39 minutes with Unsloth Max, an 8.8x speedup. On SlimOrca, 391 hours shrink to 51. Peak memory drops from 16.7GB to 6.9GB, a 59% reduction.

Unsloth has empowered developers worldwide. With over 47,500 stars on GitHub and more than 2 million monthly model downloads, it has become a cornerstone of open-source AI. Developers from China, Chile, Nicaragua, Guatemala, India, Italy, Turkey, and beyond have fine-tuned over 110 models using Unsloth. One of Daniel’s proudest achievements is language localization: most base models are trained primarily on English, but Unsloth has let developers in non-English-speaking countries adapt models to their native languages, including Japanese, Indonesian, Korean, and dozens of Indian regional languages. A dedicated Korean translation tutorial on GitHub walks through fine-tuning an English model into a Korean one, bringing capable AI to speakers whose languages were previously unserved.

Open source remains central to Unsloth’s mission. While the Pro and Max tiers offer advanced features like multi-GPU support, zero-shot training, and AMD/Intel GPU compatibility, the core framework remains free. “The real value of open source is trust,” Daniel says. “In AI, trust is scarce. If you open your code, anyone can check it, contribute, and fix bugs.”

The brothers’ Discord community is vibrant and full of collaborative problem-solving. “Everyone is friendly,” Michael says. “We’re all passionate about the same thing. Open source brings people together.” When users ask for a feature, the team builds it: no guessing, no closed feedback loops.

Today, Unsloth supports Llama, Mistral, Gemma, Phi, Qwen, DeepSeek, and more. “Our goal is always open source,” Michael insists. “We want every model to benefit from our optimizations, not just a few.” “When big companies train models with 100,000 H100s,” Daniel says, “we prove that smarter software, not just more hardware, can bring AI to everyone.”