DeepSeek's Chatbot R1 Sparks Controversy and Highlights the Power of AI Distillation Techniques
Chinese AI company DeepSeek made headlines this year with the release of its chatbot, R1. The attention was partly due to R1's performance, which reportedly rivaled that of models from established AI giants such as OpenAI, Google, and Anthropic, despite being built with far less computing power and money. The achievement rattled the AI industry and triggered a steep sell-off in major tech stocks, with Nvidia losing more market value in a single day than any company in history.

DeepSeek's success also sparked accusations that the company had illicitly extracted knowledge from OpenAI's proprietary model, o1, using a technique known as distillation, or knowledge distillation. In fact, distillation is a well-established and widely used method in AI, dating back to 2015. It creates smaller, more efficient "student" models by training them on the outputs of larger, more complex "teacher" models. Those outputs include the teacher's probability distributions over possible predictions, nuanced information that helps the student model learn faster and more accurately.

The idea was introduced by researchers at Google, among them Geoffrey Hinton, often called the "godfather of AI." Hinton, together with Oriol Vinyals and Jeff Dean, observed that traditional machine learning algorithms treated all incorrect answers as equally wrong, which was inefficient. They proposed that a smaller model could benefit from knowing the relative likelihoods of different wrong answers, information they dubbed "dark knowledge." Vinyals developed a way to transfer this dark knowledge from a large model to a smaller one by focusing on "soft targets": the probabilities the teacher assigns to each possible outcome. For example, if a large model judges an image to be 30% likely a dog, 20% a cat, 5% a cow, and 0.5% a car, those probabilities tell the student that dogs and cats resemble each other more than either resembles a cow or a car. Training on this richer signal lets the smaller model reach similar performance with far fewer resources. (A short code sketch of this soft-target loss appears below.)

Although the original paper was initially rejected, distillation gained traction as models grew larger and more resource-intensive. In 2019, Google's powerful language model BERT was distilled into a smaller, cheaper version called DistilBERT, which quickly became a popular choice in both business and research settings. Today, distillation is standard practice among major AI players, including Google, OpenAI, and Amazon.

True distillation, however, requires access to the internal workings of the teacher model, so a third party cannot covertly distill a closed-source model like OpenAI's o1 in this way. What a third party can do is train a student model by repeatedly querying the teacher and learning from its responses, a method akin to Socratic questioning. (A hypothetical sketch of that data-collection loop also appears below.)

Recent research continues to find new uses for distillation. In January, the NovaSky lab at the University of California, Berkeley, showed that distillation works well for training chain-of-thought reasoning models. Its open-source Sky-T1 model cost less than $450 to train yet achieved results comparable to those of much larger models. Dacheng Li, a doctoral student and co-student lead of the NovaSky team, expressed surprise at how well the technique worked in this context, emphasizing its fundamental importance in AI. Industry insiders view distillation as a crucial tool for making AI more accessible and efficient.
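To make the soft-target idea concrete, here is a minimal, illustrative sketch in PyTorch of the classic distillation loss, roughly in the spirit of Hinton and colleagues' 2015 formulation: the student is trained to match the teacher's temperature-softened probabilities in addition to the ordinary hard labels. The temperature, the weighting factor, and the function names here are assumptions for illustration, not details of DeepSeek's or OpenAI's systems.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a soft-target loss (match the teacher) with a hard-label loss."""
    # Soft targets: the teacher's probabilities, softened by temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between teacher and student distributions,
    # scaled by T^2 as in the original 2015 formulation
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    # Ordinary cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: random logits stand in for real teacher and student outputs.
torch.manual_seed(0)
teacher_logits = torch.randn(8, 10)                      # 8 examples, 10 classes
student_logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(f"combined distillation loss: {loss.item():.4f}")
```

In a real setup, teacher_logits would come from a forward pass of the large pretrained model and student_logits from the smaller model being trained.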
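When only a closed model's text responses are available, as with a proprietary system served over an API, the querying approach described above amounts to collecting prompt-and-response pairs and fine-tuning a student on them. The sketch below is hypothetical: query_teacher is a stand-in for whatever interface the teacher sits behind, not a real client call, and the prompts and file name are made up for illustration.

```python
import json

def query_teacher(prompt: str) -> str:
    """Hypothetical stand-in for an API call to a closed teacher model."""
    return f"[teacher's answer to: {prompt}]"

prompts = [
    "Explain why the sky is blue.",
    "Summarize the causes of World War I in two sentences.",
]

# Collect prompt/response pairs as supervised fine-tuning data for the student.
with open("student_training_data.jsonl", "w") as f:
    for prompt in prompts:
        answer = query_teacher(prompt)
        f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")
```

The key difference from true distillation is visible here: only the teacher's final text is captured, not the probability distributions that carry its "dark knowledge."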
The technique not only reduces computational costs but also democratizes advanced AI capabilities, letting smaller companies achieve results that were previously attainable only by tech behemoths. Companies like DeepSeek are pushing those boundaries, using distillation to build competitive AI systems at a fraction of the usual cost and highlighting how quickly the AI landscape can now change.