
Knowledge Distillation

Knowledge distillation is a machine learning technique that transfers what a large pre-trained model (the "teacher model") has learned to a smaller "student model". It is used as a form of model compression and knowledge transfer in deep learning, and is particularly well suited to large-scale deep neural networks.

The goal of knowledge distillation is to train a more compact model to imitate a larger, more complex one. Whereas traditional deep learning trains an artificial neural network to bring its predictions closer to the labels provided in the training dataset, knowledge distillation trains the student network to match the predictions of the teacher network.
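As an illustration, the sketch below shows one common way this matching is implemented: a soft-target distillation loss in the style of Hinton et al., which blends a temperature-softened KL-divergence term (student vs. teacher predictions) with ordinary cross-entropy on the ground-truth labels. It assumes PyTorch; the function name, the temperature, and the weighting factor `alpha` are illustrative choices, not details taken from the source.

```python
# Minimal sketch of a soft-target distillation loss (assumes PyTorch).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend a soft-target KL term with hard-label cross-entropy."""
    # Soft targets: soften both distributions with the temperature,
    # then push the student toward the teacher with KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_term = F.kl_div(soft_student, soft_targets, reduction="batchmean")
    kd_term = kd_term * temperature ** 2  # rescale gradients of the soft term

    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```

During training, the teacher's logits are computed in inference mode (no gradients) and only the student's parameters are updated with this loss.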

Knowledge distillation (KD) is most commonly used with large deep neural networks that have many layers and learnable parameters. The process is particularly relevant for emerging large-scale generative AI models with billions of parameters.

The concept originated in the 2006 paper "Model Compression" by Caruana et al., which used a state-of-the-art classifier (a large ensemble model consisting of hundreds of base classifiers) to label a large dataset, and then trained a single neural network on the newly labeled dataset through traditional supervised learning.
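To make that recipe concrete, here is a minimal sketch of the procedure just described, assuming scikit-learn; the random-forest teacher, the synthetic data, and the small MLP student are illustrative stand-ins for the original ensemble and dataset, not the paper's actual setup.

```python
# Sketch of the "Model Compression" recipe: teacher labels a large
# unlabeled dataset, student is trained on those labels (assumes scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-ins: a labeled training set and a large pool of unlabeled inputs.
X_train = rng.normal(size=(500, 20))
y_train = rng.integers(0, 3, size=500)
X_unlabeled = rng.normal(size=(5000, 20))

# 1. Train a large, accurate "teacher" (here, an ensemble).
teacher = RandomForestClassifier(n_estimators=300).fit(X_train, y_train)

# 2. Use the teacher to label the large unlabeled dataset.
pseudo_labels = teacher.predict(X_unlabeled)

# 3. Train a single compact "student" model on the newly labeled data
#    with ordinary supervised learning.
student = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300)
student.fit(X_unlabeled, pseudo_labels)
```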

Knowledge distillation techniques have been successfully applied in various fields, including natural language processing (NLP), speech recognition, image recognition, and object detection. In recent years, research on knowledge distillation has become particularly important for large language models (LLMs). For LLMs, knowledge distillation has become an effective means of transferring advanced capabilities from leading proprietary models to smaller, more accessible open-source models.
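For language models, one widely used variant is response-based (sequence-level) distillation: the teacher generates responses to a set of prompts, and the student is fine-tuned on those responses with the ordinary language-modeling objective. The sketch below assumes Hugging Face transformers; the model names and prompt are placeholders, only a single gradient step is shown, and in practice the teacher is often a proprietary model accessed through an API rather than a local checkpoint.

```python
# Sketch of response-based distillation for language models
# (assumes Hugging Face transformers; model names are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "gpt2-large"   # placeholder teacher
student_name = "gpt2"         # placeholder student

tok = AutoTokenizer.from_pretrained(teacher_name)
tok.pad_token = tok.eos_token  # GPT-2 tokenizers have no pad token by default
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student = AutoModelForCausalLM.from_pretrained(student_name)

prompts = ["Explain knowledge distillation in one sentence."]

# 1. Collect teacher responses to a set of prompts.
with torch.no_grad():
    inputs = tok(prompts, return_tensors="pt")
    outputs = teacher.generate(**inputs, max_new_tokens=64)
teacher_texts = tok.batch_decode(outputs, skip_special_tokens=True)

# 2. Fine-tune the student on the teacher's responses with the usual
#    causal language-modeling loss (one gradient step shown).
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
batch = tok(teacher_texts, return_tensors="pt", padding=True)
loss = student(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
```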
