Diff Transformer
Differential Transformer (Diff Transformer for short) is a Transformer architecture proposed jointly by Microsoft Research and Tsinghua University in 2024. The work is described in the paper "Differential Transformer", whose authors include Tianzhu Ye, Li Dong, Yuqing Xia, and Yutao Sun. The core of the architecture is its differential attention mechanism, which addresses the difficulty traditional Transformers have in accurately retrieving key information from long texts, the so-called "lost in the middle" phenomenon.
Diff Transformer computes two independent softmax attention maps and takes their difference to obtain the final attention scores. Subtracting the two maps cancels out common attention noise and encourages the model to focus on the most relevant parts of the input. The mechanism is analogous to noise-cancelling headphones, or to differential amplifiers in electrical engineering, which remove noise by taking the difference between two signals.
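To make the idea concrete, here is a minimal single-head NumPy sketch of differential attention. The function name, projection matrices, dimensions, and the fixed lambda value are illustrative assumptions, not the paper's actual implementation, which uses a learnable, reparameterized per-head lambda along with causal masking and per-head normalization that are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Sketch of differential attention: two softmax attention maps are
    computed from separate query/key projections, then subtracted with a
    weight lambda so that shared "noise" attention cancels out."""
    d = Wk1.shape[1]                       # per-map query/key dimension
    Q1, K1 = X @ Wq1, X @ Wk1              # first query/key projection
    Q2, K2 = X @ Wq2, X @ Wk2              # second query/key projection
    V = X @ Wv
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))   # first softmax attention map
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))   # second softmax attention map
    return (A1 - lam * A2) @ V             # differential attention output

# Toy usage with random weights (illustrative only)
rng = np.random.default_rng(0)
n, d_model, d = 6, 16, 8
X = rng.standard_normal((n, d_model))
Wq1, Wk1 = rng.standard_normal((d_model, d)), rng.standard_normal((d_model, d))
Wq2, Wk2 = rng.standard_normal((d_model, d)), rng.standard_normal((d_model, d))
Wv = rng.standard_normal((d_model, 2 * d))
out = diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.8)
print(out.shape)  # (6, 16)
```

The key point of the sketch is the final line: because both maps are softmax-normalized, attention weight spread over irrelevant positions tends to appear in both maps and is suppressed by the subtraction, while attention concentrated on genuinely relevant tokens survives.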
Experimental results show that Diff Transformer outperforms the traditional Transformer on language modeling across a variety of settings. It scales well with model size and number of training tokens, and it shows clear advantages in practical applications such as long-context modeling, key-information retrieval, hallucination mitigation, and in-context learning. In addition, Diff Transformer reduces outliers in model activations, which makes it friendlier to quantization and improves model efficiency.
The introduction of Diff Transformer offers new directions for the development of large language models and is expected to play an important role in areas such as intelligent dialogue systems, text generation, and data extraction.