
DuoAttention Framework

DuoAttention is a framework proposed in 2024 by Song Han's team at the Massachusetts Institute of Technology (MIT) to improve the inference efficiency of large language models (LLMs) when processing long contexts. The related paper is titled "DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads".

This framework optimizes the model's memory usage and computation speed by distinguishing between two types of attention heads: "Retrieval Heads" and "Streaming Heads". Retrieval heads handle long-range dependencies and require a full key-value (KV) cache, while streaming heads attend only to recent tokens and attention sinks and need just a fixed-length KV cache. This design significantly reduces memory usage and latency during both decoding and pre-filling while preserving the model's ability to handle long contexts.
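To make the distinction concrete, the sketch below contrasts the two cache policies. It is a minimal illustration rather than the paper's implementation: the per-head tensor layout ([seq_len, head_dim]) and the sink/window sizes (n_sink, n_recent) are assumptions chosen for readability.

```python
import torch

def retrieval_head_cache(keys: torch.Tensor, values: torch.Tensor):
    """Retrieval heads keep the full KV cache, so long-range
    dependencies anywhere in the context remain addressable."""
    return keys, values

def streaming_head_cache(keys: torch.Tensor, values: torch.Tensor,
                         n_sink: int = 4, n_recent: int = 1020):
    """Streaming heads keep a fixed-length cache: the initial
    attention-sink tokens plus a sliding window of recent tokens.
    Memory stays constant no matter how long the context grows."""
    seq_len = keys.shape[0]
    if seq_len <= n_sink + n_recent:
        return keys, values  # nothing to evict yet
    keep = torch.cat([
        torch.arange(n_sink),                       # attention sinks
        torch.arange(seq_len - n_recent, seq_len),  # recent window
    ])
    return keys[keep], values[keep]
```

Because the streaming-head cache is capped at n_sink + n_recent entries, its cost is constant in context length; only the retrieval heads pay the linear cost of a full cache.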

DuoAttention optimizes memory and compute by applying a full KV cache to retrieval heads and a lightweight, fixed-length KV cache to streaming heads. This improves decoding speed and pre-filling efficiency and lowers latency on long texts: memory usage is reduced by up to 2.55 times for multi-head attention (MHA) models and up to 1.67 times for grouped-query attention (GQA) models; decoding speed increases by up to 2.18 times (MHA) and 1.50 times (GQA); and pre-filling speed increases by up to 1.73 times (MHA) and 1.63 times (GQA), all with minimal accuracy loss compared to full attention. Notably, combined with quantization techniques, DuoAttention enables decoding of the Llama-3-8B model at a context length of 3.3 million tokens on a single A100 GPU.
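A back-of-the-envelope calculation shows why quantization and DuoAttention are needed together to reach that context length. The sketch below uses Llama-3-8B's published GQA geometry (32 layers, 8 KV heads, head dimension 128); the 25% retrieval-head fraction and 1024-token streaming cache are illustrative assumptions, since the actual ratio is identified per model by DuoAttention's optimization procedure.

```python
# Llama-3-8B KV geometry: 32 layers, 8 KV heads (GQA), head dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_cache_gib(tokens, bytes_per_value, retrieval_frac=1.0, stream_len=1024):
    """Approximate KV-cache size in GiB. `retrieval_frac` is the share of
    heads kept as retrieval heads with a full-length cache (illustrative);
    streaming heads hold only `stream_len` tokens each (assumed)."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value  # K and V
    full = retrieval_frac * tokens * per_token
    streaming = (1 - retrieval_frac) * stream_len * per_token
    return (full + streaming) / 2**30

tokens = 3_300_000
print(f"Full cache, FP16:           {kv_cache_gib(tokens, 2):6.1f} GiB")          # ~402.8
print(f"Full cache, 4-bit:          {kv_cache_gib(tokens, 0.5):6.1f} GiB")        # ~100.7
print(f"DuoAttention (25%) + 4-bit: {kv_cache_gib(tokens, 0.5, 0.25):6.1f} GiB")  # ~25.2
```

Under these assumptions, a full FP16 cache at 3.3 million tokens would need roughly 400 GiB, far beyond a single A100's 80 GB; only the combination of capping the cache for streaming heads and quantizing what remains brings it within reach.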