DuoAttention Framework

Date

a year ago

DuoAttention is a framework proposed in 2024 by Song Han's team at the Massachusetts Institute of Technology (MIT). It aims to improve the inference efficiency of large language models (LLMs) when processing long contexts. It is described in the paper "DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads".

The framework reduces memory usage and improves computational speed by distinguishing between two types of attention heads: "Retrieval Heads" and "Streaming Heads". Retrieval heads handle long-distance dependencies and require a complete key-value (KV) cache, while streaming heads attend mainly to the most recent tokens and to attention sinks (the initial tokens that attract a large share of attention), so they only need a fixed-length KV cache. This design significantly reduces memory usage and latency during both decoding and pre-filling, while preserving the model's ability to handle long contexts.
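The difference between the two cache policies can be illustrated with a toy example. This is only a minimal sketch, not the authors' implementation: the class names `FullKVCache` and `StreamingKVCache`, the `num_sink` and `num_recent` parameters, and the sizes used are all hypothetical, and keys and values are plain strings rather than tensors.

```python
from collections import deque


class FullKVCache:
    """Toy cache for a retrieval head: keeps every entry it has seen."""

    def __init__(self):
        self.entries = []

    def append(self, position, key, value):
        self.entries.append((position, key, value))

    def positions(self):
        return [p for p, _, _ in self.entries]


class StreamingKVCache:
    """Toy cache for a streaming head: keeps the first `num_sink` entries
    (attention sinks) plus the `num_recent` most recent entries."""

    def __init__(self, num_sink, num_recent):
        self.num_sink = num_sink
        self.sink = []
        # A bounded deque silently drops its oldest entry when full.
        self.recent = deque(maxlen=num_recent)

    def append(self, position, key, value):
        entry = (position, key, value)
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)

    def positions(self):
        return [p for p, _, _ in self.sink] + [p for p, _, _ in self.recent]


# Hypothetical sizes, chosen only for illustration.
full = FullKVCache()
streaming = StreamingKVCache(num_sink=4, num_recent=16)
for pos in range(1000):
    key, value = f"k{pos}", f"v{pos}"  # stand-ins for real key/value tensors
    full.append(pos, key, value)
    streaming.append(pos, key, value)

print(len(full.positions()))       # grows with the sequence: 1000
print(len(streaming.positions()))  # stays fixed: 4 + 16 = 20
```

The full cache grows linearly with the number of tokens, while the streaming cache stays at a constant size regardless of sequence length, which is where the memory savings for streaming heads come from.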

DuoAttention optimizes memory and compute by applying a full KV cache to retrieval heads and a lightweight, fixed-length KV cache to streaming heads. This improves decoding speed and pre-filling efficiency and reduces latency on long inputs. Memory usage for long-context inference is reduced by up to 2.55x for multi-head attention (MHA) models and up to 1.67x for grouped-query attention (GQA) models. Decoding is up to 2.18x faster for MHA models and up to 1.50x faster for GQA models, and pre-filling is up to 1.73x faster for MHA models and up to 1.63x faster for GQA models, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables decoding with Llama-3-8B at a context length of 3.3 million tokens on a single A100 GPU.
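To get a rough feel for why a fixed-length cache on some heads saves memory, the KV cache size can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope estimate, not a reproduction of the reported numbers: the function `kv_cache_bytes`, the 50% retrieval-head fraction, and the sink and recent window sizes are illustrative assumptions. It counts only the KV cache in 16-bit precision and ignores model weights, activations, and quantization, so its ratio will not match the end-to-end figures above. The layer count, KV head count, and head dimension are those of Llama-3-8B.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, retrieval_fraction=1.0,
                   sink_tokens=0, recent_tokens=0):
    """Estimate KV cache size in bytes.

    Retrieval heads cache all `seq_len` tokens; streaming heads cache at
    most `sink_tokens + recent_tokens` tokens. All arguments after
    `head_dim` are illustrative assumptions.
    """
    per_head_token = 2 * head_dim * bytes_per_elem  # one key + one value
    total_heads = num_layers * num_kv_heads
    retrieval_heads = retrieval_fraction * total_heads
    streaming_heads = total_heads - retrieval_heads
    streaming_len = min(seq_len, sink_tokens + recent_tokens)
    return per_head_token * (retrieval_heads * seq_len
                             + streaming_heads * streaming_len)


# Llama-3-8B shape: 32 layers, 8 KV heads (GQA), head dimension 128.
seq_len = 1_000_000
full_bytes = kv_cache_bytes(seq_len, 32, 8, 128)
duo_bytes = kv_cache_bytes(seq_len, 32, 8, 128,
                           retrieval_fraction=0.5,   # assumed split
                           sink_tokens=64,           # assumed window sizes
                           recent_tokens=256)

print(f"full attention KV cache: {full_bytes / 1e9:.1f} GB")
print(f"mixed-head KV cache:     {duo_bytes / 1e9:.1f} GB")
print(f"KV cache reduction:      {full_bytes / duo_bytes:.2f}x")
```

Under these assumptions the KV cache alone shrinks by almost exactly the inverse of the retrieval-head fraction, since the streaming heads' fixed window is negligible next to a million-token context.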
