Why GPU Workloads Should Separate Prefill and Decode
Large language model inference consists of two distinct phases, prefill and decode, which place opposing demands on GPU hardware. Prefill processes the entire prompt in parallel and leans heavily on tensor-core compute, while decode generates tokens one at a time and is constrained primarily by memory bandwidth. Running both phases on the same GPU pool wastes resources: GPU utilization can reach 90 percent during prefill but drop to 20 percent during decode, so the system pays for high-end compute capacity it does not need for most of the request lifecycle.

To address this, the industry is shifting toward disaggregated inference, a strategy that places prefill and decode on separate hardware pools. The approach was popularized by a 2024 paper from UC San Diego's Hao AI Lab and has since been adopted in production by companies including Perplexity, Meta, and Mistral, and in NVIDIA's Dynamo framework. In this architecture, a KV-aware router directs incoming requests to specialized prefill workers. Once a worker has processed the prompt and generated the key-value (KV) cache, the cache is transferred over a high-speed network to decode workers, which handle autoregressive token generation.

The primary economic benefit is right-sizing hardware. Prefill workers can be optimized for floating-point throughput and need less memory bandwidth, while decode workers need large HBM capacity and bandwidth but less raw compute. This lets organizations avoid overprovisioning: a GPU sized for prefill peaks is often wasteful during decode, and a decode-optimized machine copes poorly with prefill bursts. Disaggregated setups can reduce infrastructure costs by 15 to 40 percent by eliminating idle compute and memory resources.

However, this separation introduces a new cost: the latency of moving the KV cache across the network. For large models like Llama 70B with long prompts, the cache can exceed 1 gigabyte.
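The 1 gigabyte figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the published Llama 2 70B shape (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and a 16-bit cache. The function names, the 4,096-token prompt, and the 50 GB/s link rate are illustrative assumptions, not figures from any particular deployment.

```python
def kv_cache_bytes(num_tokens, num_layers=80, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes of KV cache for one sequence (keys and values, all layers).

    Defaults assume a Llama 2 70B-style shape with a 16-bit cache.
    """
    # Factor of 2 covers the separate key and value tensors in each layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


def transfer_ms(num_bytes, link_bytes_per_s):
    """Idealized wire time in milliseconds.

    Ignores protocol overhead, contention, and any pipelining.
    """
    return 1000.0 * num_bytes / link_bytes_per_s


if __name__ == "__main__":
    size = kv_cache_bytes(4096)  # hypothetical 4,096-token prompt
    print(f"KV cache: {size / 2**30:.2f} GiB")
    # 50 GB/s is an assumed link rate (roughly a 400 Gb/s fabric).
    print(f"Transfer at 50 GB/s: {transfer_ms(size, 50e9):.1f} ms")
```

Under these assumptions each token costs 320 KiB of cache, so a 4,096-token prompt produces 1.25 GiB. The transfer estimate is a lower bound on wire time only; real numbers depend on the fabric and on how the transfer is scheduled.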
Without a high-bandwidth, low-latency interconnect such as RDMA, this transfer can significantly delay time to first token. To mitigate this, implementations such as Perplexity's pipeline the transfer, letting the decode worker start processing early layers of the cache while later layers are still in transit. This reduces effective transfer latency to under 30 milliseconds in optimal setups.

Disaggregated inference is not universally beneficial. It adds complexity and network overhead that may outweigh the benefits for short prompts, for small clusters with fewer than 16 GPUs, or for workloads with high prefix cache hit rates, where much of the prefill work is already cached and need not be repeated. The technique shines in large-scale, prompt-heavy environments where consistent latency and cost efficiency are critical.

Software support is maturing: frameworks such as vLLM, SGLang, and llm-d now support disaggregated serving natively, enabling independent scaling of the prefill and decode pools. Current deployments still use standard GPU models for both phases, but future silicon may specialize further for each role by removing components that phase does not need.

For enterprise teams managing real-time LLM products, understanding the compute-memory divide is essential. Deciding whether to disaggregate requires evaluating prompt length, network capabilities, and cluster size to confirm that the architecture delivers the intended performance and cost gains.
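The evaluation criteria above can be expressed as a rough screening function. This is a minimal sketch, not a sizing tool: the 16-GPU floor comes from the text above, while the `min_prompt_tokens` and `max_prefix_hit_rate` thresholds, the `has_rdma` flag, and all names are hypothetical placeholders that a team would replace with its own measurements.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    median_prompt_tokens: int
    prefix_cache_hit_rate: float  # fraction in [0, 1]
    cluster_gpus: int
    has_rdma: bool


def disaggregation_concerns(w, min_prompt_tokens=1024,
                            max_prefix_hit_rate=0.5, min_gpus=16):
    """List reasons disaggregation may not pay off for this workload.

    An empty list means none of the screening checks raised a red flag.
    Threshold defaults other than min_gpus are illustrative assumptions.
    """
    concerns = []
    if w.cluster_gpus < min_gpus:
        concerns.append("small cluster")
    if w.median_prompt_tokens < min_prompt_tokens:
        concerns.append("short prompts")
    if w.prefix_cache_hit_rate > max_prefix_hit_rate:
        concerns.append("high prefix cache hit rate")
    if not w.has_rdma:
        concerns.append("no RDMA-class interconnect")
    return concerns


if __name__ == "__main__":
    # Large, prompt-heavy deployment: no concerns raised.
    print(disaggregation_concerns(Workload(8000, 0.1, 64, True)))
    # Small chat-style deployment: every check raises a concern.
    print(disaggregation_concerns(Workload(200, 0.8, 8, False)))
```

Returning a list of concerns rather than a yes/no answer keeps the trade-off visible; a single concern such as a modest cluster size may still be acceptable if prompts are long enough.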
