Aegaeon: 82% GPU Savings, 7 Models per GPU
GPU resource usage reduced by up to 82%, with a single GPU able to support as many as seven models: that is the headline result of Aegaeon, a new multi-model serving system developed by researchers from Peking University and Alibaba. In one experiment, the number of GPUs required to serve ten models dropped from 1,192 to just 213, an 82% reduction in compute resources.

The Aegaeon paper was presented at the 2025 Symposium on Operating Systems Principles (SOSP), one of the top-tier international conferences hosted by the Association for Computing Machinery (ACM). Alibaba Cloud CTO Zhou Jingren is a co-author.

Aegaeon enables dynamic, token-level scaling of models, making efficient GPU pooling possible. By scheduling model requests at the token level, it can allocate and reallocate GPU resources across multiple models in real time, maximizing service quality while minimizing waste. Through component reuse, explicit memory management, and fine-grained key-value (KV) cache synchronization, the system cuts the overhead of automatic scaling by 97%. Experiments show that Aegaeon sustains 2 to 2.5 times higher request arrival rates than existing systems and delivers 1.5 to 9 times higher effective throughput. The system is already in beta deployment within Alibaba Cloud’s Model Studio, supporting ten models in production.

At its core, Aegaeon uses a proxy layer to distribute model requests and synchronizes request metadata across the underlying serving instances via shared memory, providing load balancing and fault tolerance. Once a request is routed, the token-level scheduler determines execution order, allowing multiple models to run efficiently on the same GPU.

A key challenge in multi-model serving is managing the trade-off between token processing time and scaling latency while still meeting strict service-level objectives (SLOs). To address this, the team designed a token-level scheduler that jointly optimizes request processing and scaling decisions. They also decoupled the prefill and decode phases of inference, since the first token and subsequent tokens have very different execution patterns, so that the two phases can be scheduled independently. For the prefill stage, a group first-come-first-served (FCFS) scheduler minimizes the time to first token; for decoding, a separate, optimized pipeline handles ongoing token generation (a simplified sketch of this scheduling policy appears below).

Another major innovation is cost-optimized automatic scaling. Previous systems could not support token-level scaling because operations such as KV cache eviction, memory defragmentation, engine reinitialization, and cache reload could take tens of seconds, making real-time scaling impractical. Aegaeon overcomes this with three key optimizations. First, the team analyzed engine initialization and identified opportunities to reuse components, significantly reducing reinitialization overhead. Second, they implemented explicit memory management for both GPU and host memory, eliminating fragmentation and the need for time-consuming defragmentation. Third, they developed a fine-grained, synchronized KV cache transfer mechanism that overlaps computation with data movement, improving efficiency.
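The paper’s implementation is not reproduced here, but the scheduling idea can be illustrated with a minimal sketch. The class and method names, the round-robin decode policy, and the per-turn token budget below are illustrative assumptions rather than Aegaeon’s actual code; only the group-FCFS prefill idea and the prefill/decode split come from the description above.

```python
from collections import deque, defaultdict
from dataclasses import dataclass

@dataclass
class Request:
    model: str            # target model name
    prompt_tokens: int    # prefill length
    max_new_tokens: int   # decode budget
    generated: int = 0    # tokens produced so far

class TokenLevelScheduler:
    """Sketch of token-level scheduling with decoupled phases: prefill is
    served group-FCFS (all queued requests for the model at the head of the
    queue run together, amortizing the cost of switching that model onto the
    GPU), while decode requests from different models are interleaved a few
    tokens at a time."""

    def __init__(self, decode_quantum: int = 8):
        self.prefill_groups = defaultdict(deque)  # model -> pending prefills
        self.group_order = deque()                # FCFS order of model groups
        self.decode_pool = deque()                # requests awaiting decode
        self.decode_quantum = decode_quantum      # tokens per decode turn

    def submit(self, req: Request) -> None:
        if not self.prefill_groups[req.model]:
            self.group_order.append(req.model)    # first request opens a group
        self.prefill_groups[req.model].append(req)

    def next_prefill_batch(self):
        """Pop the oldest model group and prefill all its requests at once."""
        while self.group_order:
            model = self.group_order.popleft()
            batch = list(self.prefill_groups.pop(model, ()))
            if batch:
                self.decode_pool.extend(batch)    # hand off to the decode phase
                return model, batch
        return None

    def next_decode_slice(self):
        """Round-robin: give the next request a small token budget, then
        requeue it so requests from other models are not starved."""
        while self.decode_pool:
            req = self.decode_pool.popleft()
            budget = min(self.decode_quantum, req.max_new_tokens - req.generated)
            if budget <= 0:
                continue                          # request already finished
            req.generated += budget
            if req.generated < req.max_new_tokens:
                self.decode_pool.append(req)      # not done yet, requeue
            return req.model, req, budget
        return None
```

In the real system the scheduler also folds scaling decisions and KV cache transfer costs into this loop to meet SLOs; the sketch only shows how prefill grouping and decode interleaving let several models share one GPU.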
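The explicit memory management behind the second optimization is described in more detail below; its core can be sketched as a bump-pointer allocator over one contiguous reservation, plus slab-grouped, fixed-size KV cache blocks. `BumpBuffer`, `KVSlabPool`, and all sizes are hypothetical names and numbers chosen for illustration.

```python
class BumpBuffer:
    """Sketch of a self-managed memory region: one contiguous block is
    reserved up front, allocations advance a pointer, and everything is
    released at once by resetting that pointer, so nothing fragments."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.offset = 0  # bump pointer

    def alloc(self, nbytes: int, align: int = 256) -> int:
        start = (self.offset + align - 1) // align * align  # align the pointer
        if start + nbytes > self.capacity:
            raise MemoryError("self-managed buffer exhausted")
        self.offset = start + nbytes
        return start  # caller views the region at this byte offset

    def reset(self) -> None:
        self.offset = 0  # instant release of every allocation


class KVSlabPool:
    """Slab-style KV cache pool: for each KV block shape, fixed-size blocks
    are pre-carved out of the bump buffer and handed out from a free list,
    so runtime allocation never fragments GPU memory."""

    def __init__(self, buffer: BumpBuffer, block_bytes: int, num_blocks: int):
        self.block_bytes = block_bytes
        self.free = [buffer.alloc(block_bytes) for _ in range(num_blocks)]

    def take(self) -> int:
        if not self.free:
            raise MemoryError("slab exhausted; scheduler must evict or scale")
        return self.free.pop()

    def give_back(self, offset: int) -> None:
        self.free.append(offset)  # block returns to the pool intact


# Illustrative setup: reserve ~90% of device memory (the article says ~10% is
# left to the tensor library), then carve slabs for two KV block shapes.
gpu_bytes = 80 * 1024**3                       # e.g. an 80 GB device
buf = BumpBuffer(int(gpu_bytes * 0.9))
slab_small = KVSlabPool(buf, block_bytes=2 * 1024**2, num_blocks=1024)
slab_large = KVSlabPool(buf, block_bytes=8 * 1024**2, num_blocks=512)
blk = slab_small.take()                        # block for a new request's KV cache
slab_small.give_back(blk)                      # returned when the request ends
```

The article also notes that Aegaeon monkey-patches the tensor library’s Python allocation paths so that tensor creation draws from this self-managed buffer rather than the default allocator; that wrapping layer is omitted from the sketch.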
Aegaeon achieves zero memory fragmentation, a critical enabler for GPU pooling, through several architectural innovations. It uses self-managed GPU memory buffers: upon startup, it allocates all required memory for model weights and KV caches in one contiguous block, reserving about 10% for the underlying tensor library. Memory allocation proceeds by pointer incrementation, so everything can be released instantly simply by resetting the pointer. To bypass the standard tensor library allocation routines, which would trigger fragmentation, Aegaeon uses monkey patching to wrap the relevant Python classes with custom allocators backed by its self-managed buffer.

Model loading is accelerated through a shared host-memory “model cache” that stores raw tensor blocks from model checkpoints, and each GPU has a dedicated “staging buffer” for efficient host-to-device transfers. When a model being scaled up is already cached in host memory, Aegaeon loads it via multi-threaded, pipelined, chunked copying, matching the performance of the best existing solutions.

Additionally, Aegaeon introduces unified KV cache management using a slab allocation strategy. Instead of fragmented storage for varying KV cache shapes, it pre-allocates fixed-size blocks per shape, grouped into slabs. This keeps memory utilization high and eliminates fragmentation at runtime, much as a well-organized stationery manager keeps different-sized sticky notes in labeled drawers for instant access.

The broader vision behind Aegaeon is to transform AI model serving from dedicated “private lanes” into a shared “highway.” With thousands of models available on platforms like Hugging Face, serving each with isolated GPUs leads to massive underutilization, and existing pooling methods typically support only two to three models per GPU, far from optimal. By enabling fine-grained, token-level scheduling and full-stack optimizations, Aegaeon allows a single GPU to dynamically serve multiple models efficiently, paving the way for a future in which users can access any AI model on demand without worrying about backend infrastructure complexity.

Reference: https://dl.acm.org/doi/10.1145/3731569.3764815

Editorial & Layout: He Chenlong
