
AI’s Hidden Bottleneck: Why Memory Limits Are Slowing Inference and How WEKA Is Fixing It


The rapid growth of artificial intelligence has brought with it a hidden challenge: the AI memory crisis. While much attention has focused on the soaring costs of training large models, the real bottleneck today may be inference, the process of delivering AI responses to users. According to Liran Zvibel, CEO of WEKA, an AI storage company powering major frontier labs and AI cloud providers, memory limitations are now the primary constraint on AI scalability.

Zvibel explains that while training is compute-intensive, inference is memory-bound. Even the most advanced GPUs, such as Nvidia’s Blackwell Ultra with roughly 300GB of high-speed memory, offer only limited capacity. For models like Meta’s Llama, which can require nearly 500GB of memory per instance, even a modest number of concurrent users can exhaust available resources. A 100,000-token context window, now common in modern AI systems, already demands around 50GB of memory. This quickly leads to what Zvibel calls the “AI memory wall”: a hard limit on how many users a system can serve simultaneously.

The result is sluggish responses, rate limiting, and frustrated users, experiences familiar from platforms like ChatGPT. “We are not only wasting GPUs,” Zvibel said, “we’re delivering poor service to end users.” The inefficiency stems from repurposing training-focused infrastructure for inference, where memory, not raw compute, is the limiting factor.

The situation is expected to worsen with the rise of agentic AI, systems that perform complex, multi-step reasoning. These models will require even longer context windows, more memory for verification, and greater reasoning capacity. Without solutions, the number of AI agents running in parallel could quickly overwhelm existing systems.

Zvibel draws a clear distinction between training and inference economics. Training costs are often seen as justified for breakthroughs, but inference must eventually generate value. “With training, there’s no amount of spend that doesn’t make sense,” he said. “But inference has to correlate with the world’s population (the actual market) and the resources you have.”

Some companies are already making progress. DeepSeek has demonstrated that efficiency gains are possible through memory optimization techniques such as key-value caching and disaggregated prefill. Cohere, a WEKA customer using CoreWeave’s infrastructure, reduced GPU warm-up time from 15 minutes to seconds, cutting time to first token by half and increasing concurrent token throughput four to five times. These improvements are critical: earlier this year, The Information reported that inference consumed nearly 60% of OpenAI’s revenue, underscoring the financial stakes.

Looking ahead, Zvibel believes older GPUs won’t become obsolete. Instead, they can be repurposed for inference tasks that don’t require the latest hardware. The most powerful chips should be reserved for compute-heavy prefill operations, while older models handle decoding, offloading work and extending hardware lifespans.

Ultimately, solving the memory wall isn’t just about performance. It’s about making AI infrastructure more economical, scalable, and sustainable. As Zvibel puts it: “Unlike training, where you need to win on the outcomes, inference must win on economics.” Without efficient memory management, the promise of widespread, affordable AI may remain out of reach.
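
To put the memory figures above in perspective, the sketch below estimates how much GPU memory the key-value (KV) cache alone consumes for a long context. The model shape (layers, KV heads, head dimension) is an illustrative assumption, roughly a 70B-class transformer with grouped-query attention; the article’s ~50GB figure for a 100,000-token window will differ depending on architecture and precision.

```python
# Back-of-envelope estimate of KV-cache memory for long-context inference.
# The default model shape is an assumption (roughly a 70B-class transformer
# with grouped-query attention); real figures vary with architecture,
# attention variant, and numeric precision.

def kv_cache_bytes(tokens: int,
                   layers: int = 80,          # transformer layers (assumed)
                   kv_heads: int = 8,         # key/value heads under GQA (assumed)
                   head_dim: int = 128,       # dimension per head (assumed)
                   bytes_per_value: int = 2   # fp16 / bf16
                   ) -> int:
    """Memory needed to cache keys and values for `tokens` tokens."""
    # Two tensors (K and V) per layer, each of shape tokens x kv_heads x head_dim.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


if __name__ == "__main__":
    context = 100_000  # the 100k-token window cited in the article
    per_user_gb = kv_cache_bytes(context) / 1e9
    print(f"KV cache for one {context:,}-token session: ~{per_user_gb:.0f} GB")

    gpu_memory_gb = 288  # roughly a Blackwell-Ultra-class accelerator
    # Whatever memory remains after loading model weights must be shared by all
    # concurrent users, so a handful of long-context sessions can fill a GPU.
    print(f"Sessions fitting in {gpu_memory_gb} GB (ignoring weights): "
          f"{int(gpu_memory_gb // per_user_gb)}")
```

Even under these fairly optimistic assumptions, a single top-end GPU holds only a handful of long-context sessions before the cache alone exhausts its memory, which is the “memory wall” effect Zvibel describes.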
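
The prefill/decode split Zvibel advocates (newest chips for compute-heavy prefill, older GPUs for memory-bound decoding) can be pictured as a simple routing policy. The pool names, threshold, and request structure below are made-up illustrations, not WEKA’s or any vendor’s actual scheduler.

```python
# Toy sketch of disaggregated prefill/decode scheduling: compute-heavy prefill
# runs on the newest accelerators, memory-bound decode on older ones. All names
# and numbers here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int    # processed once during prefill (compute-bound)
    max_new_tokens: int   # generated step by step during decode (memory-bound)


@dataclass
class GpuPool:
    name: str
    generation: str


PREFILL_POOL = GpuPool("prefill", "blackwell")  # newest chips, compute-heavy work
DECODE_POOL = GpuPool("decode", "hopper")       # older chips, reused for decoding


def route(request: Request) -> list[tuple[str, GpuPool]]:
    """Plan which pool runs each stage of one inference request."""
    # Very short prompts are not worth the cost of handing the KV cache between
    # pools, so both stages stay on the same hardware.
    if request.prompt_tokens < 4_096:
        return [("prefill", PREFILL_POOL), ("decode", PREFILL_POOL)]
    # Long prompts: prefill on the newest chips, then move the KV cache so
    # token-by-token decoding can run on cheaper, older hardware.
    return [("prefill", PREFILL_POOL), ("decode", DECODE_POOL)]


if __name__ == "__main__":
    req = Request(prompt_tokens=100_000, max_new_tokens=2_000)
    for stage, pool in route(req):
        print(f"{stage:>7} -> {pool.generation} pool")
```

The threshold reflects a basic design trade-off: moving the KV cache between pools has a cost, so disaggregation only pays off once prompts are long enough for prefill to dominate.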
