Startup Modular Challenges Nvidia’s CUDA Dominance with Portable AI Software Stack
In Silicon Valley, where bold technical ambitions are common, few challenges are as audacious as trying to break Nvidia's dominance in AI. That is exactly what Modular, a startup founded by software veterans from Apple and Google, is attempting. At the heart of the effort is a direct challenge to CUDA, the Nvidia software platform that has become the de facto standard for AI development.

Chris Lattner, Modular's CEO and co-founder, is no stranger to transformative software. He created Apple's Swift programming language and played a key role in building the software stack behind Google's TPU AI chips. With Tim Davis, his co-founder and president, Lattner is now building a new software layer that could make AI development independent of any single hardware vendor.

CUDA, originally designed to make graphics chips programmable, has evolved into a comprehensive ecosystem of compilers, libraries, and tools. It is now deeply embedded in the AI industry, with most models trained and run on Nvidia GPUs. The result is powerful lock-in: developers tied to CUDA find it hard to switch to alternatives such as AMD's GPUs, Google's TPUs, or Amazon's Trainium chips.

There are plenty of AI chips on the market, but each comes with its own software stack, forcing developers to rewrite code for every new hardware platform. That fragmentation makes it easier to stay with CUDA, even though many developers want the freedom to use different chips without retooling their entire workflow.

Lattner sees this as a major opportunity. "Nobody is building portable stuff because why would anyone work on software for more than one chip when the chip projects themselves are doing the software?" he said. Nvidia, he notes, has no incentive to make CUDA work on rival hardware; doing so would weaken its competitive moat.

Modular's answer starts with Mojo, a new programming language designed to be as easy to use as Python but as fast and powerful as C++. It gives developers fine-grained control over AI hardware while remaining accessible, and it integrates with PyTorch, a widely used AI framework. (A short, illustrative Mojo snippet appears at the end of this article.)

The next layer is MAX, a system for AI inference that runs efficiently on Nvidia, AMD, and Apple GPUs. In September, Modular announced that its software achieved top performance on both Nvidia's Blackwell B200 and AMD's MI355X chips, on the same platform. Most notably, the AMD chips ran 50% faster under Modular's stack than they did with AMD's own software.

A third layer, Mammoth, manages large GPU clusters, making it easier to scale AI workloads. Running a single software stack across different hardware opens the door to real competition. "Can the MI355X compete with Blackwell?" Modular asked in a blog post. Early results suggest yes.

One of Modular's early adopters is Inworld AI, which builds real-time conversational AI for companies like Disney and NBCUniversal. CEO Kylan Gibbs challenged Modular to cut costs by 60% and latency by 40% on new Nvidia hardware; within four weeks, the team delivered. "I've bet with my wallet," Gibbs said, adding that the flexibility to switch hardware later is a major advantage.

Some worry that Nvidia could respond by simply extending CUDA to rival chips. Lattner, however, argues that Modular isn't trying to destroy Nvidia. He compares the project to Android: successful, open, and enabling competition without killing the dominant player.
Just as Android coexists with iOS, Modular aims to make AI hardware more open and diverse, fostering innovation across the industry. “Nvidia doesn’t have to die,” Lattner said. “But we do want more competition. We want more choice. And I think that’s good for the world.”
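
For readers curious what Mojo actually looks like, here is a minimal sketch based on Modular's public documentation. Mojo's syntax is still evolving, so treat it as illustrative rather than definitive: it computes a four-element dot product with Mojo's typed SIMD vectors, a small taste of the low-level hardware control the language exposes beneath its Python-like surface.

    # Illustrative Mojo, per Modular's public docs; the language is evolving.
    # A dot product over fixed-width SIMD vectors shows Mojo's mix of
    # Python-style syntax and explicit, low-level types.
    fn dot4(a: SIMD[DType.float32, 4], b: SIMD[DType.float32, 4]) -> Float32:
        # Multiply the vectors lane by lane, then sum across the lanes.
        return (a * b).reduce_add()

    fn main():
        var a = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)
        var b = SIMD[DType.float32, 4](4.0, 3.0, 2.0, 1.0)
        print(dot4(a, b))  # 1*4 + 2*3 + 3*2 + 4*1 = 20.0

The fn keyword requests Mojo's strict, compiled function semantics, while the indentation-based syntax and print call stay familiar to Python programmers: the "easy as Python, fast as C++" pitch in miniature.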
