Fei-Fei Li: AGI Is Marketing, Spatial Intelligence Is AI's Next Frontier
Li Fei-Fei has once again drawn the AI world's attention with her latest reflections on the future of artificial intelligence, arguing that the much-heralded concept of AGI (Artificial General Intelligence) is more a marketing term than a scientific one. In a recent in-depth interview on Lenny’s Podcast, she made a compelling case for “spatial intelligence” as the true missing piece in today’s AI systems, and she laid out a vision for the next decade of AI development through her new company, World Labs, and its product, Marble.

The conversation, which spanned over an hour, began with a personal and historical journey. Li recalled the early 2010s, when the term “AI” was so controversial in Silicon Valley that many companies avoided it, fearing it would be seen as a buzzword or a dead end. The turning point came in 2012, when Geoffrey Hinton’s team used deep neural networks to win the ImageNet challenge, marking the birth of modern AI. Behind that breakthrough was Li’s own foundational work: the creation of ImageNet, a massive dataset of 15 million labeled images across 22,000 categories, which provided the training data that made large-scale deep learning possible. “Today’s ChatGPT still runs on the same core ingredients: internet-scale data, neural network architectures, and massive GPU power,” she said. “The recipe hasn’t changed much, only the scale.”

Yet as large language models (LLMs) have taken over, Li has turned her focus to a different frontier: the human ability to understand, navigate, and act in a three-dimensional world. “LLMs are like brilliant wordsmiths in the dark,” she said. “They can write, reason, and generate fluent text, but they lack real-world experience. They can’t count the chairs in a room from a video, let alone predict how a ball will roll or how to open a door.” This is where spatial intelligence comes in: the cognitive ability to perceive, reason about, and interact with physical space.
From parking a car to building a house, from catching a ball to discovering the double helix of DNA, all of these require a deep, embodied understanding of space. “Human intelligence isn’t just about language,” she emphasized. “It’s about how things are arranged in space, how they move, how they interact. That’s the foundation of real understanding.” This insight led her to the idea of a “world model”: a generative, multimodal, interactive representation of a 3D world that can be navigated, explored, and acted upon. She began developing the concept in 2022, and in 2024 she co-founded World Labs with Justin Johnson, Christoph Lassner, and Ben Mildenhall.

The company’s first product, Marble, launched in November, is billed as the world’s first generative 3D world model. Unlike video generation tools that output flat, two-dimensional sequences, Marble creates a full, navigable 3D environment: users can move through the world, change camera angles, and export specific views. The technology is not just a novelty. It is already being used in film production, where studios report a 40-fold reduction in the time needed to build virtual sets. Game developers are using it to generate VR and game assets. Psychology researchers are using it to build controlled, immersive environments for studying how people respond to cluttered or clean spaces. And roboticists see it as a way to generate vast amounts of synthetic 3D training data for robots.

But why hasn’t the “bitter lesson” (the observation that general methods scaled with more data and compute tend to outperform hand-engineered approaches) solved robotics? Li explained that while the lesson holds for language, it breaks down in physical systems. “In language, data and model output are perfectly aligned: text in, text out. But in robotics, the data is video, and the output is action in 3D space. The mismatch is huge.
We need models that understand the world, not just describe it.”

She also warned that even with more data and more compute, we are not close to true human-level intelligence. “We can’t even get an AI to derive Newton’s laws from observations of motion, even with all the data of the 20th century. We’re not even close to that.” And emotional intelligence? “A student walks into a professor’s office and talks about passion, struggle, fear. No AI can truly understand that depth.”

Despite this, Li remains optimistic. “AI is the youngest scientific field in human history. We’re still on the surface. No mature science ever said, ‘We’re done.’ We’re just beginning.”

She also challenged the AGI narrative. “AGI is not a scientific term. It’s a marketing term. What does it mean? Can a machine be an economic agent? Can it think like a human? There’s no consensus. I don’t know what it is. But I do know that we need more innovation: not just bigger models, but new ideas.”

Finally, she returned to her core belief: AI must be human-centered. “You don’t need to be a coder to matter in the AI era. Whether you’re a teacher, nurse, farmer, or artist, your role is essential. AI should empower you, not replace you. The most important question is: how do we build technology that respects human dignity, agency, and purpose?”

In the end, Li Fei-Fei’s message is clear: the next frontier of AI isn’t more language, but more world. And the key to building it lies not in data alone, but in understanding the world as humans do: spatially, physically, and meaningfully.
