
Fei-Fei Li and Yann LeCun Both Bet on "World Models" — But Their Visions Differ: From 3D Renderings to AI Cognition

The term "world model" has become a hot label in AI, but it now covers three very different visions, each driven by distinct goals and technologies. At the center of this divergence are Fei-Fei Li, Yann LeCun, and DeepMind, each betting on a different interpretation of what a world model should be.

Fei-Fei Li’s World Labs launched Marble, a tool that turns text, images, or layouts into interactive 3D scenes viewable in a browser. The system uses 3D Gaussian splatting to generate photorealistic, walkable environments. On the surface, this looks like a leap toward a "world model": a digital world that can be explored. In practice, though, Marble is a 3D content creation pipeline. It outputs static assets such as splats and meshes, which are then rendered in game engines like Unity or Three.js. The "world" it models is visual and human-facing, not a thinking system; it is a tool for designers, not a cognitive engine. As one ML engineer put it, it is a "Gaussian Splat model," not a robot’s brain. While Li’s manifesto, From Words to Worlds, talks about embodied agents, commonsense physics, and robots that act in the world, the current version of Marble doesn’t go that far. It is step one, not yet a model of the world in the sense that would support autonomous reasoning.

Yann LeCun, Meta’s former chief AI scientist, is taking a completely different path. His vision of a world model comes from control theory and cognitive science, not 3D graphics. In his 2022 paper, A Path Towards Autonomous Machine Intelligence, he describes a system in which an agent uses a latent internal model to predict future states, not to render pretty images. The model doesn’t need to be visual; it needs to be predictive. It is about learning the dynamics of the world: how things move, and how actions lead to consequences. This is the essence of JEPA (Joint Embedding Predictive Architecture) models, which learn to predict masked or future representations rather than raw pixels.
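To make the JEPA objective concrete, here is a deliberately toy sketch: a linear predictor is trained to map the embedding of a visible context to the embedding of a hidden target, so the loss lives entirely in latent space, never in pixels. The encoders, shapes, and data below are invented for illustration and are not Meta’s architecture.

```python
# Toy JEPA-style objective: predict the *embedding* of a masked target
# from the embedding of the visible context, never raw pixels.
# Encoders, sizes, and data are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LAT = 16, 4                    # input patch size, latent size

# Fixed random projections standing in for learned encoder networks.
enc_ctx = rng.normal(size=(D_IN, D_LAT))
enc_tgt = rng.normal(size=(D_IN, D_LAT))

W = np.zeros((D_LAT, D_LAT))           # linear predictor in latent space

def jepa_loss(W, ctx, tgt):
    z_ctx = ctx @ enc_ctx              # embed the visible context
    z_tgt = tgt @ enc_tgt              # embed the masked/future target
    return np.mean((z_ctx @ W - z_tgt) ** 2)   # error in latent space

# Synthetic data: the target is a slightly noisy view of the context.
ctx = rng.normal(size=(256, D_IN))
tgt = ctx + 0.05 * rng.normal(size=(256, D_IN))

# Train the predictor by gradient descent on the batch-averaged error.
z_ctx, z_tgt = ctx @ enc_ctx, tgt @ enc_tgt
for _ in range(500):
    W -= 0.01 * 2 * z_ctx.T @ (z_ctx @ W - z_tgt) / len(ctx)

print(jepa_loss(W, ctx, tgt))          # latent prediction error after training
```

The point of the exercise is the shape of the loss: both sides of the comparison are representations, so the model is never asked to reconstruct appearance, only to anticipate what the latent state of the hidden region will be.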
LeCun’s new startup, if it materializes, is expected to build on this idea: a system that enables machines to think ahead, plan, and reason, like a brain rather than a game engine. The focus is on internal cognition, not external display.

Then there’s DeepMind’s Genie 3, which sits in the middle. It generates continuous, interactive video at 720p and 24 fps, allowing users to move through a virtual world, trigger events, and watch objects respond over time. It is not a static 3D asset; it is a dynamic, controllable simulation. The world persists across frames, and agents can learn from it. DeepMind positions Genie as a "new frontier for world models": a virtual training ground for AI agents and robots. It is a simulator, not a viewer, designed for training rather than mere display. In function it is closer to LeCun’s vision, supporting agent learning, while using a more visual, real-time output.

So what’s the real difference?

- Marble is a world model as interface: a human-friendly 3D viewer built on Gaussian splatting.
- Genie 3 is a world model as simulator: a real-time, interactive environment for training agents.
- LeCun’s world model is a world model as cognition: a latent, predictive system inside an agent’s mind.

The confusion arises because all three use the same word for very different things. The key is to ask:

- Is this for humans to look at, or for agents to act in?
- Does it output static assets, real-time video, or internal states?
- If you knock over a vase, does the system remember it for more than one frame?

If the answers are "for humans," "static," and "no," it’s a 3D asset tool. If they are "for agents," "real-time," and "yes," it’s a true world model in the cognitive sense. The real race isn’t about who can make the prettiest 3D scene, but about who can build a system that lets machines understand, predict, and act in the world, beyond just generating the next word.
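The "think ahead, plan, and act" loop behind the cognitive reading of a world model can be sketched in a few lines of receding-horizon control. The hand-coded linear dynamics below stand in for a learned latent model; the agent imagines every short action sequence with its model, scores the predicted outcomes against a goal, and only then acts. All names, numbers, and dynamics are illustrative assumptions, not anyone’s actual system.

```python
# Minimal "world model as cognition" sketch: plan by rolling out an
# internal dynamics model, then act on the best imagined future.
# The linear model is a stand-in for a learned latent world model.
import itertools
import numpy as np

A = np.array([[1.0, 0.1],      # position gains 0.1 * velocity per step
              [0.0, 1.0]])
B = np.array([0.0, 0.1])       # an action nudges velocity by 0.1
ACTIONS = (-1.0, 0.0, 1.0)     # brake, coast, accelerate
GOAL = np.array([1.0, 0.0])    # reach position 1 and come to rest

def imagine(state, plan):
    """Roll the model forward; the real environment is never touched."""
    for a in plan:
        state = A @ state + B * a
    return state

def plan_ahead(state, horizon=4):
    """Score every short action sequence by predicted distance to the
    goal and return the first action of the best one."""
    best = min(itertools.product(ACTIONS, repeat=horizon),
               key=lambda p: np.linalg.norm(imagine(state, p) - GOAL))
    return best[0]

state = np.zeros(2)            # start at the origin, at rest
for _ in range(40):            # act, observe, re-plan (receding horizon)
    state = A @ state + B * plan_ahead(state)
print(state)                   # steered close to [1, 0]
```

Note what the loop optimizes: predicted consequences of actions, not rendered frames. Swapping the hand-coded `A` and `B` for a learned latent dynamics model is, in caricature, the research program the article attributes to LeCun.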
