Predictive Sensing: The Key to True Video Intelligence
Long videos are rendering today's large models effectively blind. A new research initiative led by Saining Xie, with guidance from Yann LeCun and Fei-Fei Li, proposes a different vision: spatial supersensing, in which AI learns to predict what comes next instead of relying on brute-force memory.

Last year, Xie's team unveiled Cambrian-1, an open exploration of multimodal vision models. But rather than following the conventional path of releasing Cambrian-2 and then Cambrian-3, the team paused to ask a fundamental question: what does true multimodal intelligence really mean, and is the current large language model paradigm even suitable for perception? As Xie put it in a tweet: "Something essential is missing. You can't build superintelligence before you build supersensing." The point is not better sensors or higher-resolution cameras; it is how a digital agent truly experiences the world, absorbing endless streams of input and learning from them in a meaningful way. As Andrej Karpathy has noted, for real-world AI agents, perception modeling may be the core of intelligence itself.

The team proposed a taxonomy of intelligence levels:

- Level 0: Pure language understanding
- Level 1: Semantic perception (e.g., "describe the image")
- Level 2: Streaming event cognition (the foundation of real-time assistants)
- Level 3: Implicit 3D spatial understanding (treating video as a projection of a 3D world)
- Level 4: Predictive world modeling (reasoning by forecasting possible future states)

Current multimodal large language models (MLLMs) mostly operate at Levels 0–2, with only a few reaching Level 3. Level 4, predictive world modeling, remains almost entirely unexplored.

This is the starting point of the team's new paper, Cambrian-S: Towards Spatial Supersensing in Video, released in November 2025. The paper not only introduces the concept of spatial supersensing but also builds a new benchmark, dataset, and model to test a critical claim: today's MLLMs systematically fail at genuine spatial perception tasks.

The team first conducted a systematic review of existing video understanding benchmarks. Most tests focus on the early levels, such as object recognition and short-term event description, while few assess actual spatial reasoning or world modeling. Even benchmarks labeled "spatial reasoning" often admit text shortcuts. For example, a VideoMME question about the Moon colliding with Earth requires only physics knowledge, not visual-spatial understanding, and another query about an astronaut's gear tests memory of NASA facts, not spatial awareness.

To close this gap, the team created VSI-SUPER, a benchmark for visual-spatial intelligence. It includes two subtasks, VSR (Visual-Spatial Recall) and VSC (Visual-Spatial Counting), both built on videos up to several hours long. A model must not only "see" but also "remember" and "understand" how objects change in space over time.

The results were striking. Commercial systems such as Gemini-Live and GPT-Realtime scored below 15% mean relative accuracy (MRA; a minimal sketch of this metric follows below) on 10-minute videos, and by 120 minutes their performance dropped to near zero. Despite their "long context" claims, these models fail at sustained spatial tracking.

The root issue, the team argues, is that MLLMs are stuck in the first three levels. The real breakthrough lies in Level 4: predictive world modeling.
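For readers unfamiliar with the metric, mean relative accuracy scores a numerical prediction (a count, a distance) against ground truth across a sweep of tolerance thresholds. The snippet below is a minimal sketch assuming the multi-threshold formulation popularized by VSI-Bench; the exact threshold sweep used for VSI-SUPER is an assumption for illustration, not a detail taken from the paper.

```python
def mean_relative_accuracy(pred: float, target: float) -> float:
    """Score a numerical prediction against ground truth.

    Assumed VSI-Bench-style definition: the prediction counts as correct
    at confidence threshold theta if its relative error stays below
    1 - theta; MRA averages this over a sweep of thresholds.
    """
    thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95 (assumed sweep)
    rel_error = abs(pred - target) / max(abs(target), 1e-8)
    return sum(rel_error < (1.0 - t) for t in thresholds) / len(thresholds)

# Example: predicting 9 objects when the ground truth is 10 gives a
# relative error of 0.1, which passes the thresholds below 0.90.
print(mean_relative_accuracy(9, 10))  # 0.8
```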
To address this, the team built VSI-590K, a large-scale video instruction-tuning dataset with approximately 590,000 samples. Data sources include high-quality human-annotated real videos, synthetic simulations, and automatically generated pseudo-labels from web videos. A full pipeline used GroundingDINO for object detection, SAM2 for segmentation, and VGGT for 3D point cloud estimation to generate geometry-aware question-answer pairs.

On this data, the team trained the Cambrian-S model family, ranging from 0.5B to 7B parameters. Training followed four stages: visual-language alignment, image instruction tuning, general video instruction tuning, and spatial video instruction tuning.

Results showed strong performance. Cambrian-S-7B achieved 67.5% accuracy on VSI-Bench, outperforming open-source baselines such as InternVL3.5-8B and Qwen2.5-VL-7B and beating the commercial Gemini-2.5-Pro by over 16 percentage points, while holding its own on general video benchmarks such as Perception Test and EgoSchema. However, even Cambrian-S struggled once video length exceeded 60 minutes, evidence that scaling data and model size alone cannot overcome the fundamental limitations of the current paradigm.

The proposed solution is a paradigm shift: predictive sensing. The inspiration is human cognition: the brain does not passively record everything; it constantly predicts what comes next and focuses on what is surprising. The team implemented this in Cambrian-S with a latent frame prediction head, a two-layer MLP that predicts the next video frame's latent representation alongside the usual next-token prediction (a minimal sketch of this head and its losses appears at the end of this article). During training, the head is optimized with mean squared error and cosine similarity between predicted and actual features. During inference, the prediction error becomes a "surprise score": low-surprise frames, those the model can predict, are compressed and stored in long-term memory, while high-surprise frames, which indicate meaningful change, are preserved in detail. This mechanism allows the model to manage unbounded video streams with finite memory.

For the VSC task, the model uses surprise-driven event segmentation: it accumulates frames in a buffer, generates a summary when a high-surprise event occurs, and resets the buffer (see the second sketch at the end of this article). This yields a natural, semantic segmentation of continuous visual input.

Experiments confirmed the method's effectiveness. On VSR, Cambrian-S maintained stable accuracy across video lengths with constant GPU memory usage, outperforming both Gemini 1.5 Flash and Gemini 2.5 Flash and avoiding the performance collapse seen in context-expansion-only models. On 120-minute videos, it achieved around 28% accuracy on VSC, far above the near-zero results of the commercial models.

Still, the team acknowledges this is only a first step. The VSI-SUPER benchmark, the VSI-590K dataset, and the Cambrian-S model are initial explorations: the scope is limited, data diversity needs expansion, and generalization remains a challenge. Future work will explore more diverse, embodied scenarios and integrate with advances in vision, language, and world modeling.

The key insight remains: long context is not enough. What is needed is a smart memory system that knows what to record, what to compress, and what to forget. The future of video understanding lies not in bigger memories, but in smarter prediction.

The paper, code, model weights, and datasets are all publicly available on GitHub and Hugging Face.
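To make the prediction mechanism concrete, here is a minimal PyTorch sketch of a latent frame prediction head as described above: a two-layer MLP that maps the current frame's latent feature to a prediction of the next frame's feature, trained with mean squared error plus a cosine-similarity term. The dimensions, activation, and loss weighting are illustrative assumptions, not the released Cambrian-S implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentFramePredictionHead(nn.Module):
    """Two-layer MLP that predicts the next frame's latent representation."""

    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, frame_feat: torch.Tensor) -> torch.Tensor:
        # frame_feat: (batch, dim) latent feature of the current frame
        return self.mlp(frame_feat)

def prediction_loss(pred: torch.Tensor, target: torch.Tensor, cos_weight: float = 1.0):
    """MSE plus (1 - cosine similarity) between predicted and actual features.

    The equal weighting is an assumption; the paper only states that both
    terms are used alongside the usual next-token objective.
    """
    mse = F.mse_loss(pred, target)
    cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    return mse + cos_weight * cos

def surprise_score(pred: torch.Tensor, actual: torch.Tensor) -> torch.Tensor:
    # At inference time the same error signal doubles as a per-frame
    # "surprise score": well-predicted frames score low, changes score high.
    return F.mse_loss(pred, actual, reduction="none").mean(dim=-1)
```

Because the same error signal is available at inference time, no extra supervision is needed to obtain the surprise score that the memory system consumes.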
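And here is a sketch of how such a surprise score could gate memory and drive event segmentation for a task like VSC: predictable frames are compressed into long-term memory, while a surprise spike closes the current segment, triggers a summary, and resets the buffer. The threshold, the `compress` and `summarize` callables, and the buffer size are hypothetical placeholders rather than the paper's actual mechanism.

```python
from collections import deque

class SurpriseDrivenMemory:
    """Sketch of surprise-gated memory for an unbounded video stream."""

    def __init__(self, threshold: float = 0.5, buffer_size: int = 256):
        self.threshold = threshold
        self.buffer = deque(maxlen=buffer_size)  # bounded working memory
        self.long_term = []                      # compressed long-term memory
        self.segment_summaries = []              # per-event summaries (e.g., counts)

    def step(self, frame_feat, surprise: float, compress, summarize):
        if surprise < self.threshold:
            # Predictable frame: keep only a compressed trace.
            self.long_term.append(compress(frame_feat))
            self.buffer.append(frame_feat)
        else:
            # Surprising frame: the scene changed, so close the current
            # segment, summarize it, and start a new buffer.
            if self.buffer:
                self.segment_summaries.append(summarize(list(self.buffer)))
            self.buffer.clear()
            self.long_term.append(frame_feat)    # keep surprising frames in detail
            self.buffer.append(frame_feat)

# Usage sketch: the total count is aggregated over per-segment summaries,
# so memory stays bounded no matter how long the stream runs.
mem = SurpriseDrivenMemory(threshold=0.5)
for feat, s in [([0.1, 0.2], 0.1), ([0.1, 0.2], 0.05), ([0.9, 0.9], 0.8)]:
    mem.step(feat, surprise=s,
             compress=lambda f: f[:1],
             summarize=lambda buf: {"frames": len(buf)})
print(mem.segment_summaries)  # [{'frames': 2}] once the surprise spike closes the first segment
```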
