DGGT: 4D Autonomous Driving Scene Reconstruction in 0.4 Seconds
Researchers from the Institute for AI Research (AIR) at Tsinghua University, led by Assistant Professor Hao Zhao, have unveiled DGGT (Driving Gaussian Grounded Transformer), the first feedforward, pose-free 4D reconstruction framework designed for large-scale dynamic driving scenes. Developed in collaboration with Xiaomi Auto and other institutions, DGGT enables high-speed, scalable 3D scene reconstruction directly from sparse, uncalibrated images, with no need for camera calibration or per-scene optimization and no restriction to short temporal windows.

Traditional 3D reconstruction methods rely heavily on precise camera poses and iterative optimization, which limits their speed and scalability. DGGT breaks this paradigm by predicting camera poses as part of its output, enabling single-pass inference that reconstructs dynamic scenes in under 0.4 seconds. The framework simultaneously generates camera trajectories, depth maps, dynamic instance segmentation, and an editable 3D Gaussian scene representation, delivering a complete 4D asset in one forward pass.

Trained exclusively on the Waymo dataset, DGGT demonstrates strong zero-shot generalization on unseen benchmarks such as nuScenes and Argoverse 2, outperforming state-of-the-art methods such as STORM by over 50% on key perception metrics. It achieves a 3D endpoint error (EPE3D) of 0.183 meters, among the best reported results, while maintaining high visual fidelity and temporal consistency.

At the core of DGGT is a Vision Transformer (ViT) encoder fused with DINO priors, sharing features through alternating attention. Multiple parallel prediction heads jointly estimate camera poses, depth, dynamic segmentation, motion vectors, sky regions, and lifespan-aware appearance evolution. A single-step diffusion refinement module further enhances spatial and temporal coherence, suppressing motion-interpolation artifacts and improving rendering realism.
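The EPE3D figure cited above is a standard scene-flow metric: the mean Euclidean distance between predicted and ground-truth 3D flow endpoints. A minimal NumPy sketch of the metric (the function name and array shapes are illustrative, not taken from the DGGT codebase):

```python
import numpy as np

def epe3d(pred_flow: np.ndarray, gt_flow: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth
    per-point 3D scene-flow vectors, both shaped (N, 3), in meters."""
    return float(np.linalg.norm(pred_flow - gt_flow, axis=-1).mean())

# Example: every prediction is off by 0.1 m along x.
gt = np.zeros((4, 3))
pred = gt + np.array([0.1, 0.0, 0.0])
print(epe3d(pred, gt))  # ≈ 0.1 (meters)
```

A lower value means predicted 3D motion endpoints land closer to the ground truth; DGGT's reported 0.183 m is averaged over a full benchmark, not a toy array like this one.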
One key innovation is the Lifespan Head, which models gradual appearance changes in static regions over time, such as shifting shadows, lighting variations, and reflections. Ablation studies show that removing this component reduces PSNR by 3.2 dB, confirming its critical role in preserving long-term consistency and realism. The Motion Head establishes dense pixel-level 3D correspondences across frames, enabling reliable tracking of dynamic objects; visualizations show accurate alignment of corresponding points in adjacent frames, minimizing ghosting and motion blur during interpolation.

DGGT also supports real-time, instance-level scene editing directly within the 3D Gaussian representation. Users can add, remove, or move vehicles and pedestrians, with the diffusion refinement module automatically filling holes and smoothing boundaries. This turns DGGT into a powerful tool for generating synthetic, editable 4D environments, well suited to autonomous driving simulation, data augmentation, and benchmarking.

The framework scales efficiently with input complexity: increasing the number of input views from 4 to 16 maintains stable reconstruction and novel-view-synthesis performance, whereas competing methods degrade as inputs grow. This makes DGGT well suited to processing large-scale autonomous driving logs.

The project is open-sourced at https://github.com/xiaomi-research/dggt, with a dedicated project page at https://xiaomi-research.github.io/dggt/. The research team includes Chen Xiaoxue, Xiong Ziyi, Chen Yuantuo, Li Gen, Wang Nan, Luo Hongcheng, Chen Long, Sun Haiyang, Wang Bing, Chen Guang, Ye Hangjun, Li Hongyang, Zhang Yaqin, and Zhao Hao.
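Conceptually, instance-level editing of a Gaussian scene amounts to filtering or translating the subset of Gaussians tagged with a given instance id. A toy NumPy sketch under an assumed data layout (names and fields are hypothetical; the real DGGT representation also carries covariances, colors, opacities, and lifespans):

```python
import numpy as np

# Toy scene: a 3D mean and an instance id per Gaussian.
# (Layout is illustrative, not the actual DGGT data structure.)
means = np.array([[0.0, 0.0, 0.0],    # background, instance 0
                  [5.0, 1.0, 0.0],    # vehicle, instance 7
                  [5.2, 1.1, 0.0]])   # vehicle, instance 7
instance_ids = np.array([0, 7, 7])

def remove_instance(means, ids, target):
    """Drop every Gaussian belonging to one dynamic instance."""
    keep = ids != target
    return means[keep], ids[keep]

def translate_instance(means, ids, target, offset):
    """Rigidly shift one instance's Gaussians by a 3D offset."""
    out = means.copy()
    out[ids == target] += np.asarray(offset, dtype=out.dtype)
    return out

bg_means, bg_ids = remove_instance(means, instance_ids, 7)        # vehicle deleted
moved = translate_instance(means, instance_ids, 7, [2.0, 0.0, 0.0])  # vehicle shifted
```

In the full pipeline, the diffusion refinement step would then inpaint the regions that removed or moved Gaussians leave behind, which is what makes such edits look seamless when rendered.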
In parallel, AIR’s Executive Director Yang Liu and collaborators from Tsinghua University and Fudan University introduced EscapeCraft, a 3D immersive escape-room environment designed to rigorously evaluate multimodal large models’ reasoning and decision-making in complex visual tasks. Evaluations revealed persistent shortcomings: models often failed to connect perception with action, seeing a door yet walking around walls, picking up keys without knowing how to use them, or attempting to “grab” a sofa on the assumption that it might hide a secret compartment. Even GPT-4o succeeded on only a limited subset of subtasks, and many of its “correct” behaviors stemmed from luck rather than genuine understanding.

Together, these advances mark a step toward AI systems that not only perceive scenes but genuinely comprehend and reason within them, bringing autonomous systems closer to human-level situational awareness.
