Tencent's HunyuanWorld-Voyager: A Breakthrough in Interactive 3D Video Generation with Camera-Controlled World Exploration and Real-Time Reconstruction
We introduce HunyuanWorld-Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image along user-defined camera trajectories. Voyager enables interactive 3D scene exploration by producing temporally and spatially coherent video sequences, while also generating aligned RGB and depth video streams for efficient, direct 3D reconstruction. The model is camera-controllable: users define custom camera paths to navigate and explore virtual scenes, and the synchronized RGB and depth outputs can be used for real-time 3D reconstruction and immersive world exploration.

HunyuanWorld-Voyager consists of two core components:

1. World-Consistent Video Diffusion: a unified architecture that jointly generates aligned RGB and depth video sequences conditioned on prior world observations, ensuring global consistency across frames.
2. Long-Range World Exploration: an efficient world cache with point culling, combined with auto-regressive inference and smooth video sampling, enabling iterative scene extension while maintaining context-aware consistency (an illustrative sketch of this loop appears at the end of this article).

To train the model, the team developed a scalable data engine: a video reconstruction pipeline that automates camera pose estimation and metric depth prediction for arbitrary videos. This eliminates the need for manual 3D annotations and allows large-scale data curation. Using this pipeline, the team compiled a dataset of over 100,000 video clips combining real-world footage and synthetic renders from Unreal Engine.

On the WorldScore benchmark, HunyuanWorld-Voyager achieves top performance across multiple metrics, outperforming existing models on world consistency, camera control, object control, content alignment, 3D consistency, photometric consistency, style consistency, and subjective quality.

Running Voyager with a batch size of 1 requires the following hardware:

- Resolution: 540p
- GPU peak memory: 60GB

The model is compatible with CUDA 12.4 or 11.8, and installation instructions are provided for Linux systems. Users may encounter floating-point exceptions on certain GPU types; workarounds are provided in the documentation. Pretrained models can be downloaded following the provided guide.

Inference can run on a single GPU or on multiple GPUs via xDiT, a scalable inference engine for diffusion transformers. Parallel inference across 8 GPUs achieves a 6.69x speedup compared to single-GPU inference on H20 hardware.

A Gradio demo provides easy access: users upload an image, select a camera path, enter a text prompt, and generate the final RGB-D video. The data engine used to train Voyager is also released, enabling others to generate large-scale RGB-D video training data without manual 3D labeling.

For academic use, the model can be cited using the following BibTeX entry:

    @article{huang2025voyager,
      title={Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation},
      author={Huang, Tianyu and Zheng, Wangguandong and Wang, Tengfei and Liu, Yuhao and Wang, Zhenwei and Wu, Junta and Jiang, Jie and Li, Hui and Lau, Rynson WH and Zuo, Wangmeng and Guo, Chunchao},
      journal={arXiv preprint arXiv:2506.04225},
      year={2025}
    }

The project acknowledges contributions from HunyuanWorld, Hunyuan3D-2, HunyuanVideo-I2V, VGGT, MoGE, and Metric3D for their open-source research and technical advancements.
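To make the aligned RGB-D output described above concrete: because the depth stream is metric and synchronized with the RGB frames, each frame can be back-projected into a colored point cloud with standard pinhole geometry. The following is a minimal sketch assuming known per-frame intrinsics and camera-to-world poses; the function name, resolution, and intrinsics are illustrative placeholders, not Voyager's actual reconstruction code.

```python
import numpy as np

def rgbd_to_point_cloud(rgb, depth, K, cam_to_world):
    """Back-project one aligned RGB-D frame into a world-space colored point cloud.

    rgb:          (H, W, 3) color frame
    depth:        (H, W) metric depth in meters
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera pose
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))    # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # homogeneous pixel coordinates
    rays = pix @ np.linalg.inv(K).T                   # camera-space rays (z = 1)
    pts_cam = rays * depth[..., None]                 # scale rays by metric depth
    pts_h = np.concatenate([pts_cam, np.ones((h, w, 1))], axis=-1)
    pts_world = (pts_h @ cam_to_world.T)[..., :3]     # camera -> world coordinates
    valid = depth > 0                                 # drop missing/invalid depth
    return pts_world[valid], rgb[valid]

# Toy usage with random data standing in for one generated 540p RGB-D frame.
H, W = 540, 960
K = np.array([[500.0, 0.0, W / 2], [0.0, 500.0, H / 2], [0.0, 0.0, 1.0]])
points, colors = rgbd_to_point_cloud(
    rgb=np.random.randint(0, 256, (H, W, 3), dtype=np.uint8),
    depth=np.random.uniform(0.5, 10.0, (H, W)),
    K=K,
    cam_to_world=np.eye(4),
)
print(points.shape, colors.shape)   # (N, 3), (N, 3)
```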
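The long-range exploration component can likewise be pictured as a loop: generate one clip at a time along the camera path, condition each clip on the cached geometry relevant to the upcoming segment (with distant points culled), and fuse the newly generated RGB-D frames back into the world cache. The sketch below is hypothetical: `generate_rgbd_clip`, `unproject`, the distance-based culling rule, and the tiny stub resolution are illustrative stand-ins, not the actual Voyager API.

```python
import numpy as np

def generate_rgbd_clip(condition, camera_path):
    """Stub for the video diffusion model: returns random RGB and depth frames
    at a tiny resolution so the sketch runs quickly."""
    n, h, w = len(camera_path), 64, 112
    return np.random.rand(n, h, w, 3), np.random.uniform(0.5, 10.0, (n, h, w))

def unproject(rgb, depth, pose):
    """Stub: lift an RGB-D frame to world-space points (see the sketch above)."""
    return np.random.rand(depth.size, 3), rgb.reshape(-1, 3)

def cull(points, colors, pose, max_dist=20.0):
    """Point culling: keep only cached points near the upcoming camera position."""
    keep = np.linalg.norm(points - pose[:3, 3], axis=1) < max_dist
    return points[keep], colors[keep]

def explore(camera_path, clip_len=16):
    """Auto-regressively extend the scene along a long camera path."""
    cache_pts = np.empty((0, 3))
    cache_rgb = np.empty((0, 3))
    for start in range(0, len(camera_path), clip_len):
        segment = camera_path[start:start + clip_len]
        # Condition the next clip on the culled world cache for consistency.
        condition = cull(cache_pts, cache_rgb, segment[0])
        rgb_clip, depth_clip = generate_rgbd_clip(condition, segment)
        # Fuse the newly generated geometry back into the cache.
        for frame_rgb, frame_depth, pose in zip(rgb_clip, depth_clip, segment):
            new_pts, new_rgb = unproject(frame_rgb, frame_depth, pose)
            cache_pts = np.concatenate([cache_pts, new_pts])
            cache_rgb = np.concatenate([cache_rgb, new_rgb])
    return cache_pts, cache_rgb

# Toy usage: 32 identity poses standing in for a user-defined trajectory.
points, colors = explore([np.eye(4) for _ in range(32)])
print(points.shape)   # the cache grows with every generated clip
```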
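Finally, since the release includes a Gradio demo wired to exactly those inputs (an image, a camera path, a text prompt) and an RGB-D video output, a front-end for such a workflow could look roughly like the minimal sketch below. This is not the project's actual demo app: `run_voyager` is a stub and the camera-path choices are illustrative assumptions.

```python
import gradio as gr

def run_voyager(image_path, camera_path, prompt):
    # In a real demo this would call the Voyager pipeline; here we just
    # echo the inputs so the interface is runnable on its own.
    print(f"image={image_path}, path={camera_path}, prompt={prompt!r}")
    return None  # a gr.Video output accepts a video file path (or None)

demo = gr.Interface(
    fn=run_voyager,
    inputs=[
        gr.Image(type="filepath", label="Input image"),
        gr.Dropdown(
            ["forward", "backward", "left", "right", "turn_left", "turn_right"],
            label="Camera path",
        ),
        gr.Textbox(label="Text prompt"),
    ],
    outputs=gr.Video(label="Generated RGB-D video"),
    title="HunyuanWorld-Voyager demo (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```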