Command Palette
Search for a command to run...
HY-World 2.0: Ein Multi-Modal World Model zur Rekonstruktion, Generierung und Simulation von 3D-Welten
HY-World 2.0: Ein Multi-Modal World Model zur Rekonstruktion, Generierung und Simulation von 3D-Welten
Zusammenfassung
Da Sie mich angewiesen haben, die Übersetzung in deutscher Sprache zu verfassen (obwohl der Ausgangstext Englisch ist und die Zielsetzung eigentlich eine chinesische Übersetzung war), werde ich den Text gemäß Ihren hohen professionellen Standards für den Tech-Sektor ins Deutsche übertragen.Hier ist die professionelle Übersetzung:Wir stellen HY-World 2.0 vor, ein multimodales World-Model-Framework, das unser vorangegangenes Projekt HY-World 1.0 entscheidend weiterentwickelt. HY-World 2.0 unterstützt diverse Input-Modalitäten, einschließlich Text-prompts, Single-View-Images, Multi-View-Images und Videos, und generiert daraus 3D-Weltrepräsentationen. Bei Verwendung von Text oder Single-View-Image-Inputs führt das Modell eine World-Generation durch und synthetisiert hochpräzise, navigierbare 3D Gaussian Splatting (3DGS)-Szenen. Dies wird durch ein vierstufiges Verfahren erreicht: a) Panorama-Generierung mit HY-Pano 2.0, b) Trajektorienplanung mit WorldNav, c) Welt-Expansion mit WorldStereo 2.0 und d) Welt-Komposition mit WorldMirror 2.0.Insbesondere führen wir maßgebliche Innovationen ein, um die Panorama-Fidelity zu erhöhen, 3D-Szenenverständnis sowie Planung zu ermöglichen und WorldStereo – unser auf Keyframes basierendes View-Generation-Modell mit konsistentem Memory – aufzuwerten. Zudem optimieren wir WorldMirror, ein Feed-Forward-Modell zur universellen 3D-Prädiktion, durch eine Verfeinerung der Modellarchitektur und der Lernstrategie, was die Welt-Rekonstruktion aus Multi-View-Images oder Videos ermöglicht. Darüber hinaus stellen wir WorldLens vor, eine leistungsstarke 3DGS-Rendering-Plattform, die sich durch eine flexible, engine-agnostische Architektur, automatisches IBL-Lighting, effiziente Kollisionserkennung und ein Training-Rendering-Co-Design auszeichnet; dies ermöglicht die interaktive Exploration von 3D-Welten inklusive Character-Support.Umfangreiche Experimente belegen, dass HY-World 2.0 auf mehreren Benchmarks die State-of-the-Art-Performance unter den Open-Source-Ansätzen erreicht und Ergebnisse liefert, die mit dem Closed-Source-Modell Marble vergleichbar sind. Um die Reproduzierbarkeit zu gewährleisten und die weitere Forschung an 3D-World-Models zu unterstützen, veröffentlichen wir alle Model-Weights, den Code sowie die technischen Details.
One-sentence Summary
Tencent Hunyuan introduces HY-World 2.0, a multi-modal world model framework that reconstructs and generates navigable 3D Gaussian Splatting scenes from text, single-view images, multi-view images, or videos through a four-stage method involving HY-Pano 2.0, WorldNav, WorldStereo 2.0, and WorldMirror 2.0, while utilizing the WorldLens rendering platform for interactive 3D simulation.
Key Contributions
- The paper introduces HY-World 2.0, a multi-modal framework that transforms diverse inputs like text, single-view images, and videos into navigable 3D Gaussian Splatting scenes through a four-stage pipeline of panorama generation, trajectory planning, world expansion, and world composition.
- The framework incorporates several specialized components, including HY-Pano 2.0 for high-fidelity panorama synthesis, WorldNav for semantic-aware trajectory planning, WorldStereo 2.0 for memory-driven view generation, and WorldMirror 2.0 for universal 3D reconstruction.
- The researchers developed WorldLens, a high-performance 3DGS rendering platform featuring an engine-agnostic architecture and automatic lighting to enable interactive 3D world exploration with character support.
Introduction
3D world models are essential for advancing spatial intelligence in robotics, virtual reality, and game development by enabling agents to simulate and interact with complex environments. However, current research is often split between generative models that create immersive scenes from sparse inputs but lack geometric accuracy, and reconstruction methods that recover precise structures from dense data but cannot hallucinate unseen regions. The authors leverage a unified multi-modal framework called HY-World 2.0 to bridge this gap, providing a single system capable of both high-fidelity world generation from text or single images and accurate world reconstruction from multi-view videos. Their contribution includes a novel four-stage pipeline consisting of panorama generation, trajectory planning, memory-driven world expansion, and world composition to produce navigable 3D Gaussian Splatting scenes.
Dataset

The authors construct a large-scale hybrid dataset for HY-World 2.0 by scaling up the original HY-World 1.0 framework. The dataset composition and processing details are as follows:
- Dataset Composition and Sources: The dataset integrates two primary sources to enhance semantic diversity and bridge the domain gap between real and synthetic environments.
- Real-world captures: A massive collection of high-resolution panoramas used to provide authentic lighting, complex textures, and natural structural priors.
- Synthetic assets: Large-scale environments rendered via high-end engines like Unreal Engine (UE) to provide diverse scene configurations and pixel-accurate ground-truth geometry.
- Data Filtering and Quality Control: The authors implement a rigorous filtering stage to remove low-quality samples, specifically targeting images with visible stitching artifacts or exposed capturing equipment such as panoramic cameras.
- Pseudo-Label Enhancement Strategy: To improve real-world data, the authors utilize a normal-only pseudo-labeling approach rather than monocular depth estimation. Because independent depth predictions often cause multi-view geometric inconsistencies, the authors use a monocular normal estimation teacher model to predict dense surface normals.
- Training and Supervision: These predicted normals serve as pseudo-supervision targets in two ways. They are used directly for the normal head via an angular loss and indirectly for the depth head through a specialized depth-to-normal loss.
Method
The HY-World 2.0 framework operates as a four-stage pipeline, integrating multi-modal inputs into immersive, navigable 3D worlds. The process begins with Panorama Generation, where arbitrary text or image inputs are translated into a high-fidelity 360° world initialization. This stage leverages a Multi-Modal Diffusion Transformer (MMDiT) to perform an implicit, adaptive mapping from perspective inputs to equirectangular projection (ERP) space. The MMDiT processes both the conditional input and the target panoramic output within a unified latent space, concatenating their respective latent sequences. This self-attention mechanism enables the network to autonomously learn the underlying perspective-to-ERP transformation without requiring explicit camera priors, allowing it to hallucinate missing details and maintain global structural coherence. To address the discontinuity at the left and right edges of the ERP, a combined refinement strategy is employed. At the latent level, circular padding enforces periodic boundary conditions during the denoising process, while in the pixel space, a linear blending strategy smooths the 360^\\circ wrap-around transition, ensuring a perfectly seamless and structurally coherent panoramic output.

Following the generation of a high-fidelity panorama, the system proceeds to Trajectory Planning. This stage, implemented by the WorldNav strategy, aims to derive exploration trajectories that maximize navigable space coverage. The process begins with a comprehensive geometric and semantic scene parsing of the panorama. This involves constructing a global panoramic point cloud mathbfPpan by aligning monocular depth maps from subdivided perspective views using an optimized LSMR solver. A hybrid filtering strategy, combining vision-language grounding and depth discontinuity removal, is applied to enhance geometric quality. Concurrently, semantic parsing identifies key landmarks and obstacles, with their 3D centroids localized to create 3D masks. A Navigation Mesh (NavMesh) is also constructed using Recast Navigation to define traversable regions. This NavMesh is refined through geometric corrections, such as snapping vertices to the ground and eroding boundaries, to ensure physically plausible camera movement. Based on the parsed scene, WorldNav designs five heuristic trajectory modes: Regular, Surrounding, Reconstruct-Aware, Wandering, and Aerial. Regular trajectories provide a general orbital expansion from the origin. Surrounding trajectories circle significant objects at an adaptive radius to ensure foreground quality. Reconstruct-Aware trajectories are designed to target under-observed regions, identified by degenerate mesh faces, to mitigate gaps in the subsequent 3D reconstruction. Wandering trajectories simulate autonomous agent exploration to reach environmental boundaries, while Aerial trajectories are an auxiliary mode to eliminate blind viewpoints by applying an upward pitch to other trajectories.

The third stage, World Expansion, is handled by WorldStereo 2.0, which synthesizes extensive novel views for world expansion. This stage is structured in three progressive training stages to enable camera control, memory-based consistency, and fast inference. The core of WorldStereo 2.0 is its keyframe latent space, which rethinks the limitations of standard Video-VAEs. Instead of spatio-temporal compression, it uses a Keyframe-VAE that applies spatial-only compression to each keyframe independently. This approach preserves high-frequency details and reduces artifacts caused by motion blur and geometric distortion, which are common in standard VAEs. Explicit camera control is achieved by integrating a lightweight transformer-based camera adapter with the pre-trained video DiT. This adapter uses both camera Plücker rays and point clouds as complementary guidance. The point cloud mathbfPref from the reference view is warped into each target view to obtain mathbfPitar, which is then rendered and encoded into latent features to guide the generation.
![Keyframe-VAE in WorldStereo 2.0 versus a standard Video-VAE [64]. Unlike (a) Video-VAE, which performs spatio-temporal compression, (b) Keyframe-VAE applies spatial-only compression to better preserve high-frequency details and reduce artifacts essentially caused by Video-VAE encoding (e.g., motion blur and geometric distortion).](https://api-rsrc.hyper.ai/paper2blog/991ed478-56a8-4c73-ae44-be1bc7697765/2604.14268/cache/tex_resource/monkeyocr/images/0205ffa4833062126c52ee183f527f4db88fec7397bd0289dd6a19e1a22f3c9c.jpg)
To ensure frame consistency across diverse trajectories, WorldStereo 2.0 incorporates two complementary memory modules. The Global-Geometric Memory (GGM) maintains globally consistent coarse scene structure. It is fine-tuned using videos rendered by extended global point clouds mathbfPglo, which include not only the reference points but also additional points sampled from novel views, thereby forcing the model to adhere more strictly to the 3D representations. The Spatial-Stereo Memory (SSM++) reinforces local correspondence and fine-grained details. It is advanced by directly integrating retrieved keyframes into the main DiT branch, modifying the Rotary Positional Embedding (RoPE) to accommodate this integration, and employing a selective retrieval strategy to reduce computational overhead. During mid-training, the model is trained with full self-attention, allowing it to learn global context. For inference, perspective views from the input panorama serve as the initial memory bank, which is incrementally updated with generated keyframes.

The final stage, World Composition, integrates the generated data into a unified 3D representation. This process starts with the reconstruction of globally aligned point clouds and depth maps using the upgraded WorldMirror 2.0 model. WorldMirror 2.0 is a unified feed-forward model that takes multi-view images and optional geometric priors as input and predicts various 3D attributes in a single forward pass. It introduces several key improvements over its predecessor: normalized position encoding for flexible resolution inference, explicit normal-based supervision for depth via a depth-to-normal loss, and a dedicated depth mask prediction head for robust handling of invalid pixels. The depth maps estimated by WorldMirror 2.0 are then aligned with the initial panoramic point cloud mathbfPpan to resolve scale ambiguity. This alignment is achieved through a RANSAC-based linear transformation, which is refined using an outlier detection and revision strategy based on the global distribution of alignment coefficients.

Finally, the expanded point cloud is used to initialize a 3D Gaussian Splatting (3DGS) model. The optimization process employs a dual strategy to resolve the dilemma between rendering efficiency and detail preservation. It applies a standard growth strategy exclusively to the non-sky portion of the point cloud to densify texture-rich regions, while integrating a MaskGaussian mechanism to dynamically prune redundant Gaussians in over-populated areas like the sky. This probabilistic masking, regularized by a squared loss, allows for implicit sparsity and effectively suppresses floating artifacts. The training objective combines photometric and geometric losses, with depth supervision applied sparsely to the aligned depth maps and dense normal supervision applied across all frames. The optimized 3DGS model can be further processed to extract a mesh for applications requiring collision detection or physics simulation.
Experiment
The evaluation assesses the HY-World 2.0 pipeline through component-wise analysis of panorama generation, trajectory planning, scene reconstruction, and 3DGS composition, alongside a standalone evaluation of the WorldMirror 2.0 reconstruction foundation. The results demonstrate that the system produces highly coherent, aesthetically pleasing, and geometrically consistent 3D worlds that strictly adhere to text or image inputs while outperforming commercial and state-of-the-art models. Furthermore, the framework exhibits robust multi-resolution generalization and efficient scaling through optimized inference techniques, enabling the creation of high-fidelity, interactive 3D environments.
The the the table presents a comparison of single-view 3D reconstruction methods on two benchmarks, Tanks-and-Temples and MipNeRF360. Results show that the proposed WorldStereo 2.0 method achieves the highest performance across most metrics on both datasets, with particularly strong improvements in precision and F1-score. The method also demonstrates superior geometric consistency and physically plausible 3D structures compared to existing approaches. WorldStereo 2.0 achieves the best scores on most metrics for single-view 3D reconstruction It shows significant improvements in precision and F1-score over prior methods The results confirm the ability to synthesize highly consistent and plausible 3D structures

The the the table presents an ablation study on trajectory planning components, showing the maximum number of views generated by each trajectory type and whether they are attached to objects or iterative. Regular and surrounding trajectories generate the highest numbers, with surrounding views being attached to objects and iterative. Aerial and wandering trajectories generate fewer views, with none attached to objects or iterative. Regular and surrounding trajectories generate the most views, with surrounding views attached to objects. Aerial and wandering trajectories generate fewer views and are not attached to objects or iterative. Surrounding and recon-aware trajectories are iterative, while regular and wandering are not.

The the the table compares the performance of WorldMirror 1.0 and WorldMirror 2.0 across different datasets and resolutions. Results show that WorldMirror 2.0 consistently outperforms WorldMirror 1.0, particularly at higher resolutions, and benefits from the integration of geometric priors. The improvements are most pronounced in the 7-Scenes dataset. WorldMirror 2.0 achieves better performance than WorldMirror 1.0 across all datasets and resolutions. The integration of geometric priors significantly improves results for both versions, with the largest gains seen in WorldMirror 2.0. WorldMirror 2.0 maintains stable performance at high resolutions, while WorldMirror 1.0 degrades significantly.

The the the table presents a detailed runtime analysis of the HY-World 2.0 pipeline, showing the time taken for each major stage. The results indicate that world expansion is the most time-consuming component, followed by trajectory planning, while panorama generation is the fastest step. The total runtime for the complete pipeline is summarized, providing insight into the computational efficiency of the system. World expansion is the most time-consuming stage in the pipeline. Trajectory planning takes significantly longer than panorama generation. The total runtime for the full world generation process is 712 seconds.

The the the table compares camera pose estimation performance across different methods and resolutions. WorldMirror 2.0 achieves the highest AUC@30 score at high resolution, demonstrating improved accuracy over previous versions and other methods. It also shows consistent improvements in RTA@30 and depth metrics across resolutions. WorldMirror 2.0 achieves the highest AUC@30 score at high resolution WorldMirror 2.0 maintains superior performance across all metrics compared to WorldMirror 1.0 WorldMirror 2.0 shows consistent improvements in RTA@30 and depth metrics across resolutions

The evaluation compares WorldStereo 2.0 and WorldMirror 2.0 against existing methods and previous versions through benchmarks on 3D reconstruction, trajectory planning, point map reconstruction, camera pose estimation, and pipeline runtime. The results demonstrate that WorldStereo 2.0 produces more geometrically consistent and physically plausible 3D structures, while WorldMirror 2.0 benefits significantly from geometric priors to maintain stable performance at high resolutions. Additionally, the ablation study identifies the most effective trajectory types for view generation, and the runtime analysis highlights world expansion as the primary computational bottleneck in the pipeline.