ABot-Earth 0.5: A Generative 3D Model of the Earth
Abstract
We present ABot-Earth 0.5, a generative 3D framework designed to synthesize vast, seamless 3D environments from ubiquitous, geospatially referenced satellite imagery. To achieve this, we propose a novel generative model formulated directly on the 3D Gaussian Splatting (3DGS) representation. The model is trained on a diverse corpus of existing real-world urban reconstructions and learns to generate realistic geometry and textures. At inference, it synthesizes new 3D scenes conditioned solely on satellite imagery, at a scalable rate of under 10 minutes per square kilometer and with exceptional realism. The framework is designed for accessibility and includes integrated hierarchical level-of-detail (LOD) structures that enable real-time, interactive visualization on web-based map engines. This high-fidelity simulation sandbox effectively mitigates the sim-to-real domain gap and enables critical downstream Embodied AI applications such as closed-loop UAV navigation. By providing an ultra-low-cost, highly efficient solution, ABot-Earth 0.5 substantially lowers the technical and financial barriers to large-scale 3D reconstruction and advances the future of global digital earth visualization.
One-sentence Summary
ABot-Earth 0.5 is a generative 3D framework that synthesizes seamless Earth environments from geospatially referenced satellite imagery using a novel 3D Gaussian Splatting formulation trained on urban reconstructions, generating realistic scenes in under 10 minutes per square kilometer while leveraging integrated hierarchical level-of-detail structures for real-time web visualization and mitigating the sim-to-real domain gap for Embodied AI applications like closed-loop UAV navigation.
Key Contributions
- This work introduces ABot-Earth 0.5, a generative 3D framework that synthesizes vast, seamless urban environments directly from geospatially referenced satellite imagery.
- The approach formulates a novel generative model on the 3D Gaussian Splatting (3DGS) representation and trains it on a diverse corpus of real-world urban reconstructions. Inference runs at under 10 minutes per square kilometer, and integrated hierarchical level-of-detail structures enable real-time interactive visualization on web-based map engines.
- The framework mitigates the sim-to-real domain gap by providing a high-fidelity simulation environment that supports closed-loop unmanned aerial vehicle (UAV) navigation and Embodied AI training.
Introduction
High-fidelity three-dimensional geospatial reconstruction serves as a critical foundation for digital twin infrastructures, smart city logistics, and autonomous system simulation. Traditional reconstruction pipelines built on dense photogrammetry and LiDAR scanning face prohibitive acquisition costs, prolonged processing latencies, and steep computational barriers. Although generative 3D modeling has matured at the object scale, scaling these techniques to unbounded outdoor environments remains difficult because existing models rely heavily on synthetic assets or unconstrained hallucination, creating a severe sim-to-real domain gap. To address these challenges, the authors introduce ABot-Earth 0.5, a generative framework trained directly on high-quality real-world 3D Gaussian Splatting reconstructions. By leveraging ubiquitous satellite imagery as a geospatial conditioning signal and natively generating hierarchical level-of-detail outputs, the model rapidly synthesizes physically authentic, simulation-ready 3D environments at planetary scales. This generative paradigm effectively closes the authenticity gap while dramatically lowering data and computational overhead, paving the way for scalable, cost-effective digital earth applications.
Dataset
Dataset Composition and Sources
- The authors build a real-world, city-scale dataset of 3D Gaussian Splatting (3DGS) scenes sourced from three complementary imagery categories: satellite, aerial, and urban. All inputs are genuine captures rather than synthetic assets, combining proprietary acquisitions with curated public benchmarks. Every source undergoes unified coordinate transformation and metadata standardization before entering the reconstruction pipeline.
Subset Details
- Satellite Imagery: Multi-stereo orbital captures at varying off-nadir angles drawn from public benchmarks like DFC 2019 and proprietary archives. These are processed through a dedicated FromOrbit2Ground module that recovers watertight geometry via Z-Monotonic SDF and synthesizes facade textures using a diffusion restoration network.
- Aerial Data: High-resolution oblique imagery serving as the core training source. It draws from proprietary collections and public datasets such as UrbanScene3D and Mill-19. The reconstruction pipeline optionally incorporates LiDAR point clouds and pre-built photogrammetric meshes as auxiliary geometric priors.
- Urban Imagery: Street-view videos and low-altitude drone footage sourced from public repositories like UC-GS and proprietary feeds. After quality filtering, these ground-level captures are registered and jointly reconstructed with aerial data to enhance facade detail and novel-view quality at low altitudes.
Data Usage and Processing
- The authors convert the reconstructed 3DGS scenes into compact, generation-friendly training tiles using a sliding window strategy. Each tile covers a 200m by 200m region with intentional overlap to preserve boundary context, followed by coordinate normalization and clustering-based removal of floating artifacts.
- Dense multi-view supervision is generated by distributing virtual camera arrays across multiple altitude layers. The system samples oblique views across various compass directions and applies random perturbations to camera position, altitude, pitch, and yaw to maximize viewpoint diversity. Simulated satellite renders are also generated to serve as conditioning inputs for model training.
- A multi-granularity quality assessment pipeline evaluates the data at the tile, view, and dataset levels. Only high-fidelity samples that pass these filters are curated into the final training set.
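The sliding-window tiling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 200 m tile size comes from the text, while the 40 m overlap, the function names, and the choice to place the last tile flush with the far edge of the scene are assumptions made for the example.

```python
from typing import Iterator, List, Tuple

def tile_origins(extent_m: float, tile_m: float = 200.0,
                 overlap_m: float = 40.0) -> List[float]:
    """Start coordinates of overlapping tiles along one axis of a scene.

    The overlap value is an assumption; the source only says tiles overlap.
    """
    if not 0.0 <= overlap_m < tile_m:
        raise ValueError("overlap must be in [0, tile size)")
    stride = tile_m - overlap_m
    origins: List[float] = []
    x = 0.0
    # Step by the stride while a further tile is still needed after this one.
    while x + tile_m < extent_m:
        origins.append(x)
        x += stride
    # Final tile sits flush with the far edge (or at 0 for small scenes).
    origins.append(max(extent_m - tile_m, 0.0))
    return origins

def sliding_window_tiles(width_m: float, height_m: float,
                         tile_m: float = 200.0, overlap_m: float = 40.0
                         ) -> Iterator[Tuple[float, float, float, float]]:
    """Yield (x_min, y_min, x_max, y_max) tile bounds in scene-local metres."""
    for y in tile_origins(height_m, tile_m, overlap_m):
        for x in tile_origins(width_m, tile_m, overlap_m):
            yield (x, y, x + tile_m, y + tile_m)

if __name__ == "__main__":
    for bounds in sliding_window_tiles(500.0, 200.0):
        print(bounds)
```

Placing the last tile flush with the scene edge keeps every tile the same size, at the cost of a larger overlap for that final tile.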
Method
The authors present ABot-Earth 0.5, a generative 3D framework designed to synthesize vast, seamless 3D environments from satellite imagery, leveraging a novel generative model formulated directly with the 3D Gaussian Splatting (3DGS) representation. The framework's core architecture is built upon a comprehensive pipeline that begins with data collection from diverse sources, including satellite, aerial, and urban imagery. These multi-source inputs are processed through the ABot-3DGS reconstruction engine, which addresses the challenges of scalability, heterogeneous content, and appearance variation. The reconstruction process employs a scalable, hierarchical block-based architecture that partitions city-scale scenes into independently optimizable blocks, enabling efficient GPU cluster parallelism. This framework incorporates geometry and detail optimization strategies, such as depth estimation and multi-view geometric consistency, to ensure high geometric accuracy and fine-grained texture preservation. Scene robustness is achieved through semantics-aware optimization and dynamic removal of transient elements, while cross-view quality enhancement leverages multi-source data fusion to produce photorealistic reconstructions. The resulting high-fidelity 3DGS scenes serve as the foundation for the downstream generative model.
The generative model is a native 3DGS framework: it operates directly on the 3DGS representation and learns a compact latent space from real-world scenes, which lets it handle the complexity of real environments without mesh-based assumptions. A key component is a built-in multi-LOD decoder, integrated into the generation process, that synthesizes a hierarchical 3DGS structure. This allows the appropriate level of detail to be generated on demand and supports smooth, real-time interactive visualization from planetary overviews to street-level views. To maintain spatial coherence at large scales, the model uses a seamless sliding-window inference strategy that blends overlapping regions during generation, reducing stitching artifacts and enabling vast, continuous landscapes. Because satellite imagery varies significantly in quality and acquisition conditions, the model also uses a two-stage cross-domain adaptation strategy for robust conditioning: satellite renderings are simulated during training, and a vision-language model (VLM) dynamically adapts the conditioning at inference so that generation remains high-fidelity on real-world satellite inputs.
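The idea of blending overlapping windows can be illustrated with a simple one-dimensional example. The source does not say which quantity is blended or which weighting is used, so the linear "feathering" weights and the helper names below are assumptions chosen only to show why a weighted average across the overlap avoids a hard seam.

```python
from typing import List, Tuple

def feather_weights(length: int, overlap: int) -> List[float]:
    """Weights that ramp linearly near both ends of a window (assumed scheme)."""
    weights = []
    for i in range(length):
        ramp_in = (i + 1) / (overlap + 1)        # rises from the left edge
        ramp_out = (length - i) / (overlap + 1)  # falls towards the right edge
        weights.append(min(1.0, ramp_in, ramp_out))
    return weights

def blend_windows(windows: List[Tuple[int, List[float]]],
                  total: int, overlap: int) -> List[float]:
    """Weighted average of per-window values placed at their start offsets."""
    acc = [0.0] * total
    weight_sum = [0.0] * total
    for start, values in windows:
        for i, (v, w) in enumerate(zip(values, feather_weights(len(values), overlap))):
            acc[start + i] += w * v
            weight_sum[start + i] += w
    # Positions not covered by any window are left at 0.0.
    return [a / s if s > 0.0 else 0.0 for a, s in zip(acc, weight_sum)]

if __name__ == "__main__":
    # Two windows of length 4 that disagree (1.0 vs 3.0) and share 2 samples.
    blended = blend_windows([(0, [1.0] * 4), (2, [3.0] * 4)], total=6, overlap=2)
    print(blended)
```

In the overlap, each window's influence fades as its own edge approaches, so the blended values move gradually from one window's output to the other's instead of jumping at a tile boundary.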
The deployment of ABot-Earth 0.5 as a planetary-scale system relies on a two-stage end-to-end pipeline. The first stage, the Global-Scale 3DGS Production Pipeline, utilizes a tile-based generation strategy to manage the immense computational requirements. The globe is partitioned into regular spatial tiles, with each tile processed independently to fit within the VRAM constraints of inference GPUs. This modular approach allows for the generation of large-scale blocks, which are then processed for georeferencing and input preprocessing to ensure uniform scale and accurate alignment. The second stage, the EarthScape Scalable Rendering Pipeline, addresses the challenges of managing and rendering the colossal dataset. It begins with geographic alignment, transforming each block into a unified coordinate system (EPSG:3857) and establishing an ENU local tangent plane for precise rendering. This is followed by extensive LOD data reorganization, which re-partitions the Gaussians into a standard map tile hierarchy, generating a multi-level LOD structure from zoom level 14 to 19. The highest precision levels are generated natively by the inference model, while lower levels are efficiently decimated using a statistical scheme based on the Bhattacharyya distance. This process leverages heterogeneous compute resources to minimize latency. The pipeline culminates in rendering scheduling, where the organized data is integrated with the Amap Yunjing rendering engine. The engine's existing frustum culling and asynchronous streaming capabilities are leveraged to dynamically load tiles based on the camera's viewport, enabling real-time, interactive rendering of a trillion-scale global 3DGS dataset.
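Two ingredients of the rendering pipeline lend themselves to a short sketch: addressing tiles in a standard Web Mercator map hierarchy (zoom levels 14 to 19), and measuring how similar two Gaussians are with the Bhattacharyya distance. The tile formula is the common "slippy map" convention, which is assumed here rather than confirmed by the source. The distance function is restricted to axis-aligned (diagonal-covariance) Gaussians for brevity, whereas 3DGS primitives generally have full covariances. How the pipeline uses the distance for decimation is not specified; treating a small distance as a sign that two Gaussians are candidates for merging is only one plausible reading.

```python
import math
from typing import Sequence, Tuple

def lonlat_to_tile(lon_deg: float, lat_deg: float, zoom: int) -> Tuple[int, int]:
    """Web Mercator (slippy map) tile index containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat)) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def parent_tile(x: int, y: int, levels: int = 1) -> Tuple[int, int]:
    """Index of the ancestor tile `levels` zoom steps coarser."""
    return x >> levels, y >> levels

def bhattacharyya_diag(mu1: Sequence[float], var1: Sequence[float],
                       mu2: Sequence[float], var2: Sequence[float]) -> float:
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)                           # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / v            # mean-separation term
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))  # shape-mismatch term
    return dist

if __name__ == "__main__":
    fine = lonlat_to_tile(116.4, 39.9, 19)  # example coordinates, chosen arbitrarily
    print(fine, parent_tile(*fine, levels=5), lonlat_to_tile(116.4, 39.9, 14))
    print(bhattacharyya_diag([0, 0, 0], [1, 1, 1], [2, 0, 0], [1, 1, 1]))
```

Because each tile at zoom z splits into four tiles at zoom z + 1, a level-19 index shifted right by five bits gives the level-14 tile that contains it, which is what makes re-partitioning Gaussians into a multi-level hierarchy straightforward.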
Experiment
The evaluation framework assesses the proposed method through two complementary perspectives: generative fidelity against academic baselines and system-level applicability compared to leading commercial platforms. Analysis of generative fidelity demonstrates the model's superior capability to capture photorealistic details while enabling continuous, planetary-scale 3D environment creation that overcomes the spatial constraints of traditional survey-based approaches. In terms of system-level performance, the framework exhibits significantly faster deployment cycles and broader geographic coverage, while human assessments indicate stronger overall aesthetic appeal despite currently trailing established commercial pipelines in precise geometric reconstruction. Collectively, these qualitative findings validate the method as a highly scalable and timely solution for real-world digital twin applications, with its generative capabilities expected to progressively close the quality gap with industry standards.
The authors compare their system, ABot-Earth 0.5, with the commercial solutions Google Earth and Marble across key system-level dimensions. Results show that ABot-Earth 0.5 offers a generative paradigm with infinite coverage and an open platform, in contrast to the reconstruction-based approach and limited coverage of Google Earth and the closed nature of Marble.
- ABot-Earth 0.5 provides infinite coverage through a generative approach, unlike the sparse, scanned-region coverage of Google Earth.
- ABot-Earth 0.5 operates on an open platform, offering greater accessibility compared to the API-only access of Google Earth and the closed nature of Marble.
- The system leverages a generative paradigm, enabling broader spatial reach and faster creation of 3D environments compared to traditional reconstruction methods.
The authors evaluate their method's generative fidelity against existing baselines using standard metrics, demonstrating superior performance in generating realistic outdoor scenes. They also compare their system's applicability with commercial solutions, highlighting advantages in coverage, efficiency, and visual quality.
- The proposed method achieves significantly better generative fidelity compared to existing baselines, as indicated by lower FID and KID scores.
- The system outperforms commercial solutions in spatial coverage and scalability, enabling 3D generation across regions where data is otherwise unavailable.
- While commercial systems excel in geometric and textural fidelity, the proposed method achieves higher overall aesthetic quality, suggesting strengths in holistic photorealism.
The authors evaluate their proposed generative system against commercial platforms and existing baselines to validate spatial coverage, platform accessibility, and scene realism. The first comparison demonstrates that the generative paradigm enables unlimited geographic reach and open accessibility, contrasting with the restricted, scan-dependent nature of commercial alternatives. The second assessment confirms that while traditional reconstruction methods excel in precise geometric and textural accuracy, the proposed approach achieves superior holistic photorealism and scalability, establishing a more adaptable framework for large-scale 3D environment generation.