HyperAIHyperAI

Command Palette

Search for a command to run...

깊이 임의 전경: 전경 깊이 추정을 위한 기반 모델

Xin Lin Meixi Song Dizhe Zhang Wenxuan Lu Haodong Li Bo Du Ming-Hsuan Yang Truong Nguyen Lu Qi

초록

본 연구에서는 다양한 장면 거리에 걸쳐 일반화가 가능한 패노라마 메트릭 깊이 기반 모델을 제안한다. 데이터 구축과 아키텍처 설계의 관점에서 데이터-인-더-루프 패러다임을 탐구한다. 공개 데이터셋, UE5 시뮬레이터를 통해 생성한 고품질 합성 데이터, 텍스트-이미지 모델을 활용한 데이터, 그리고 웹에서 수집한 실제 패노라마 이미지를 결합하여 대규모 데이터셋을 구축한다. 실내/실외 및 합성/실제 데이터 간의 도메인 갭을 줄이기 위해, 레이블이 없는 이미지에 대한 신뢰할 수 있는 지표를 생성하기 위해 세 단계의 가상 레이블 정제 파이프라인을 도입한다. 모델 측면에서는 강력한 사전 훈련된 일반화 능력을 지닌 DINOv3-Large를 백본으로 채택하고, 즉시 사용 가능한 거리 마스크 헤드, 선명도 중심 최적화, 기하학 중심 최적화를 도입하여 다양한 거리에 대한 견고성과 다양한 시점 간의 기하학적 일관성을 강화한다. Stanford2D3D, Matterport3D, Deep360 등 여러 벤치마크에서의 실험 결과는 뛰어난 성능과 제로샷 일반화 능력을 입증하며, 다양한 실제 환경에서 특히 견고하고 안정적인 메트릭 예측이 가능함을 보여준다. 프로젝트 페이지는 다음 링크에서 확인할 수 있다: https://insta360-research-team.github.io/DAP_website/ {https://insta360-research-team.github.io/DAP_website/}

One-sentence Summary

Insta360 Research, UC San Diego, and Wuhan University propose DAP, a panoramic metric depth foundation model using a data-in-the-loop paradigm that fuses synthetic and real data with a three-stage pseudo-label pipeline. Leveraging DINOv3-Large with range mask head and geometry optimizations, it achieves robust zero-shot depth estimation across diverse real-world scenes, outperforming benchmarks like Stanford2D3D through enhanced distance robustness and cross-view consistency.

Key Contributions

  • The work addresses the challenge of poor generalization in panoramic depth estimation across diverse real-world scenes, particularly outdoors, due to limited and non-diverse training data that hinders robust metric depth prediction for applications like robotic navigation.
  • It introduces a three-stage pseudo-label curation pipeline that bridges indoor-outdoor and synthetic-real domain gaps by progressively refining ground truth for unlabeled images, alongside a foundation model with a plug-and-play range mask head and geometry/sharpness-centric optimizations to enforce metric consistency across varying distances.
  • Evaluated on Stanford2D3D, Matterport3D, and Deep360 benchmarks, the model demonstrates strong zero-shot generalization and robust metric depth predictions in complex real-world scenarios, achieving state-of-the-art performance with perceptually coherent and scale-consistent outputs.

Introduction

Panoramic depth estimation enables critical spatial intelligence for robotics, such as omnidirectional obstacle avoidance, by capturing full 360°×180° environmental coverage. However, existing methods—both panorama-specific relative-depth approaches and unified metric-depth frameworks—struggle to generalize across diverse real-world scenes, particularly outdoors, due to limited training data scale and diversity caused by high collection and annotation costs. The authors address this by introducing a data-in-the-loop paradigm that constructs a 2M+ sample dataset spanning synthetic and real domains, including curated indoor data, photorealistic outdoor simulations, and internet-sourced panoramas. They further develop a three-stage pseudo-label curation pipeline to bridge indoor-outdoor and synthetic-real domain gaps, alongside a foundation model incorporating geometry-aware losses and a plug-and-play depth-range mask for metric-consistent, scale-invariant depth estimation across complex real-world scenarios.

Dataset

The authors introduce the DAP data engine, a unified dataset scaling to 2 million panoramas across synthetic/real and indoor/outdoor domains. Key components include:

  • DAP-2M-Labeled (90K samples):

    • Source: High-fidelity Airsim360 simulations of drone flights.
    • Content: Pixel-aligned depth maps and panoramas from 5 outdoor scenes (e.g., NYC, Rome), covering 26,600+ sequences.
    • Filtering: None specified beyond simulation parameters.
  • DAP-2M-Unlabeled (1.7M samples):

    • Source: Internet-sourced panoramic videos (250K videos → 1.7M frames after processing).
    • Filtering: Horizon validation to remove invalid samples; categorized via Qwen2-VL into 250K indoor and 1.45M outdoor.
    • Augmentation: 200K additional indoor samples generated via DiT-360 to address scarcity.

The paper uses this data in a three-stage semi-supervised pipeline:

  1. Stage 1: Trains a Scene-Invariant Labeler on DAP-2M-Labeled for initialization.
  2. Stage 2: Uses a PatchGAN discriminator to select 600K high-confidence pseudo-labeled samples (300K indoor/outdoor each) from DAP-2M-Unlabeled, bridging synthetic-real domain gaps.
  3. Stage 3: Trains the final model on all labeled and pseudo-labeled data (2M total), enabling cross-domain generalization.

Processing details include horizon-based filtering for real data, Qwen2-VL for scene categorization, and DiT-360 for synthetic indoor data expansion. No cropping strategies are mentioned; depth maps remain pixel-aligned for synthetic data.

Method

The authors leverage a three-stage pseudo-label curation pipeline to construct a large-scale, high-quality training dataset for panoramic metric depth estimation, effectively bridging domain gaps between synthetic and real, indoor and outdoor environments. The pipeline begins with Stage 1, where a Scene-Invariant Labeler is trained on 20k synthetic indoor and 90k synthetic outdoor images with accurate metric depth annotations. This stage aims to learn depth cues that generalize across diverse scene geometries and lighting conditions, providing a robust foundation for pseudo-label generation. As shown in the framework diagram, the Scene-Invariant Labeler is then applied to 1.9M unlabeled real panoramic images to generate initial depth predictions.

In Stage 2, a depth quality discriminator is pre-trained to distinguish between synthetic ground-truth depth maps (real) and Scene-Invariant Labeler outputs (fake), enabling scene-agnostic quality assessment. The top 300K indoor and 300K outdoor real images, as ranked by the discriminator, are selected as high-confidence pseudo-labeled data. These are combined with the synthetic datasets to train a Realism-Invariant Labeler, which learns to generalize beyond synthetic textures and lighting, improving robustness to real-world appearance variations. Stage 3 then trains the final DAP model on the full 1.9M pseudo-labeled dataset, along with previously labeled data, enabling dense supervision and strong generalization.

For the model architecture, the authors adopt DINOv3-Large as the visual backbone, chosen for its strong pre-trained generalization capabilities. The network features a dual-head design: a metric depth head that predicts a dense depth map DDD, and a plug-and-play range mask head that outputs a binary mask MMM defining valid spatial regions under predefined distance thresholds (10 m, 20 m, 50 m, 100 m). The final metric depth output is computed as MDM \odot DMD, ensuring physical validity and scale consistency across varying scene scales. The range mask head is optimized using a combination of weighted BCE and Dice losses:

Lmask=MMgt2+0.5LDice(M,Mgt),\mathcal{L}_{\text{mask}} = \| M - M_{\text{gt}} \|^{2} + 0.5 \, \mathcal{L}_{\text{Dice}}(M, M_{\text{gt}}),Lmask=MMgt2+0.5LDice(M,Mgt),

where MMM and MgtM_{\text{gt}}Mgt denote predicted and ground-truth masks.

Training is guided by a multi-objective loss function that integrates both geometry-centric and sharpness-centric optimizations. The geometry-centric component includes the normal loss Lnormal\mathcal{L}_{\text{normal}}Lnormal, which computes the L1 distance between predicted and ground-truth surface normals npred,ngtRH×W×3\mathbf{n}_{\text{pred}}, \mathbf{n}_{\text{gt}} \in \mathbb{R}^{H \times W \times 3}npred,ngtRH×W×3:

Lnormal=npred(i,j)ngt(i,j)1.\mathcal{L}_{\text{normal}} = \lVert \mathbf{n}_{\text{pred}}(i, j) - \mathbf{n}_{\text{gt}}(i, j) \rVert_{1}.Lnormal=npred(i,j)ngt(i,j)1.

Additionally, a point cloud loss Lpts\mathcal{L}_{\text{pts}}Lpts is applied by projecting depth maps into spherical coordinates to obtain 3D point clouds Ppred,PgtRH×W×3\mathbf{P}_{\text{pred}}, \mathbf{P}_{\text{gt}} \in \mathbb{R}^{H \times W \times 3}Ppred,PgtRH×W×3:

Lpts=Ppred(i,j)Pgt(i,j)1.\mathcal{L}_{\text{pts}} = \| \mathbf{P}_{\text{pred}}(i, j) - \mathbf{P}_{\text{gt}}(i, j) \|_{1}.Lpts=Ppred(i,j)Pgt(i,j)1.

The sharpness-centric component introduces a dense fidelity loss LDF\mathcal{L}_{\text{DF}}LDF, which decomposes the equirectangular projection into 12 perspective patches using virtual cameras positioned at the vertices of an icosahedron. For each patch, a Gram-based similarity is computed between predicted and ground-truth depth maps:

LDF=1Nk=1NDpred(k)Dpred(k)TDgt(k)Dgt(k)TF2,\mathcal{L}_{\text{DF}} = \frac{1}{N} \sum_{k=1}^{N} \left\| D_{\text{pred}}^{(k)} \odot D_{\text{pred}}^{(k)\textnormal{T}} - D_{\text{gt}}^{(k)} \odot D_{\text{gt}}^{(k)\textnormal{T}} \right\|_{F}^{2},LDF=N1k=1NDpred(k)Dpred(k)TDgt(k)Dgt(k)TF2,

where N=12N=12N=12. To preserve edge fidelity directly in the ERP domain, a gradient-based loss Lgrad\mathcal{L}_{\text{grad}}Lgrad is applied within edge regions masked by Sobel operators:

Lgrad=LSILog(MEDpred,MEDgt).\mathcal{L}_{\text{grad}} = \mathcal{L}_{\text{SILog}}(M_{E} \odot D_{\text{pred}}, \, M_{E} \odot D_{\text{gt}}).Lgrad=LSILog(MEDpred,MEDgt).

The overall training objective is a weighted sum of all losses, modulated by a distortion map MdistortM_{\text{distort}}Mdistort to account for non-uniform pixel distribution in equirectangular projections:

Ltotal=Mdistort(λ1LSILog+λ2LDF+λ3Lgrad+λ4Lnormal+λ5Lpts+λ6Lmask),(=1,2,3,4,5,6).\begin{array}{r} \mathcal{L}_{\text{total}} = M_{\text{distort}} \odot \big( \lambda_{1} \mathcal{L}_{\text{SILog}} + \lambda_{2} \mathcal{L}_{\text{DF}} + \lambda_{3} \mathcal{L}_{\text{grad}} \\ + \lambda_{4} \mathcal{L}_{\text{normal}} + \lambda_{5} \mathcal{L}_{\text{pts}} + \lambda_{6} \mathcal{L}_{\text{mask}} \big), \quad \quad (\ell = 1, 2, 3, 4, 5, 6). \end{array}Ltotal=Mdistort(λ1LSILog+λ2LDF+λ3Lgrad+λ4Lnormal+λ5Lpts+λ6Lmask),(=1,2,3,4,5,6).

As shown in the architecture diagram, this multi-faceted optimization strategy ensures metric accuracy, edge fidelity, and geometric consistency across panoramic views.

The authors further incorporate a distortion-aware depth decoder to mitigate stretching artifacts near the poles, enhancing the fidelity of depth predictions in high-distortion regions. This comprehensive design enables DAP to adapt robustly to diverse spatial scales and scene types, delivering stable and accurate metric depth estimation across real-world panoramic environments.

Experiment

  • DAP model evaluated on Stanford2D3D, Matterport3D, and Deep360 benchmarks achieves state-of-the-art zero-shot metric depth estimation without fine-tuning, significantly lowering AbsRel and increasing δ₁ compared to prior methods.
  • On the DAP-Test benchmark, DAP reduces AbsRel from 0.2517 to 0.0781 and RMSE from 10.563 to 6.804 while raising δ₁ from 0.6086 to 0.9307 versus previous top methods (DAC and Unik3D).
  • Ablation studies confirm key components: adding sharpness-centric losses (ℒ_DF, ℒ_grad) yields best results (AbsRel 0.1084/0.0862 on Stanford2D3D/Deep360), and the range mask head at 100 m threshold optimizes performance (AbsRel 0.0793, δ₁ 0.9353 on DAP-2M-Labeled).
  • DAP demonstrates robust metric consistency across indoor and outdoor scenes, maintaining accurate depth structures in challenging regions like distant skies and complex layouts where prior methods fail.

The authors evaluate the impact of different depth mask thresholds on model performance, finding that a 100 m threshold yields the best overall balance with the lowest AbsRel and highest δ1 on both DAP-2M-Labeled and Deep360. Removing the mask head degrades performance, confirming its role in stabilizing training and filtering unreliable far-depth predictions. Results show that smaller thresholds (10 m, 20 m) favor near-range accuracy, while the 100 m setting best preserves metric consistency across diverse scene scales.

The authors evaluate DAP against scale-invariant and metric depth estimation methods on three benchmarks, showing that DAP achieves the lowest AbsRel and RMSE while attaining the highest δ1 across all datasets. Results indicate DAP outperforms prior metric methods like Unik3D and DAC, particularly in outdoor scenes on Deep360, where it reduces AbsRel to 0.0659 and δ1 reaches 0.9525. These improvements reflect DAP’s ability to deliver metrically consistent depth without post-alignment, generalizing robustly across indoor and outdoor panoramic environments.

The authors evaluate DAP on the DAP-Test benchmark and show it outperforms DAC and Unik3D across all metrics, achieving a lower AbsRel of 0.0781, lower RMSE of 6.804, and higher δ₁ of 0.9370. These results demonstrate the effectiveness of their large-scale data scaling and unified training framework for metrically consistent panoramic depth estimation.

The authors use DAP to achieve metric depth estimation directly from panoramic inputs without post-alignment, outperforming prior methods across indoor and outdoor benchmarks. Results show DAP’s training on 2 million panorama samples enables strong generalization, with superior performance in AbsRel, RMSE, and δ1 metrics on Stanford2D3D, Matterport3D, and Deep360. The model’s design, including distortion handling and range-aware masking, supports robust, metrically consistent predictions across diverse real-world scenes.

The authors use ablation studies to evaluate the impact of distortion maps, geometry losses, and sharpness losses on panoramic depth estimation. Results show that combining all three components yields the best performance, achieving the lowest AbsRel and highest δ₁ on both Stanford2D3D and Deep360 datasets. The inclusion of sharpness losses contributes most significantly to improved metric accuracy and structural fidelity.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp