HyperAIHyperAI

Command Palette

Search for a command to run...

WildDet3D: 야생 환경에서의 Promptable 3D Detection 스케일링

초록

단일 이미지로부터 3D 객체를 이해하는 것은 공간 지능(spatial intelligence)의 핵심 초석입니다. 이 목표를 달성하기 위한 핵심 단계는 단안 3D 객체 탐지(monocular 3D object detection)로, 입력된 RGB 이미지로부터 객체의 범위, 위치 및 방향을 복원하는 것입니다. 오픈 월드(open world)에서 실용적으로 활용되기 위해서는 이러한 탐지기가 폐쇄형 세트(closed-set) 카테고리를 넘어 일반화될 수 있어야 하며, 다양한 prompt modality를 지원하고, 가능한 경우 기하학적 단서(geometric cues)를 활용할 수 있어야 합니다.현재 기술의 발전은 두 가지 병목 현상으로 인해 저해되고 있습니다. 첫째, 기존 방법론들은 단일 prompt 유형에 맞춰 설계되어 있으며 추가적인 기하학적 단서를 통합할 수 있는 메커니즘이 부족합니다. 둘째, 현재의 3D 데이터셋은 통제된 환경 내의 좁은 카테고리만을 다루고 있어 오픈 월드 전이(open-world transfer)에 한계가 있습니다.본 연구에서는 이러한 두 가지 간극을 모두 해결합니다. 먼저, 텍스트, point, box prompt를 네이티브하게 수용하며 inference 시점에 보조적인 depth 신호를 통합할 수 있는 통합 기하학 인지 아키텍처인 WildDet3D를 소개합니다. 둘째, 현재까지 가장 큰 규모의 오픈 3D 탐지 데이터셋인 WildDet3D-Data를 제시합니다. 이 데이터셋은 기존의 2D annotation으로부터 후보 3D box를 생성하고 인간이 검증한 것만을 유지하는 방식으로 구축되었으며, 다양한 실제 환경 장면에서 13.5K 카테고리에 걸쳐 100만 개 이상의 이미지를 포함합니다.WildDet3D는 여러 benchmark 및 설정에서 새로운 SOTA(state-of-the-art)를 달성했습니다. 오픈 월드 설정에서, 본 연구에서 새롭게 도입한 WildDet3D-Bench의 텍스트 및 box prompt 환경에서 각각 22.6/24.8 AP3D를 기록했습니다. Omni3D에서는 텍스트 및 box prompt를 통해 각각 34.2/36.4 AP3D에 도달했습니다. Zero-shot 평가에서는 Argoverse 2와 ScanNet에서 각각 40.3/48.9 ODS를 달성했습니다. 특히, inference 시점에 depth 단서를 통합함으로써 모든 설정에서 평균적으로 상당한 추가 이득(+20.7 AP)을 얻었습니다.

One-sentence Summary

To enable scalable and open-world monocular 3D object detection, the authors introduce WildDet3D, a unified geometry-aware architecture that supports text, point, and box prompts while incorporating auxiliary depth signals, and WildDet3D-Data, a dataset of over 1M images across 13.5K categories that allows the model to establish a new state of the art across multiple benchmarks.

Key Contributions

  • The paper introduces WildDet3D, a unified geometry-aware architecture that supports text, point, and box prompts while incorporating auxiliary depth signals through a specialized depth fusion module.
  • This work presents WildDet3D-Data, a large-scale dataset containing over 1M images across 13.5K categories, which was constructed using a multi-model candidate generation pipeline followed by human and VLM verification.
  • Experimental results demonstrate that WildDet3D achieves state-of-the-art performance on the Omni3D benchmark and shows strong zero-shot generalization across diverse datasets like Argoverse 2 and ScanNet.

Introduction

Monocular 3D object detection is essential for spatial intelligence in applications like robotics, AR/VR, and mobile devices. However, existing methods are often limited to closed-set categories and fixed interaction modes, typically supporting only a single type of prompt. Furthermore, current 3D datasets are often restricted to narrow categories and controlled environments, which hinders open-world generalization. The authors address these challenges by introducing WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts while allowing for the integration of auxiliary depth signals at inference time. To support this model, they also present WildDet3D-Data, a massive dataset containing over 1 million human-verified images across 13.5K categories to enable robust open-vocabulary 3D perception.

Dataset

Dataset overview
Dataset overview

WildDet3D-Data Overview

The authors introduce WildDet3D-Data, a large-scale dataset designed for open-vocabulary 3D detection in diverse, real-world environments. It features over 1M images, 3.7M valid 3D annotations, and 13.5K object categories, representing a 138x increase in category coverage compared to existing datasets like Omni3D.

Dataset Composition and Sources The dataset is built upon dense 2D annotations from four primary large-scale sources:

  • COCO: 118K training and 5K validation images.
  • LVIS: COCO images with long-tail annotations covering over 1,200 categories.
  • Objects365: 609K training and 30K validation images across 365 categories.
  • V3Det: 183K training and 30K validation images.

Data Processing and Candidate Generation To lift 2D annotations into 3D space, the authors employ a multi-stage pipeline:

  • Geometric Lifting: Images undergo 4x super-resolution before metric depth and camera intrinsics are estimated. Candidate 3D boxes are then generated using five complementary methods: 3D-MOOD, DetAny3D, SAM-3D, RANSAC-PCA, and LabelAny3D.
  • Refinement: Initial candidates undergo translation and rotation optimization to align with estimated depth maps and 2D projection constraints.
  • Multi-Stage Filtering:
    • Geometric Filters: Candidates are removed based on edge contact, occlusion ratios, or unrealistic 3D-to-2D projection sizes.
    • Semantic Filters: A VLM (Qwen3.5-9B) removes depicted objects (e.g., pictures or reflections) and composite images.
    • Size and Geometry Filters: GPT-4o-mini estimates physical dimensions to filter out implausible scales, depth-to-width ratios, and axis proportions.
  • Final Selection: The authors use two paths to finalize annotations:
    • Human Selection: A subset of ~103K images is verified by crowdsourced annotators who rate candidate quality.
    • VLM Selection: For the remaining ~896K images, a fine-tuned Molmo2 model automatically selects the best candidate based on six perceptual criteria.

Training and Evaluation Usage The authors use the data in a three-stage training curriculum:

  • Stage 1: Initial training on Omni3D.
  • Stage 2: Fine-tuning on a mixture of Omni3D, WildDet3D-Data (both human and synthetic subsets), and supplementary datasets (CA-1M, Waymo, 3EED, and FoundationPose).
  • Stage 3: Final fine-tuning on Omni3D and the human-annotated portion of WildDet3D-Data using mask-guided point and box training.

For evaluation, the authors construct WildDet3D-Bench, an in-the-wild benchmark containing 700+ open-vocabulary categories. This benchmark uses a balanced sampling strategy to ensure coverage across rare, common, and frequent categories.

Method

The WildDet3D framework is designed to perform 3D object detection from a single RGB image, optionally augmented with camera intrinsics and depth information, guided by a user-specified prompt. The architecture is structured around three primary components: a dual-vision encoder system that processes visual and geometric inputs, a promptable detector that conditions detection on diverse prompt types, and a 3D detection head that produces metric 3D bounding boxes with unambiguous orientation. An overview of the framework is shown in Figure 3, which illustrates the modular flow from input through feature extraction and fusion to multi-task prediction.

The dual-vision encoder system decouples semantic and geometric feature extraction to address the inherent trade-off between detection quality and metric depth estimation. It consists of an image encoder and an RGBD encoder. The image encoder, a Vision Transformer (ViT-H) with a SimpleFPN neck, is initialized from a segmentation-pretrained checkpoint and extracts high-resolution, multi-scale semantic features. The RGBD encoder, built on a DINOv2 ViT-L/14 backbone, processes the same image along with an optional depth map, producing depth latents through a convolutional neck. These two encoders operate independently, allowing the architecture to leverage different pretrained models optimized for their respective tasks—semantic segmentation for the image encoder and metric depth estimation for the RGBD encoder. The depth fusion module, highlighted in yellow in Figure 3, merges these two streams by injecting depth latents into the image encoder's feature maps. This is achieved through a residual connection where the depth latents, after being bilinearly upsampled to match the visual feature resolution and normalized via LayerNorm, are projected to the visual feature dimension using a zero-initialized 1×1 convolution. This design ensures that the pretrained visual features remain stable during training, with the depth contribution being gradually learned.

Overview of the WildDet3D architecture
Overview of the WildDet3D architecture

The promptable detector, depicted as the purple block in Figure 3, unifies various input prompt types into a single representation for the detection heads. It accepts four prompt modalities: text, point, box, and exemplar. Each prompt type is encoded separately: text prompts are tokenized and passed through a causal text Transformer, while geometric prompts (point and box) are encoded by summing a direct coordinate projection, ROI-aligned features, and sinusoidal positional encoding, refined by a cross-attention Transformer. Exemplar prompts use a similar encoding but are distinguished by a special token and a multi-target matching strategy. The encoded tokens from all prompt types are concatenated into a single sequence, which acts as cross-attention memory in the subsequent detection stages. This component operates on a per-prompt batching strategy, where training batches are constructed around unique prompt instances rather than images, enabling fine-grained supervision and handling an arbitrary number of categories per image.

Dual-vision encoder and depth fusion module
Dual-vision encoder and depth fusion module

The 3D detection head, shown in red in Figure 3, is responsible for generating the final 3D bounding box predictions. It takes the query features from the promptable detector and enriches them with multi-source information. For each decoder layer, it first incorporates camera geometry by generating per-pixel ray directions from the camera intrinsics and encoding them using 8th-order real spherical harmonics. This ray feature is fused via cross-attention. Subsequently, it fuses depth latents from the RGBD encoder using another cross-attention module. The fused query features are then passed through a two-layer MLP to predict a 12-dimensional encoding of the 3D box, which includes center offset, log-depth, log-dimensions, and a 6D rotation representation. To resolve the inherent ambiguity in 3D box orientation, a two-step unambiguous rotation normalization is applied to both ground truth and predictions: dimensions are ordered such that width is less than or equal to length, and the yaw angle is folded into the interval [0,π)[0, \pi)[0,π). This normalization ensures a one-to-one mapping between box geometry and the regression target. The 3D center is recovered at inference by back-projecting the predicted offset and depth. A parallel confidence branch, also a two-layer MLP, predicts a scalar score s3D[0,1]s_{\text{3D}} \in [0, 1]s3D[0,1], which is trained with an IoU-aware focal BCE loss using a soft target that combines depth prediction quality and 3D IoU. The final detection score is a weighted sum of the 2D objectness score and the 3D confidence.

3D detection head with multi-source information aggregation
3D detection head with multi-source information aggregation

Experiment

The researchers evaluate WildDet3D through extensive testing on a new in-the-wild benchmark, standard datasets like Omni3D, and zero-shot transfer tasks to validate its open-vocabulary capabilities and geometric accuracy. The experiments demonstrate that the model significantly outperforms existing methods in detecting diverse, long-tailed object categories and generalizes effectively across different environments. Furthermore, the results show that the architecture successfully leverages optional depth cues to resolve scale ambiguity and provides a versatile foundation for real-world applications in robotics, AR/VR, and mobile computing.

The model uses a three-stage training pipeline, starting from scratch and progressing through data mixing and mask-guided training. Each stage employs specific data combinations and learning rate schedules to gradually improve performance. The training begins from scratch and proceeds in three stages with increasing complexity. Stage 2 combines multiple datasets with a specific data mixing ratio and uses the output of Stage 1 as initialization. Stage 3 uses a mask-guided approach with a different data mix and further refines the model from Stage 2.

Training pipeline stages
Training pipeline stages

The authors evaluate WildDet3D on Argoverse 2 and ScanNet benchmarks, showing significant improvements over prior methods in detection and geometric accuracy. The model achieves higher AP and ODS scores while reducing translation, scale, and orientation errors, particularly when depth information is available. WildDet3D achieves superior detection and geometric accuracy compared to baselines on Argoverse 2 and ScanNet. The model reduces translation, scale, and orientation errors, improving localization precision. Performance gains are more pronounced when depth information is incorporated, especially on ScanNet.

WildDet3D outperforms baselines
WildDet3D outperforms baselines

The authors evaluate WildDet3D on WildDet3D-Bench, demonstrating significant improvements over baselines across different training data and prompt modalities. Results show that incorporating additional data and ground-truth depth leads to substantial performance gains, particularly for rare and common categories. WildDet3D achieves the highest performance on WildDet3D-Bench across all categories and prompt types. The inclusion of additional training data and ground-truth depth significantly improves detection accuracy. Performance gains are most pronounced on rare and common categories, highlighting strong generalization to unseen classes.

WildDet3D performance on WildDet3D-Bench
WildDet3D performance on WildDet3D-Bench

The authors evaluate WildDet3D on multiple benchmarks, showing superior performance compared to existing methods. Results indicate that incorporating depth information significantly enhances detection accuracy, particularly in the box prompt setting. The model achieves strong results across various datasets, demonstrating generalization and robustness. WildDet3D achieves higher AP3D than all baselines across multiple datasets. Performance improves substantially when depth is provided, especially in the box prompt setting. The model shows consistent gains on both text and box prompt settings, with the largest improvements on rare categories.

WildDet3D outperforms baselines
WildDet3D outperforms baselines

The the the table presents validation results for the annotation pipeline, showing how different candidate models are selected and rejected by human annotators, and how VLM scores correlate with human judgment. The data indicates that model quality varies significantly across candidates and that VLM scoring effectively predicts human acceptance, though it cannot fully replace human evaluation. Human annotators select and reject candidate models at rates that vary substantially across different methods. VLM scores show a perfect monotonic correlation with human rejection rates, indicating strong predictive power. Despite strong correlation, VLM scoring alone cannot replace human judgment due to a significant gap in rejection rates even at high scores.

Pipeline validation results
Pipeline validation results

The model is evaluated through a multi-stage training pipeline and tested across various benchmarks including Argoverse 2, ScanNet, and WildDet3D-Bench to validate detection accuracy and geometric precision. The results demonstrate that incorporating depth information and diverse training data significantly enhances localization and generalization, particularly for rare categories. Additionally, an annotation pipeline validation shows that while VLM scores correlate strongly with human judgment, they serve as a predictive tool rather than a complete replacement for human evaluation.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp