Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Abstract

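Progress in spatial intelligence fundamentally depends on access to large-scale, fine-grained 3D data. However, existing work mainly builds spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, the scalability of these methods is severely limited, and model performance is further hampered by the domain gap inherent in narrowly curated datasets.

In this paper, we introduce Holi-Spatial, the first fully automated, large-scale, spatially aware multimodal dataset, built from raw video inputs alone without human intervention using the proposed data curation pipeline. Holi-Spatial supports multiple levels of spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps, to object-level and relational semantic annotations, to the corresponding spatial QA pairs.

Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset. It contains 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning geometric, relational, and semantic reasoning tasks.

Holi-Spatial shows strong data curation quality, substantially outperforming existing feed-forward and per-scene optimization methods on datasets such as ScanNet, ScanNet++, and DL3DV. In addition, fine-tuning vision-language models (VLMs) on spatial reasoning tasks with this dataset yields large performance gains.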

One-sentence Summary

Researchers from Shanghai AI Lab and multiple universities introduce Holi-Spatial, a fully automated pipeline that converts raw videos into high-fidelity 3D scenes using 3D Gaussian Splatting and VLMs. This approach overcomes manual annotation limits to create the Holi-Spatial-4M dataset, significantly boosting spatial reasoning and grounding in Vision-Language Models.

Key Contributions

  • Holi-Spatial addresses the scarcity and imbalance of raw spatial data by introducing a fully automated framework that converts raw video streams into high-fidelity 3D geometry and holistic semantic annotations without requiring explicit 3D sensors or human-in-the-loop labeling.
  • The method employs a three-stage pipeline combining geometric optimization with 3D Gaussian Splatting, image-level perception using open-vocabulary VLMs and SAM3, and scene-level refinement to merge instances and generate detailed captions and grounding pairs.
  • Evaluation on benchmarks like ScanNet++ and DL3DV-10K demonstrates that the resulting Holi-Spatial-4M dataset improves multi-view depth estimation by up to 0.5 F1 and boosts 3D detection AP50 by 64%, while fine-tuning Qwen3-VL yields a 15% gain in 3D grounding accuracy.

Introduction

Spatial intelligence is essential for enabling large multimodal models to perceive and reason about the real 3D world, which is critical for applications like robotic manipulation, navigation, and augmented reality. Current approaches struggle with scalability because they depend on scarce, manually annotated 3D datasets or specialized scanning hardware, resulting in limited semantic coverage and high annotation costs. To overcome these barriers, the authors introduce Holi-Spatial, a fully automated framework that converts raw video streams into high-fidelity 3D geometry with holistic semantic annotations without requiring human labeling or explicit 3D sensors. This system unifies geometric optimization, image-level perception, and scene-level refinement to generate a massive, diverse dataset that significantly improves the performance of downstream 3D grounding and spatial reasoning tasks.

Dataset

  • Dataset Composition and Sources: The authors introduce Holi-Spatial-4M, a fully automated, large-scale dataset derived from raw video streams sourced from ScanNet, ScanNet++, and DL3DV-10K. This collection represents the first large-scale 3D semantic dataset constructed without human intervention, featuring over 12,000 optimized 3D Gaussian Splatting (3DGS) scenes.

  • Key Details for Each Subset: The dataset contains more than 4 million high-quality spatial annotations, including 1.3 million 2D instance masks, 320,000 3D bounding boxes, 320,000 detailed instance captions, and 1.2 million 3D grounding instances. It supports open-vocabulary diversity by leveraging Vision-Language Models to annotate a vast array of fine-grained indoor items rather than relying on a closed set of categories.

  • Usage in Model Training: The authors utilize the data to fine-tune Vision-Language Models for robust spatial reasoning. The 1.25 million generated Spatial Question-Answering pairs are structured into a comprehensive taxonomy covering Camera-centric tasks (such as rotation and movement direction) and Object-centric tasks (including object-to-object distance and size measurement).

  • Processing and Construction Details: The pipeline converts raw video inputs into holistic 3D spatial annotations through geometric optimization and scene-level refinement stages. This automated process generates multi-level supervision ranging from geometrically accurate 3DGS reconstructions with rendered depth maps to object-level and relational semantic annotations, ensuring a balanced distribution of tasks for holistic 3D space understanding.

Method

The authors present Holi-Spatial, a fully automated pipeline designed to transform raw video inputs into high-fidelity 3D geometry and comprehensive spatial annotations. As illustrated in the framework overview, the system supports multi-modal tasks ranging from 3D object detection and reconstruction to spatial reasoning and 3D grounding, ultimately generating over 4 million labels.

The curation framework consists of three core stages. The first stage, Geometric Optimization, distills raw video streams into robust 3D structures. The authors first employ Structure-from-Motion to resolve camera intrinsics and extrinsics, followed by leveraging a spatial foundation model to initialize a dense point cloud. To address noise and outliers inherent in feed-forward depth estimations, the method incorporates 3D Gaussian Splatting (3DGS) for per-scene optimization. This process integrates geometric regularization to enforce multi-view depth consistency, effectively eliminating large-scale floaters that would otherwise interfere with 3D bounding box generation.

The second stage, Image-level Perception, extracts spatially consistent object labels. Keyframes are uniformly sampled from the video stream, and a Vision-Language Model (VLM) generates captions while maintaining a dynamic class-label memory to ensure semantic consistency. Guided by these prompts, SAM3 performs open-vocabulary instance segmentation to produce binary masks and confidence scores.
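The two bookkeeping steps in this stage, uniform keyframe sampling and a class-label memory, can be illustrated with a minimal Python sketch. The names `sample_keyframes` and `LabelMemory` are hypothetical, and the normalization rule (folding case and whitespace) is an assumption; the source does not say how the memory reconciles labels.

```python
from typing import Dict, List


def sample_keyframes(num_frames: int, num_keyframes: int) -> List[int]:
    """Return evenly spaced frame indices (hypothetical helper)."""
    if num_frames <= 0 or num_keyframes <= 0:
        return []
    if num_keyframes >= num_frames:
        return list(range(num_frames))
    # Integer arithmetic keeps the spacing exact and avoids float rounding.
    return [(i * num_frames) // num_keyframes for i in range(num_keyframes)]


class LabelMemory:
    """Toy class-label memory: reuses the first-seen spelling of each label."""

    def __init__(self) -> None:
        self._canonical: Dict[str, str] = {}

    def canonical(self, label: str) -> str:
        # Assumed normalization: lowercase and collapse whitespace.
        key = " ".join(label.lower().split())
        return self._canonical.setdefault(key, label.strip())

    def labels(self) -> List[str]:
        return list(self._canonical.values())
```

In a real pipeline the memory would presumably be fed back into the VLM prompt so that later keyframes reuse earlier class names; that wiring is omitted here.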

To lift these 2D masks into 3D, the authors unproject pixels using the refined depth map rendered from the optimized 3DGS. The 3D point $\mathbf{P}$ is calculated as $\mathbf{P} = D_t(\mathbf{u}) \cdot \mathbf{K}^{-1}\tilde{\mathbf{u}}$, where $\mathbf{K}$ is the camera intrinsic matrix and $\tilde{\mathbf{u}}$ represents the homogeneous coordinate. To mitigate depth floaters and boundary errors, a geometry-aware filtering strategy is applied. This involves eroding object masks near contours to remove 2D boundary errors and using multi-view-consistent mesh depth to filter 3D outliers, ensuring the estimated initial 3D Oriented Bounding Boxes (OBBs) are derived from a reliable geometry subset.
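The unprojection step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a zero-skew pinhole camera, so that $\mathbf{K}^{-1}\tilde{\mathbf{u}}$ has a closed form, and the names `Intrinsics`, `unproject_pixel`, and `unproject_mask` are hypothetical. Mask erosion and mesh-depth outlier filtering are not shown.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Zero-skew pinhole intrinsics (an assumption; K may be more general)."""
    fx: float
    fy: float
    cx: float
    cy: float


def unproject_pixel(u: float, v: float, depth: float, K: Intrinsics) -> Point3D:
    """Return P = D(u) * K^{-1} * [u, v, 1]^T in the camera frame."""
    # For zero-skew K, K^{-1} [u, v, 1]^T = ((u - cx) / fx, (v - cy) / fy, 1).
    x = (u - K.cx) / K.fx
    y = (v - K.cy) / K.fy
    return (depth * x, depth * y, depth)


def unproject_mask(
    mask: Sequence[Sequence[bool]],
    depth_map: Sequence[Sequence[float]],
    K: Intrinsics,
) -> List[Point3D]:
    """Lift every masked pixel with a valid (positive) depth into 3D."""
    points: List[Point3D] = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth_map)):
        for u, (inside, d) in enumerate(zip(mask_row, depth_row)):
            if inside and d > 0.0:
                points.append(unproject_pixel(u, v, d, K))
    return points
```

The resulting points are in the camera frame; placing them in a shared scene frame would additionally require the camera extrinsics recovered by Structure-from-Motion.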

The final stage is Scene-level Refinement. This coarse-to-fine strategy consolidates redundant detections and verifies instances. First, spatial clustering merges instances that share the same category and have sufficient 3D overlap, defined by the condition $c_i = c_j \land \mathrm{IoU}_{3D}(B_i, B_j) > \tau_{\text{merge}}$, where $\tau_{\text{merge}}$ is set to 0.2. Following merging, a post-processing module aligns the 3D OBBs to the global gravity axis. This involves detecting the floor or a fallback planar structure to infer a global up-axis and re-orienting the vertical axis of each instance.
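A minimal sketch of the merge condition (same category and 3D IoU above 0.2) is given below. Two simplifications are assumptions rather than details from the source: boxes are treated as axis-aligned instead of oriented, and merging is made transitive with a union-find structure. `Box3D`, `iou_3d`, and `merge_clusters` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

TAU_MERGE = 0.2  # merge threshold reported in the text


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned box (a simplification of the oriented boxes in the text)."""
    category: str
    lo: Tuple[float, float, float]
    hi: Tuple[float, float, float]

    def volume(self) -> float:
        vol = 1.0
        for a, b in zip(self.lo, self.hi):
            vol *= max(0.0, b - a)
        return vol


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for a_lo, a_hi, b_lo, b_hi in zip(a.lo, a.hi, b.lo, b.hi):
        inter *= max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
    union = a.volume() + b.volume() - inter
    return inter / union if union > 0.0 else 0.0


def merge_clusters(boxes: List[Box3D], tau: float = TAU_MERGE) -> List[List[int]]:
    """Group box indices that satisfy c_i == c_j and IoU_3D > tau (transitively)."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            same_class = boxes[i].category == boxes[j].category
            if same_class and iou_3d(boxes[i], boxes[j]) > tau:
                parent[find(j)] = find(i)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

The function returns index clusters only; how a merged instance's box is recomputed from its members is not described in the source and is left out.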

Confidence-based filtering is then applied using a tri-level decision rule. Proposals with high confidence ($s_k \geq 0.9$) are kept, while low-confidence noise ($s_k < 0.8$) is discarded. Ambiguous cases undergo verification by a VLM-based agent equipped with zoom-in and re-segmentation tools.
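The tri-level rule can be written as a small routing function. In this sketch the ambiguous band is taken to be the remaining interval $0.8 \leq s_k < 0.9$, and the VLM-based agent is replaced by a caller-supplied callback; `Decision`, `triage`, and `filter_proposals` are hypothetical names.

```python
from enum import Enum
from typing import Callable, List, Sequence

HIGH_CONF = 0.9  # keep at or above this score (from the text)
LOW_CONF = 0.8   # discard below this score (from the text)


class Decision(Enum):
    KEEP = "keep"
    VERIFY = "verify"
    DISCARD = "discard"


def triage(score: float) -> Decision:
    """Map a confidence score to one of the three decisions."""
    if score >= HIGH_CONF:
        return Decision.KEEP
    if score < LOW_CONF:
        return Decision.DISCARD
    return Decision.VERIFY  # remaining band: 0.8 <= score < 0.9


def filter_proposals(
    scores: Sequence[float],
    verify: Callable[[int], bool],
) -> List[int]:
    """Return indices of proposals that survive filtering.

    `verify` stands in for the VLM-based agent: it receives a proposal
    index and returns True if the instance is confirmed.
    """
    kept: List[int] = []
    for idx, score in enumerate(scores):
        decision = triage(score)
        if decision is Decision.KEEP:
            kept.append(idx)
        elif decision is Decision.VERIFY and verify(idx):
            kept.append(idx)
    return kept
```

Passing the verifier as a callback keeps the thresholding logic testable without a model in the loop.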

Upon establishing the final set of validated instances, the system generates dense semantic annotations. It retrieves the optimal source image for each instance and employs a large VLM to generate fine-grained captions and procedurally synthesize spatial QA pairs.
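To illustrate what procedural QA synthesis from validated instances might look like, the sketch below generates object-to-object distance and size questions from 3D box centers and extents. The question templates, the `Instance` fields, and the use of metres are all assumptions for illustration; the source does not give the actual templates.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Instance:
    """Validated 3D instance (hypothetical schema; units assumed to be metres)."""
    name: str
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]  # box extents along its three axes


def distance_qa(instances: List[Instance]) -> List[Dict[str, str]]:
    """One distance question per unordered pair of instances."""
    pairs = []
    for a, b in combinations(instances, 2):
        d = math.dist(a.center, b.center)  # Euclidean distance between centers
        pairs.append({
            "question": f"How far apart are the {a.name} and the {b.name}?",
            "answer": f"{d:.2f} m",
        })
    return pairs


def size_qa(instances: List[Instance]) -> List[Dict[str, str]]:
    """One size question per instance, answered with its longest extent."""
    return [
        {
            "question": f"What is the longest dimension of the {inst.name}?",
            "answer": f"{max(inst.size):.2f} m",
        }
        for inst in instances
    ]
```

Camera-centric questions such as rotation or movement direction would need camera poses as input and are not covered here.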

Experiment

  • Framework evaluation demonstrates that the proposed method uniquely achieves high-quality performance across 3D object detection, 2D segmentation, and depth estimation simultaneously, outperforming single-modality baselines.
  • Qualitative results show the framework produces cleaner 3D geometry with significantly fewer ghosting artifacts and floaters compared to prior works, while generating sharper segmentation boundaries and more accurate 3D bounding boxes.
  • VLM fine-tuning on the curated dataset substantially improves spatial reasoning and 3D grounding capabilities, effectively eliminating viewpoint bias and enabling reliable object localization across different views and depths.
  • Ablation studies confirm that geometric training with refined depth is critical for preventing instance fragmentation and false merging caused by occlusions or depth artifacts.
  • The combination of confidence filtering and agent-based verification successfully balances precision and recall by removing false positives while recovering challenging instances that might otherwise be discarded.
  • Multi-view merging is validated as essential for correcting image-level instance fragmentation and ensuring spatial consistency, leading to robust detection across diverse indoor scenes.
