HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions

Yukang Cao Haozhe Xie Fangzhou Hong Long Zhuo Zhaoxi Chen Liang Pan Ziwei Liu

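Abstract

We present HSImul3R, a unified framework that turns casual captures, namely sparse-view images and monocular videos, of human-scene interactions (HSI) into simulation-ready 3D reconstructions. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, which causes instability in physics engines and failures in embodied AI applications. To close this gap, we introduce a physically grounded bi-directional optimization pipeline that uses the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we apply Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which refines scene geometry using simulation feedback on gravitational stability and interaction success. We also present HSIBench, a new benchmark covering diverse objects and interaction scenarios. Extensive experiments show that HSImul3R produces the first stable, simulation-ready HSI reconstructions and that its results can be deployed directly on real-world humanoid robots.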

One-sentence Summary

Researchers from Nanyang Technological University, ACE Robotics, and Shanghai AI Laboratory propose HSImul3R, a unified framework that bridges the perception-simulation gap by using a bi-directional optimization pipeline to refine human motion and scene geometry for stable, simulation-ready human-scene interaction reconstruction from casual captures.

Key Contributions

  • The paper introduces HSImul3R, a unified framework that bridges the perception-simulation gap by employing a physically-grounded bi-directional optimization pipeline where a physics simulator acts as an active supervisor to jointly refine human dynamics and scene geometry.
  • The method implements Scene-targeted Reinforcement Learning for forward motion optimization and Direct Simulation Reward Optimization for reverse geometry refinement, leveraging simulation feedback on gravitational stability and contact constraints to ensure physical validity.
  • This work presents HSIBench, a new benchmark dataset containing diverse objects and interaction scenarios, and demonstrates through extensive experiments that the approach produces the first stable, simulation-ready reconstructions capable of direct deployment on real-world humanoid robots.

Introduction

Embodied AI requires physically valid human-scene interaction data to bridge the gap between visual observation and real-world robotic deployment. Prior methods often produce visually plausible reconstructions that fail in physics engines because they treat human motion and scene geometry as separate problems or optimize solely for 2D image alignment. The authors introduce HSImul3R, a unified framework that uses a physics simulator as an active supervisor to jointly refine human dynamics and scene geometry through a bi-directional optimization pipeline. This approach leverages scene-targeted reinforcement learning to stabilize human motion and direct simulation reward optimization to correct scene geometry, resulting in the first stable, simulation-ready reconstructions that can be directly deployed on humanoid robots.

Method

The proposed method, HSImul3R, reconstructs simulation-ready human-scene interactions from casual captures through a bi-directional optimization pipeline. As shown in the figure below, the framework combines a forward pass for motion refinement with a reverse pass for object geometry correction.

The process begins with the independent reconstruction of static scene geometry and dynamic human motion. The authors utilize DUSt3R for scene structure recovery and employ tools like SAM2, 4DHumans, and ViTPose for human motion estimation. To address the lack of 3D geometric awareness in standard alignment methods, they introduce an explicit 3D structural prior derived from image-to-3D generative models. This step refines the scene geometry and enforces robust interaction constraints. Specifically, the authors optimize the position of the recovered human and generated objects using distinct loss functions for contact and non-contact scenarios. For non-contact cases, the loss minimizes the distance between the closest human body part and object vertices. For contact cases, the loss penalizes penetration depth using a signed distance function.
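The two alignment losses described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a sphere as the object, and the averaging of penetration depths are all assumptions made for illustration, and the actual losses operate on reconstructed meshes.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere (hypothetical stand-in for an object SDF).
    Negative inside the object, positive outside."""
    return math.dist(point, center) - radius

def non_contact_loss(body_points, object_vertices):
    """Squared distance between the closest (body point, object vertex) pair."""
    return min(
        math.dist(p, v) ** 2
        for p in body_points
        for v in object_vertices
    )

def contact_loss(body_points, sdf):
    """Mean penetration depth; only points with negative signed distance contribute."""
    depths = [max(0.0, -sdf(p)) for p in body_points]
    return sum(depths) / len(depths)

# Toy data: two body points, two object vertices, and a unit sphere at the origin.
body = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
vertices = [(3.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
gap = non_contact_loss(body, vertices)

probe = [(0.5, 0.0, 0.0), (2.0, 0.0, 0.0)]
penetration = contact_loss(probe, lambda p: sphere_sdf(p, (0.0, 0.0, 0.0), 1.0))
```

In this toy setup only the probe point at distance 0.5 from the sphere center lies inside the object, so it alone contributes to the penetration term.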

Following the initial reconstruction, the method employs a forward-pass optimization to ensure stable dynamics. This stage uses a scene-targeted reinforcement learning scheme. The authors introduce a supervision signal that enforces spatial proximity between the humanoid and scene objects, encouraging physically plausible contact. This is achieved by minimizing a loss function $\ell_{scene}$, defined as:

$$\ell_{scene} = \frac{1}{N_{contact} \cdot N_{surf}} \sum_{i=1}^{N_{contact}} \sum_{j=1}^{N_{surf}} \left\| \mu_{i}^{o} - k_{j}^{h} \right\|_{2}^{2}$$

where $N_{contact}$ is the number of contacts between the human and scene objects, and $N_{surf}$ denotes the number of sampled object surface points within the local contact region.
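Read literally, the loss above is a mean of squared Euclidean distances over all pairs drawn from two point sets. The sketch below computes exactly that quantity. The text does not define the two point symbols, so treating them as an object-side set and a human-side set, as well as the function and argument names, is an assumption.

```python
def scene_loss(object_points, human_points):
    """Mean squared distance over all (object point, human point) pairs.

    The normalizer len(object_points) * len(human_points) plays the role of
    N_contact * N_surf in the formula. Which set corresponds to which count
    is an assumption, since the text leaves the point symbols undefined.
    """
    total = 0.0
    for mu in object_points:
        for k in human_points:
            # Squared L2 distance between the two 3D points.
            total += sum((a - b) ** 2 for a, b in zip(mu, k))
    return total / (len(object_points) * len(human_points))

# Toy example with two points on each side.
mu_o = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
k_h = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
loss = scene_loss(mu_o, k_h)
```

The four pairwise squared distances here are 0, 4, 1, and 5, so the loss is their mean. The value shrinks toward zero as the human points move onto the sampled object points, which is the proximity behavior the supervision signal is meant to encourage.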

To further rectify structural correctness, a reverse-pass optimization is introduced. This process leverages simulator feedback regarding physical stability to refine the 3D object generation. The authors propose Direct Simulation Reward Optimization (DSRO), which uses the outcome of the simulation as a supervision signal. The DSRO objective incorporates a stability label $l(x_0)$, which is determined by whether the object remains upright under gravity and achieves a stable final state during interaction. The stability is defined as:

$$l(x_0) = \begin{cases} 1, & \text{if stable} \\ 0, & \text{otherwise} \end{cases}$$

This allows the system to fine-tune the generated objects to eliminate artifacts like missing legs or surface distortions that would otherwise cause simulation failure.
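One way such a binary label could be computed from a simulation rollout is sketched below. The source only states that the object must stay upright under gravity and reach a stable final state, so the specific test (tilt of the object's up axis plus a residual-speed check), the thresholds, and all names are hypothetical.

```python
import math

def stability_label(final_up_axis, final_speed, max_tilt_deg=15.0, max_speed=0.05):
    """Return 1 if the object ends upright and at rest, else 0.

    final_up_axis: the object's local up vector in world coordinates after
                   the rollout (world up is assumed to be +z).
    final_speed:   magnitude of the object's linear velocity at the last step.
    Both thresholds are illustrative values, not taken from the paper.
    """
    norm = math.sqrt(sum(c * c for c in final_up_axis))
    if norm == 0.0:
        return 0  # degenerate orientation, treated as unstable
    # Cosine of the angle between the object's up axis and world up.
    cos_tilt = max(-1.0, min(1.0, final_up_axis[2] / norm))
    tilt_deg = math.degrees(math.acos(cos_tilt))
    upright = tilt_deg <= max_tilt_deg
    at_rest = final_speed <= max_speed
    return 1 if (upright and at_rest) else 0

label_upright = stability_label((0.0, 0.0, 1.0), 0.0)
label_toppled = stability_label((1.0, 0.0, 0.0), 0.0)
label_moving = stability_label((0.0, 0.0, 1.0), 1.0)
```

A label of this kind gives the fine-tuning stage a per-sample signal: generated objects that topple or keep moving are marked 0, and those that settle upright are marked 1.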

Experiment

  • Reconstruction and simulation experiments demonstrate that the proposed method significantly outperforms existing baselines and variants by achieving stable human-scene interactions, minimizing physical penetration, and preserving meaningful contact states.
  • Qualitative comparisons reveal that the approach generates geometrically accurate object structures with fewer distortions, effectively preventing the unintended object displacement and interaction failures observed in baseline methods.
  • Ablation studies confirm that the scene-targeted simulation loss and the DSRO fine-tuning strategy are critical for maintaining interaction stability and preventing exaggerated motions that lead to object displacement.
  • Real-world deployment on Unitree G1 humanoid robots validates that the refined motions can be successfully transferred to physical hardware to execute complex interaction scenarios.
  • Analysis of input views indicates that while additional views slightly improve motion quality, they have minimal impact on simulation stability or penetration handling.
