HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions

Yukang Cao Haozhe Xie Fangzhou Hong Long Zhuo Zhaoxi Chen Liang Pan Ziwei Liu

Abstract

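We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures such as sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, which leads to instability in physics engines and failures in embodied AI applications. To bridge this gap, we introduce a physically grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor and jointly refines human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning, which optimizes human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which refines scene geometry using simulation feedback on gravitational stability and interaction success. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be deployed directly on real-world humanoid robots.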

One-sentence Summary

Researchers from Nanyang Technological University, ACE Robotics, and Shanghai AI Laboratory propose HSImul3R, a unified framework that bridges the perception-simulation gap by using a bi-directional optimization pipeline to refine human motion and scene geometry for stable, simulation-ready human-scene interaction reconstruction from casual captures.

Key Contributions

  • The paper introduces HSImul3R, a unified framework that bridges the perception-simulation gap by employing a physically-grounded bi-directional optimization pipeline where a physics simulator acts as an active supervisor to jointly refine human dynamics and scene geometry.
  • The method implements Scene-targeted Reinforcement Learning for forward motion optimization and Direct Simulation Reward Optimization for reverse geometry refinement, leveraging simulation feedback on gravitational stability and contact constraints to ensure physical validity.
  • This work presents HSIBench, a new benchmark dataset containing diverse objects and interaction scenarios, and demonstrates through extensive experiments that the approach produces the first stable, simulation-ready reconstructions capable of direct deployment on real-world humanoid robots.

Introduction

Embodied AI requires physically valid human-scene interaction data to bridge the gap between visual observation and real-world robotic deployment. Prior methods often produce visually plausible reconstructions that fail in physics engines because they treat human motion and scene geometry as separate problems or optimize solely for 2D image alignment. The authors introduce HSImul3R, a unified framework that uses a physics simulator as an active supervisor to jointly refine human dynamics and scene geometry through a bi-directional optimization pipeline. This approach leverages scene-targeted reinforcement learning to stabilize human motion and direct simulation reward optimization to correct scene geometry, resulting in the first stable, simulation-ready reconstructions that can be directly deployed on humanoid robots.

Method

The proposed method, HSImul3R, reconstructs simulation-ready human-scene interactions from casual captures through a bi-directional optimization pipeline. As shown in the figure below, the framework integrates a forward pass for motion refinement and a reverse pass for object geometry correction.

The process begins with the independent reconstruction of static scene geometry and dynamic human motion. The authors utilize DUSt3R for scene structure recovery and employ tools like SAM2, 4DHumans, and ViTPose for human motion estimation. To address the lack of 3D geometric awareness in standard alignment methods, they introduce an explicit 3D structural prior derived from image-to-3D generative models. This step refines the scene geometry and enforces robust interaction constraints. Specifically, the authors optimize the position of the recovered human and generated objects using distinct loss functions for contact and non-contact scenarios. For non-contact cases, the loss minimizes the distance between the closest human body part and object vertices. For contact cases, the loss penalizes penetration depth using a signed distance function.
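
The two alignment losses described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of an analytic sphere as the object's signed distance function, and the averaging of penetration depths are all assumptions made for illustration.

```python
import math

def sphere_sdf(point, center, radius):
    """Hypothetical stand-in for an object SDF: negative inside, positive outside."""
    return math.dist(point, center) - radius

def non_contact_loss(body_points, object_vertices):
    """Distance between the closest pair of (human body point, object vertex)."""
    return min(
        math.dist(p, v)
        for p in body_points
        for v in object_vertices
    )

def contact_loss(body_points, sdf):
    """Mean penetration depth; only points inside the object (sdf < 0) are penalized."""
    depths = [max(0.0, -sdf(p)) for p in body_points]
    return sum(depths) / len(depths)

# Toy data: two body points and a unit sphere at the origin.
body = [(0.0, 0.0, 2.0), (0.0, 0.0, 0.5)]
vertices = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
unit_sphere = lambda p: sphere_sdf(p, (0.0, 0.0, 0.0), 1.0)

gap = non_contact_loss(body, vertices)
penetration = contact_loss(body, unit_sphere)
```

In this toy setup the second body point lies inside the sphere, so only it contributes to the penetration term, while the non-contact term reports the smallest gap to any vertex.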

Following the initial reconstruction, the method employs a forward-pass optimization to ensure stable dynamics. This stage uses a scene-targeted reinforcement learning scheme. The authors introduce a supervision signal that enforces spatial proximity between the humanoid and scene objects, encouraging physically plausible contact. This is achieved by minimizing a loss function $\ell_{scene}$, defined as:

$$\ell_{scene} = \frac{1}{N_{contact} \cdot N_{surf}} \sum_{j=1}^{N_{contact}} \sum_{i=1}^{N_{surf}} \| \mu_{i}^{o} - k_{j}^{h} \|_{2}^{2}$$

where $N_{contact}$ is the number of contacts between the human and scene objects, and $N_{surf}$ denotes the number of sampled object surface points within the local contact region.
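
As a rough illustration of how this loss could be evaluated, the sketch below reads the double sum as running over contacts and, within each contact, over the sampled surface points. It assumes that $k^{h}$ is a humanoid keypoint and $\mu^{o}$ an object surface point, and that every contact has the same number of samples; the function name and data layout are hypothetical.

```python
def scene_loss(contacts):
    """Mean squared distance between humanoid keypoints and object surface samples.

    contacts: list of (keypoint, surface_points) pairs, one per contact, where
    keypoint is an (x, y, z) tuple and surface_points is a list of N_surf tuples.
    """
    if not contacts:
        return 0.0
    n_surf = len(contacts[0][1])
    if n_surf == 0 or any(len(points) != n_surf for _, points in contacts):
        raise ValueError("every contact needs the same non-zero number of samples")

    total = 0.0
    for keypoint, points in contacts:
        for mu in points:
            # Squared Euclidean distance between one surface sample and the keypoint.
            total += sum((m - k) ** 2 for m, k in zip(mu, keypoint))
    return total / (len(contacts) * n_surf)

# Toy data: two contacts with two surface samples each.
toy_contacts = [
    ((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]),
    ((1.0, 1.0, 1.0), [(1.0, 1.0, 1.0), (1.0, 1.0, 3.0)]),
]
loss = scene_loss(toy_contacts)
```

The loss drops to zero only when every sampled surface point coincides with its keypoint, which is why it pulls the humanoid toward the contact region.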

To further rectify structural correctness, a reverse-pass optimization is introduced. This process leverages simulator feedback regarding physical stability to refine the 3D object generation. The authors propose Direct Simulation Reward Optimization (DSRO), which uses the outcome of the simulation as a supervision signal. The DSRO objective incorporates a stability label $l(x_0)$, which is determined by whether the object remains upright under gravity and achieves a stable final state during interaction. The stability is defined as:

$$l(x_0) = \begin{cases} 1, & \text{if stable} \\ 0, & \text{otherwise} \end{cases}$$

This allows the system to fine-tune the generated objects to eliminate artifacts like missing legs or surface distortions that would otherwise cause simulation failure.
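
One way such a binary stability label might be derived from a simulated rollout is sketched below. The summary does not specify the criteria, so the tilt threshold, the drift tolerance, the settling window, and the function names are all illustrative assumptions; only the two conditions (upright under gravity, stable final state) come from the text.

```python
import math

def tilt_deg(up):
    """Angle in degrees between the object's up axis and the world z axis."""
    norm = math.sqrt(sum(c * c for c in up))
    cos_t = max(-1.0, min(1.0, up[2] / norm))
    return math.degrees(math.acos(cos_t))

def stability_label(up_vectors, positions,
                    max_tilt_deg=15.0, max_drift=0.01, settle_frames=5):
    """Return 1 if the object stays upright and comes to rest, else 0.

    up_vectors: per-frame object up axis in world coordinates.
    positions:  per-frame object position in world coordinates.
    All thresholds are hypothetical and would need tuning per simulator.
    """
    upright = all(tilt_deg(u) <= max_tilt_deg for u in up_vectors)
    tail = positions[-settle_frames:]
    # "Stable final state": the last frames stay close to the final position.
    settled = all(math.dist(p, tail[-1]) <= max_drift for p in tail)
    return 1 if (upright and settled) else 0

# Toy rollouts of ten frames each.
still = [(0.0, 0.0, 0.5)] * 10
upright_axes = [(0.0, 0.0, 1.0)] * 10
toppled_axes = [(0.0, 0.0, 1.0)] * 5 + [(1.0, 0.0, 0.0)] * 5
sliding = [(0.1 * i, 0.0, 0.5) for i in range(10)]

label_stable = stability_label(upright_axes, still)
label_toppled = stability_label(toppled_axes, still)
label_sliding = stability_label(upright_axes, sliding)
```

A label of this kind could then serve as the per-sample supervision signal when fine-tuning the object generator, with unstable samples acting as negatives.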

Experiment

  • Reconstruction and simulation experiments demonstrate that the proposed method significantly outperforms existing baselines and variants by achieving stable human-scene interactions, minimizing physical penetration, and preserving meaningful contact states.
  • Qualitative comparisons reveal that the approach generates geometrically accurate object structures with fewer distortions, effectively preventing the unintended object displacement and interaction failures observed in baseline methods.
  • Ablation studies confirm that the scene-targeted simulation loss and the DSRO fine-tuning strategy are critical for maintaining interaction stability and preventing exaggerated motions that lead to object displacement.
  • Real-world deployment on Unitree G1 humanoid robots validates that the refined motions can be successfully transferred to physical hardware to execute complex interaction scenarios.
  • Analysis of input views indicates that while additional views slightly improve motion quality, they have minimal impact on simulation stability or penetration handling.
