
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions

Yukang Cao Haozhe Xie Fangzhou Hong Long Zhuo Zhaoxi Chen Liang Pan Ziwei Liu

Abstract

We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions frequently violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To close this gap, we introduce a physically grounded, bi-directional optimization pipeline that uses the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ scene-targeted reinforcement learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which uses simulation feedback on gravitational stability and interaction success to refine the scene geometry. Furthermore, we present HSIBench, a new benchmark with diverse objects and interaction scenarios. Comprehensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions, which can be deployed directly on real humanoid robots.

One-sentence Summary

Researchers from Nanyang Technological University, ACE Robotics, and Shanghai AI Laboratory propose HSImul3R, a unified framework that bridges the perception-simulation gap by using a bi-directional optimization pipeline to refine human motion and scene geometry for stable, simulation-ready human-scene interaction reconstruction from casual captures.

Key Contributions

  • The paper introduces HSImul3R, a unified framework that bridges the perception-simulation gap by employing a physically-grounded bi-directional optimization pipeline where a physics simulator acts as an active supervisor to jointly refine human dynamics and scene geometry.
  • The method implements Scene-targeted Reinforcement Learning for forward motion optimization and Direct Simulation Reward Optimization for reverse geometry refinement, leveraging simulation feedback on gravitational stability and contact constraints to ensure physical validity.
  • This work presents HSIBench, a new benchmark dataset containing diverse objects and interaction scenarios, and demonstrates through extensive experiments that the approach produces the first stable, simulation-ready reconstructions capable of direct deployment on real-world humanoid robots.

Introduction

Embodied AI requires physically valid human-scene interaction data to bridge the gap between visual observation and real-world robotic deployment. Prior methods often produce visually plausible reconstructions that fail in physics engines because they treat human motion and scene geometry as separate problems or optimize solely for 2D image alignment. The authors introduce HSImul3R, a unified framework that uses a physics simulator as an active supervisor to jointly refine human dynamics and scene geometry through a bi-directional optimization pipeline. This approach leverages scene-targeted reinforcement learning to stabilize human motion and direct simulation reward optimization to correct scene geometry, resulting in the first stable, simulation-ready reconstructions that can be directly deployed on humanoid robots.

Method

The proposed method, HSImul3R, reconstructs simulation-ready human-scene interactions from casual captures through a bi-directional optimization pipeline. As shown in the figure below, the framework integrates a forward-pass for motion refinement and a reverse-pass for object geometry correction.

The process begins with the independent reconstruction of static scene geometry and dynamic human motion. The authors utilize DUSt3R for scene structure recovery and employ tools like SAM2, 4DHumans, and ViTPose for human motion estimation. To address the lack of 3D geometric awareness in standard alignment methods, they introduce an explicit 3D structural prior derived from image-to-3D generative models. This step refines the scene geometry and enforces robust interaction constraints. Specifically, the authors optimize the position of the recovered human and generated objects using distinct loss functions for contact and non-contact scenarios. For non-contact cases, the loss minimizes the distance between the closest human body part and object vertices. For contact cases, the loss penalizes penetration depth using a signed distance function.
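The two interaction losses described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the brute-force nearest-vertex search, and the analytic sphere SDF used in the usage example are all assumptions for the sake of a runnable example.

```python
import numpy as np

def non_contact_loss(human_verts, object_verts):
    """Non-contact case (sketch): minimize the distance between the
    closest human body vertex and the object vertices."""
    # Pairwise distances between every human vertex and object vertex
    diffs = human_verts[:, None, :] - object_verts[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Loss is the distance of the closest pair
    return dists.min()

def contact_loss(human_verts, object_sdf):
    """Contact case (sketch): penalize penetration depth via a signed
    distance function. `object_sdf` maps a 3D point to its signed
    distance, negative inside the object."""
    sdf_vals = np.array([object_sdf(v) for v in human_verts])
    # Only vertices inside the object (sdf < 0) contribute penetration depth
    return np.sum(np.clip(-sdf_vals, 0.0, None))
```

As a usage example, a unit-sphere SDF `lambda p: np.linalg.norm(p) - 1.0` makes a human vertex at the origin contribute a penetration depth of 1, while a vertex outside the sphere contributes nothing.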

Following the initial reconstruction, the method employs a forward-pass optimization to ensure stable dynamics. This stage uses a scene-targeted reinforcement learning scheme. The authors introduce a supervision signal that enforces spatial proximity between the humanoid and scene objects, encouraging physically plausible contact. This is achieved by minimizing a loss function $\ell_{scene}$, defined as:

scene=1NcontactNsurfi=1Ncontacti=1Nsurfμiokjh22\ell_{scene} = \frac{1}{N_{contact} \cdot N_{surf}} \cdot \sum_{i=1}^{N_{contact}} \sum_{i=1}^{N_{surf}} \| \mu_{i}^{o} - k_{j}^{h} \|_{2}^{2}scene=NcontactNsurf1i=1Ncontacti=1Nsurfμiokjh22

where $N_{contact}$ is the number of contacts between the human and scene objects, and $N_{surf}$ denotes the number of sampled object surface points within the local contact region.
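The scene-targeted loss above reduces to a mean of squared distances over all contact-surface point pairs, which broadcasting computes directly. A minimal sketch, assuming `mu_o` holds the sampled object surface points and `k_h` the humanoid contact points (the array names are illustrative, not from the paper):

```python
import numpy as np

def scene_loss(mu_o, k_h):
    """Sketch of l_scene: mean squared L2 distance between sampled
    object surface points mu_o [N_surf, 3] and humanoid contact
    points k_h [N_contact, 3], averaged over all pairs."""
    n_contact, n_surf = k_h.shape[0], mu_o.shape[0]
    # Broadcast to all (contact, surface) pairs: [N_contact, N_surf, 3]
    diffs = mu_o[None, :, :] - k_h[:, None, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return sq_dists.sum() / (n_contact * n_surf)
```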

To further rectify structural correctness, a reverse-pass optimization is introduced. This process leverages simulator feedback regarding physical stability to refine the 3D object generation. The authors propose Direct Simulation Reward Optimization (DSRO), which uses the outcome of the simulation as a supervision signal. The DSRO objective incorporates a stability label $l(x_0)$, which is determined by whether the object remains upright under gravity and achieves a stable final state during interaction. The stability is defined as:

$$l(x_0) = \begin{cases} 1, & \text{if stable} \\ 0, & \text{otherwise} \end{cases}$$

This allows the system to fine-tune the generated objects to eliminate artifacts like missing legs or surface distortions that would otherwise cause simulation failure.
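In practice, such a binary stability label would be read off the simulator's final state. The following sketch assumes the simulator exposes the object's final up-axis and velocity; the tilt and velocity thresholds are illustrative assumptions, not values from the paper:

```python
import numpy as np

def stability_label(final_up_vector, final_velocity,
                    tilt_tol_deg=15.0, vel_tol=1e-2):
    """Hypothetical l(x0): 1 if the object ends the simulation roughly
    upright (tilt below tilt_tol_deg) and settled (near-zero velocity),
    0 otherwise. Thresholds are illustrative."""
    up = np.asarray(final_up_vector, dtype=float)
    up = up / np.linalg.norm(up)
    # Angle between the object's up-axis and the world z-axis
    tilt_deg = np.degrees(np.arccos(np.clip(up[2], -1.0, 1.0)))
    settled = np.linalg.norm(final_velocity) < vel_tol
    return 1 if (tilt_deg < tilt_tol_deg and settled) else 0
```

An object that finishes upright and at rest gets label 1; one that topples onto its side (up-axis orthogonal to gravity) gets label 0, signaling DSRO to correct the geometry.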

Experiment

  • Reconstruction and simulation experiments demonstrate that the proposed method significantly outperforms existing baselines and variants by achieving stable human-scene interactions, minimizing physical penetration, and preserving meaningful contact states.
  • Qualitative comparisons reveal that the approach generates geometrically accurate object structures with fewer distortions, effectively preventing the unintended object displacement and interaction failures observed in baseline methods.
  • Ablation studies confirm that the scene-targeted simulation loss and the DSRO fine-tuning strategy are critical for maintaining interaction stability and preventing exaggerated motions that lead to object displacement.
  • Real-world deployment on Unitree G1 humanoid robots validates that the refined motions can be successfully transferred to physical hardware to execute complex interaction scenarios.
  • Analysis of input views indicates that while additional views slightly improve motion quality, they have minimal impact on simulation stability or penetration handling.
