DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

Hyeonwoo Kim, Jeonghwan Kim, Kyungwon Cho, Hanbyul Joo

Abstract

Recent advances in video generative models have made it possible to synthesize realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. The rich interaction knowledge embedded in these synthetic videos holds great potential for motion planning in dexterous robotic manipulation, but their low physical fidelity and purely 2D nature make them difficult to use directly as imitation targets for physics-based character control. This paper proposes DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generated 2D cues, the work introduces a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike existing methods that rely on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, which enables zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, with particularly strong results in modeling dexterous hand-object interactions. The effectiveness of DeVI is further validated in multi-object scenes and in text-driven action diversity, showing the benefit of using video as a human-object interaction (HOI)-aware motion planner.

One-sentence Summary

The DeVI framework enables physically plausible dexterous robotic manipulation by imitating text-conditioned synthetic videos through a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking, achieving superior zero-shot generalization across diverse objects compared to existing methods that rely on 3D kinematic demonstrations.

Key Contributions

  • The paper introduces DeVI, a framework that utilizes video diffusion models as motion planners to enable physically plausible dexterous human-object interaction in physics simulations without requiring high-quality 3D motion capture data.
  • This work presents a hybrid tracking reward system that combines 3D human tracking with robust 2D object tracking and visual HOI alignment to reconstruct object-aligned motion without the need for 6D object pose estimation.
  • Extensive experiments demonstrate that the method achieves zero-shot generalization across diverse objects and outperforms existing approaches that rely on 3D kinematic demonstrations, particularly in modeling complex dexterous hand-object interactions.

Introduction

Achieving realistic dexterous human-object interaction (HOI) in physics simulations is essential for training robotic agents to perform complex tasks like grasping or functional manipulation. While video generative models can synthesize diverse interaction scenarios, they produce 2D content that lacks the physical fidelity and 3D depth required for direct imitation in a physics engine. Existing imitation methods often rely on high-quality 3D motion capture data, which is difficult to scale across diverse objects and complex movements. The authors propose DeVI, a framework that uses text-conditioned synthetic videos as an interaction-aware motion planner. They introduce a hybrid tracking reward that combines 3D human tracking with robust 2D object tracking to overcome the imprecision of 2D generative cues, enabling zero-shot generalization for dexterous manipulation without requiring 3D demonstrations.

Method

The framework of DeVI (Dexterous Video Imitation) is designed to learn a humanoid control policy that enables a simulated character to imitate complex human-object interactions (HOI) from 2D video inputs. The overall process begins with a 3D scene initialization, where a human, parameterized by the SMPL-X model, and a target object are placed in a tabletop environment. To enhance realism for video synthesis, the SMPL-X mesh is replaced with a textured human mesh from the THuman2.0 dataset, which is deformed via linear blend skinning to match the pose of the initial SMPL-X model. This textured scene is then rendered from a set of predefined camera viewpoints to generate a 2D image, which serves as the starting point for video generation.
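
The linear blend skinning step mentioned above can be illustrated with a minimal sketch. It is the standard textbook formulation (each vertex is moved by every bone transform and the results are blended by skinning weights), not code from the paper; the function names, the 3x4 transform layout, and the two-bone example are assumptions made for illustration.

```python
def transform(mat, v):
    """Apply a 3x4 rigid transform [R | t] to a 3D point."""
    return tuple(
        sum(mat[r][c] * v[c] for c in range(3)) + mat[r][3] for r in range(3)
    )


def linear_blend_skinning(vertex, bone_transforms, weights):
    """Blend the per-bone transformed positions of one vertex by its weights."""
    if len(bone_transforms) != len(weights):
        raise ValueError("need exactly one weight per bone transform")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("skinning weights must sum to 1")
    out = [0.0, 0.0, 0.0]
    for mat, w in zip(bone_transforms, weights):
        moved = transform(mat, vertex)
        for i in range(3):
            out[i] += w * moved[i]
    return tuple(out)


# Hypothetical example: one bone stays put, the other shifts by +2 along x.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
shift_x = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0]]
skinned = linear_blend_skinning((1.0, 1.0, 1.0), [identity, shift_x], [0.5, 0.5])
```

With equal weights the vertex lands halfway between the two bone-transformed positions, which is how a textured mesh can follow the pose of an underlying skeleton.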

As shown in the figure below, the rendered scene is used to generate a 2D HOI video using a pre-trained video diffusion model. This video, which is conditioned on a text prompt describing the desired interaction, contains the visual information needed to extract hybrid imitation targets. The goal of the method is to learn a policy that can replicate the motion observed in the video, even when the original 3D motion capture data is not available.

The hybrid imitation targets are extracted from the generated 2D video. For the human component, an off-the-shelf monocular motion estimator is applied to recover a 3D SMPL-X human motion sequence. This initial reconstruction is refined through a Visual HOI Alignment optimization process, which jointly aligns the estimated human pose with both the 2D projections in the video and the initial 3D object state. This alignment involves minimizing a composite loss function that includes 2D projection losses for body and hand joints, a temporal consistency loss, and a one-sided Chamfer distance loss to enforce contact between the human and the object. For the object component, a 2D trajectory is constructed by tracking visible object vertices across the video frames using a video tracker, which provides the 2D object reference.
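
The one-sided Chamfer term in the alignment loss can be illustrated with a minimal sketch. The version below averages, over the source points, the squared distance to the nearest target point; the function name, the use of squared distances, and the toy point sets are illustrative assumptions rather than details confirmed by the paper.

```python
def one_sided_chamfer(src, dst):
    """Mean squared distance from each point in src to its nearest point in dst."""
    if not src or not dst:
        raise ValueError("both point sets must be non-empty")
    total = 0.0
    for p in src:
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst)
    return total / len(src)


# Hypothetical hand and object samples (metres, camera frame).
hand_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
object_points = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]

hand_to_object = one_sided_chamfer(hand_points, object_points)
object_to_hand = one_sided_chamfer(object_points, hand_points)
```

The two directions differ: measured from the hand, the distant object point at (5, 5, 5) is ignored, so the term pulls the hand toward the object without requiring the hand to cover the whole object surface.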

The humanoid control policy, $\pi_{\theta}$, is trained to track these hybrid targets. The policy takes as input the current character state $s_t$ (comprising human and object states) and a goal vector $g_t$, which is defined as the future $k$ entities of the 3D human motion reference. The learning objective is to maximize the expected discounted cumulative reward, $J(\theta)$, which is optimized using Proximal Policy Optimization (PPO). The reward function is designed to be a product of three components: a human tracking reward ($R_h$), an object tracking reward ($R_o$), and a contact reward ($R_{\text{contact}}$).
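
A short sketch can make the multiplicative reward and the discounted objective concrete. It assumes each reward component lies in [0, 1] and evaluates the return of a single rollout; the function names and the numeric values are hypothetical, and the PPO update itself is not shown.

```python
def combined_reward(r_human, r_object, r_contact):
    """Product of the three components; any near-zero term suppresses the total."""
    return r_human * r_object * r_contact


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one rollout, accumulated from the last step."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


good_step = combined_reward(0.9, 0.8, 1.0)
ignores_object = combined_reward(1.0, 0.0, 1.0)
rollout_return = discounted_return([1.0, 1.0, 1.0], 0.5)
```

Because the components are multiplied rather than summed, a policy cannot score well by tracking the human motion perfectly while ignoring the object: the second call returns zero even though two of the three terms are at their maximum.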

The human tracking reward encourages the simulated character to match the 3D human motion reference, combining joint position, velocity, and rotation differences, along with a power penalty for smooth and physically plausible actions. The object tracking reward, $R_o$, is defined as an exponential function of the negative squared error between the 2D projection of the simulated object and the 2D object trajectory extracted from the video. The contact reward, $R_{\text{contact}}$, is a product of a contact force reward and a contact distance reward. It is modulated by a binary contact label, $\psi_t$, which is automatically estimated from the video by analyzing the motion of the object vertices and hand joints to infer the timing of contact.
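
The object tracking reward described above can be sketched as follows. The pinhole camera model, the averaging over tracked points, the `scale` coefficient, and all function names are assumptions introduced for illustration; the source only states that the reward is an exponential of the negative squared 2D error.

```python
import math


def project_pinhole(point, focal, cx, cy):
    """Project a camera-frame 3D point (x, y, z) to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal * x / z + cx, focal * y / z + cy)


def object_tracking_reward(sim_points_3d, ref_points_2d, focal, cx, cy, scale):
    """exp(-scale * mean squared pixel error) between projections and references."""
    if len(sim_points_3d) != len(ref_points_2d) or not ref_points_2d:
        raise ValueError("need matching, non-empty point lists")
    sq_err = 0.0
    for p3d, ref in zip(sim_points_3d, ref_points_2d):
        u, v = project_pinhole(p3d, focal, cx, cy)
        sq_err += (u - ref[0]) ** 2 + (v - ref[1]) ** 2
    return math.exp(-scale * sq_err / len(ref_points_2d))


# Hypothetical simulated object vertices and tracked 2D references.
sim_vertices = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
exact_refs = [(50.0, 50.0), (100.0, 50.0)]
shifted_refs = [(60.0, 50.0), (110.0, 50.0)]

reward_exact = object_tracking_reward(sim_vertices, exact_refs, 100.0, 50.0, 50.0, 0.01)
reward_shifted = object_tracking_reward(sim_vertices, shifted_refs, 100.0, 50.0, 50.0, 0.01)
```

The reward equals 1 when the projected vertices coincide with the tracked 2D points and decays smoothly as the pixel error grows, so the policy receives a useful signal without any 6D object pose estimate.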

The actor-critic network architecture for the control policy is illustrated in the figure below. The actor network, which outputs the action, consists of separate Multi-Layer Perceptrons (MLPs) for the human state, object state, and target future pose. These representations are concatenated and passed through a sequence transformer encoder for joint encoding, followed by a multi-layer MLP to produce the action. The critic network, which estimates the value function, takes the same inputs, concatenates them, and passes them through a multi-layer MLP. The policy is trained by updating the actor and critic networks using the PPO algorithm, with the hybrid tracking reward guiding the learning process.

Experiment

The researchers evaluate DeVI by comparing its ability to imitate dexterous human-object interactions against state-of-the-art 3D motion imitation baselines using the GRAB dataset and diverse synthetic video scenarios. The experiments validate that the proposed hybrid imitation target and visual HOI alignment effectively bridge the gap between 2D video cues and 3D physics-based control. Results demonstrate that the framework achieves superior motion accuracy and higher success rates than methods relying on complex 6D poses, while also showing strong text controllability and generalization to multi-object scenes.

The authors compare DeVI with the baselines using success ratios under multiple metrics and thresholds. DeVI achieves higher success ratios than the baselines across all evaluation metrics and thresholds, and it is more accurate in both human and object tracking, particularly under relaxed thresholds. Performance remains robust even though DeVI tracks 2D object trajectories instead of 6D poses, which supports the effectiveness of the hybrid imitation target.

The authors also compare DeVI with several baselines on a human-object interaction dataset, using metrics for human and object motion accuracy. DeVI achieves lower error in human motion tracking and object trajectory, particularly for hand and root joint positions, and higher success rates than the baselines. Ablation studies show that visual HOI alignment significantly improves how well the human motion agrees with both the video frames and the 3D object, which enhances interaction realism. They also show that using 2D object trajectories as a reward performs better than 6D pose tracking, indicating a more effective and practical imitation target.

To isolate the effect of visual HOI alignment, the authors compare results with and without the alignment component. Incorporating the alignment significantly improves the agreement between the estimated 3D human motion and the input video, especially for hand joints. Compared with the variant without alignment, it yields lower human pose tracking error, better contact precision, and a smaller distance between the hands and the object during interaction, which leads to more accurate and plausible hand-object interactions in the simulated motions.

The authors evaluate DeVI against several baselines using human and object motion accuracy metrics to validate its ability to imitate human-object interactions. The results demonstrate that DeVI achieves superior tracking accuracy and higher success rates, particularly in hand and root joint positioning. Ablation studies further confirm that visual HOI alignment significantly enhances interaction realism and contact precision, while the use of 2D object trajectories provides a robust and effective imitation target.

