HyperAIHyperAI

Command Palette

Search for a command to run...

外見的生成を汎用的な対話型ヒューマノイド制御として捉えるExoActor

Yanghao Zhou Jingyu Ma Yibo Peng Zhenguo Sun Yu Bai Börje F. Karlsson

概要

ヒューマノイド制御システムは近年大幅な進歩を遂げているものの、ロボット、その周辺環境、およびタスクに関連する物体間の流暢で対話的な振る舞いをモデル化することは依然として基本的な課題である。この困難さは、広範囲な空間的文脈、時間的動態、ロボットの行動、およびタスクの意図を統合的に捉える必要があることから生じ、これは従来の教師あり学習の枠組みとは相容れない性質を持っている。本研究は、大規模な動画生成モデルの汎化能力を活用してこの問題に対処する、ExoActorという新たなフレームワークを提案する。ExoActorの核心的な洞察は、対話的動態をモデル化するための統一されたインターフェースとして三人称視点の動画生成を利用することである。タスク指示とシーン文脈を入力として、ExoActorはロボット、環境、物体間の協調的な相互作用を暗黙的にエンコードする、妥当な実行プロセスを合成する。この動画出力は、人間の動きを推定し、一般的なモーションコントローラを用いて実行するパイプラインを経て、実行可能なヒューマノイドの行動へと変換され、タスク条件に応じた行動シーケンスが生成される。提案されたフレームワークを検証するため、本システムをエンドツーエンドの実装として構築し、追加の現実世界データ収集なしに新しいシナリオへの汎化能力を実証した。さらに、本研究は現在の実装における限界について議論し、将来の研究への有望な方向性を概説することで、ExoActorがいかにして対話的豊かなヒューマノイド行動をモデル化するためのスケーラブルなアプローチを提供し、生成モデルが汎用ヒューマノイド知能の進展のために新たな道を開く可能性があるかを示している。

One-sentence Summary

The authors propose ExoActor, a framework that leverages exocentric video generation as a unified interface to implicitly encode coordinated robot-environment-object interactions, converting synthesized execution videos into executable humanoid behaviors via human motion estimation and a general motion controller to demonstrate generalization to new scenarios without additional real-world data collection.

Key Contributions

  • ExoActor is an end-to-end framework that leverages large-scale video generation models to capture spatial, temporal, and task-driven dynamics between humanoid robots, environments, and objects.
  • The method employs third-person video generation as a unified interface to synthesize execution processes that implicitly encode coordinated interactions, which are subsequently converted into executable behaviors through a pipeline combining human motion estimation and a general motion controller.
  • The implemented system demonstrates generalization to novel scenarios without requiring additional real-world data collection.

Introduction

Humanoid robots must execute fluent, interaction-rich behaviors across unstructured environments to enable general-purpose intelligence. Prior approaches struggle to jointly model spatial context, temporal dynamics, and task intent at scale, often relying on costly real-world data collection, domain-specific tuning, or curated demonstrations that fail to generalize outside controlled settings. To overcome these bottlenecks, the authors leverage large-scale third-person video generation as a unified interface for modeling complex interaction dynamics. Their ExoActor framework synthesizes plausible execution videos from task instructions and scene context, then translates these imagined demonstrations into executable whole-body behaviors through a unified motion estimation and tracking pipeline. This design decouples high-level interaction planning from low-level control, allowing humanoid systems to adapt to novel scenarios without additional real-world data collection.

Dataset

  • Dataset composition and sources: The authors do not specify the dataset composition or data sources in the provided excerpt.
  • Key details for each subset: No information is given regarding subset sizes, origins, or filtering rules.
  • Model usage and training configuration: The excerpt does not describe how the authors split the data, set mixture ratios, or process it for training.
  • Processing and metadata construction: No cropping strategies, metadata construction steps, or additional preprocessing details are mentioned.

Method

The ExoActor framework operates as a unified pipeline that translates high-level task instructions into executable humanoid behaviors by leveraging third-person video generation as an intermediate representation. The overall system is structured into three primary stages: video generation, motion estimation, and motion execution. As shown in the framework diagram, the process begins with an initial third-person observation of the robot in the environment and a task instruction. This information is first processed through a task-to-action decomposition module, which converts the abstract goal into a temporally ordered sequence of atomic actions, providing a structured plan for the subsequent stages. The decomposed action chain, combined with the initial observation, serves as a scene- and task-aware description that guides the video generation process.

Refer to the framework diagram to understand the flow of information through the system. The next stage involves transforming the robot-centered observation into a human-compatible representation, a step essential for overcoming the embodiment mismatch between the robot and the human-centric priors of video generation models. This robot-to-human embodiment transfer, illustrated in Figure 3, converts the robot into a human subject while strictly preserving the original scene layout, camera viewpoint, body pose, orientation, scale, and robot-body proportions. This transformation ensures that the input to the video generation model aligns with its learned data distribution, thereby improving temporal consistency, reducing visual artifacts, and enhancing motion estimation accuracy. The transfer is implemented using a prompt-based image editing interface, which enforces the necessary constraints on pose, orientation, and scene preservation.

Following the embodiment transfer, the system generates a sequence of third-person action videos that are both task-consistent and compatible with the embodiment constraints of the target robot. This is achieved through a structured prompt template that explicitly encodes scene constraints, motion requirements, and task execution details. The prompt takes the initial observation and the enriched action description as inputs, organizing them into structured fields such as Shot, Scene, Motion, Execution, and End State. These fields act as explicit constraints that enforce a fixed camera viewpoint, preserve scene geometry, and encourage natural, physically plausible, and robot-aligned motion patterns, thereby reducing hallucinations and improving temporal consistency. The video generation model used is Kling, chosen for its superior stability and consistency.

The generated videos are then fed into an interaction-aware whole-body motion estimation module to recover the 3D kinematic trajectories of the human actor. This stage aims to bridge the gap between pixel-level video synthesis and structured robot control. The authors employ GENMO, a diffusion-based model, to perform whole-body motion estimation, which formulates motion estimation as a constrained generation process. Instead of direct pose regression, the model leverages video features and 2D keypoints as conditioning signals to produce temporally consistent and physically plausible 3D motion sequences. The estimated motion is represented using SMPL parameters, including joint rotations and global positions, and the model performs temporal in-filling for partially occluded frames, producing a smooth and physically consistent motion sequence. This interaction-aware motion serves as a structured representation of the generated videos and provides a reliable interface for downstream robot execution commands.

To capture the fine-grained manipulation required for object interaction, the system applies WiLoR frame by frame to the generated third-person video to estimate bilateral hand poses. This hand motion estimation, illustrated in Figure 5, tracks the left and right hand poses at the original video frame rate, yielding one prediction per video frame. Each hand is assigned a discrete interaction state, corresponding to open, half-open, and closed grasp states. The smoothed hand poses and interaction states are synchronized with the whole-body motion and mapped frame-wise to robot end-effector commands. The final interaction-aware motion representation combines the global body motion, bilateral hand poses, and discrete manipulation states, providing a comprehensive motion trajectory.

The final stage of the pipeline is the general motion tracking deployment, where the estimated motions are transformed into physically consistent and dynamically feasible control policies. The system leverages SONIC as its motion tracking controller, which takes the current robot state and a temporal window of reference motions as input. By utilizing SONIC's scaled-up architecture, the system effectively "physics-filters" the "imagined" demonstrations from the video generation model, ensuring dynamic stability and stylistic fidelity without requiring task-specific reward engineering. For hand motion deployment, the estimated hand states are mapped to Unitree's Dex3-1-compatible 7-DoF joint targets and streamed together with the SMPL body trajectory through a queue input interface to the robot. This allows the robot to execute the generated behaviors in a physically realistic manner.

Experiment

The framework is evaluated through real-world zero-shot tasks spanning basic navigation to fine-grained multi-step manipulation, validating its capacity to translate video-driven prompts into stable whole-body humanoid behaviors. Qualitative analysis indicates that while the system reliably executes complex locomotion and interaction sequences, it encounters inherent limitations in video generation fidelity, motion estimation under occlusion, and precise height alignment during object manipulation. Ablation studies further demonstrate that bypassing intermediate retargeting preserves critical spatial accuracy, task-aware camera perspectives optimize performance, and selecting specific generative and estimation models yields the most robust trade-offs between physical plausibility and computational efficiency.

The authors evaluate ExoActor through a series of real-world experiments designed to assess its ability to generate and execute interaction-rich humanoid behaviors across tasks of increasing difficulty. The system's performance is analyzed in terms of computational efficiency and the time required for different modules, with a focus on video generation, motion estimation, and task decomposition. The system's overall performance is influenced by the time required for video generation and motion estimation, with hand motion estimation being the most time-consuming module. Task decomposition and prompt construction are relatively fast compared to other modules, indicating efficient processing for high-level task planning. Video generation and whole-body motion estimation require more time per video, suggesting higher computational demands for these stages.

The authors evaluate ExoActor through a series of real-world experiments designed to validate its capacity to generate and execute complex humanoid interactions across tasks of escalating difficulty. Performance analysis reveals that while high-level task planning remains computationally efficient, hand motion estimation serves as the primary processing bottleneck. Ultimately, the findings indicate that generating detailed video sequences and whole-body movements demands significantly more resources, highlighting a clear trade-off between behavioral complexity and system efficiency.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています