DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Overview

The ability to simulate the outcomes of actions across diverse environments would revolutionize the development of generalist agents at scale. However, modeling world dynamics, especially for dexterous robot manipulation tasks, faces major challenges in limited data coverage and scarce action labels. To address these challenges, this work proposes DreamDojo, a foundation world model that learns diverse interactions and dexterous control from 44,000 hours of egocentric human videos. The resulting data mixture is the largest video dataset used for world model pretraining to date, broadly covering everyday scenes with diverse objects and skills. To cope with the scarcity of action labels, continuous latent actions are introduced as unified proxy actions, enabling effective transfer of interaction knowledge from unlabeled videos. After post-training on a small amount of target robot data, DreamDojo demonstrates an understanding of physical laws and precise action controllability. A distillation pipeline further enables real-time operation (10.81 FPS) while additionally improving contextual consistency. These capabilities unlock several key applications of the generative world model, including real-time teleoperation, policy evaluation, and model-based planning. Systematic evaluations on multiple challenging out-of-distribution (OOD) benchmarks confirm the method's value for simulating open-world, contact-rich tasks, charting a path toward world models for generalist robots.

One-sentence Summary

Researchers from NVIDIA, HKUST, UC Berkeley, and others propose DreamDojo, a foundation world model trained on 44k hours of egocentric videos, using latent actions to overcome label scarcity and enabling real-time, physics-aware robotic simulation for teleoperation and planning in open-world tasks.

Key Contributions

  • DreamDojo is a foundation world model pretrained on 44k hours of egocentric human videos, the largest dataset to date for this task, enabling zero-shot generalization to unseen objects and environments by leveraging consistent physics across human and robot interactions.
  • To overcome the scarcity of action labels, it introduces continuous latent actions as unified proxy actions, allowing self-supervised learning of fine-grained controllability and physics from unlabeled videos, which significantly improves transfer to dexterous robot tasks after minimal post-training.
  • A novel distillation pipeline accelerates the model to 10.81 FPS at 640×480 resolution while enhancing long-horizon consistency, enabling real-time applications like teleoperation and model-based planning, validated on challenging out-of-distribution benchmarks.

Introduction

The authors leverage large-scale human video data to train DreamDojo, a foundation world model capable of simulating dexterous robotic tasks in open, unseen environments. Prior video world models struggle with high-dimensional robot actions and limited, expert-only datasets, which restrict generalization and counterfactual reasoning. To overcome sparse action labels, they introduce continuous latent actions as unified proxies, enabling scalable, self-supervised learning of physics and controllability from 44k hours of egocentric human videos—the largest such dataset to date. Their distillation pipeline accelerates inference to 10.81 FPS while preserving visual quality and long-horizon consistency, unlocking real-time applications like teleoperation and model-based planning.

Dataset

  • The authors use DreamDojo-HV, a 44,711-hour egocentric video dataset, to pretrain a world model capable of generalizing across objects, tasks, and environments. It is the largest human interaction dataset for this purpose to date.

  • The dataset combines three sources:

    • In-lab: Lab-collected tabletop interactions using Manus gloves and Vive trackers for precise hand pose capture; enables direct retargeting to GR-1 robot actions and includes novel objects/verbs.
    • EgoDex: 829 hours of public egocentric videos from Apple Vision Pro, with 3D hand/finger tracking and diverse household objects to expand object variety.
    • DreamDojo-HV (in-house): Crowdsourced videos covering loco-manipulation skills across household, industrial, retail, educational, and administrative settings; each clip includes task text annotations.
  • The full dataset includes 9,869 unique scenes, 6,015 unique tasks, and 43,237 unique objects — 15x longer, 96x more skills, and 2,000x more scenes than prior world model datasets.

  • For training, the authors mix these subsets; explicit mixture ratios are not reported, but scale and diversity are emphasized as the main drivers of performance gains. No cropping strategy is described; metadata consists of per-clip task text annotations together with scene, task, and object identifiers.

  • The dataset enables real-time future prediction with continuous actions, supports teleoperation, and allows online model-based planning without real-world deployment.

Method

The authors leverage a three-phase training pipeline to build DreamDojo, a world model capable of simulating robot interactions from human video data and adapting to target embodiments. The overall framework begins with pretraining on diverse egocentric human datasets, followed by post-training on robot-specific data, and concludes with a distillation stage to enable real-time autoregressive generation. Refer to the framework diagram for a high-level overview of this pipeline.

At the core of the architecture is the Cosmos-Predict2.5 model, a latent video diffusion model operating in the continuous latent space produced by the WAN2.2 tokenizer. The model conditions on text, frames, and actions via cross-attention and adaptive layer normalization, trained using flow matching loss. To enhance action controllability — critical for robotic applications — the authors introduce two key architectural modifications. First, they transform absolute robot joint poses into relative actions by rebaselining each latent frame’s input to the pose at its start (every 4 timesteps), which reduces modeling complexity and improves generalization. Second, to respect causality, they inject actions into latent frames in chunks of 4 consecutive actions, rather than as a global condition, thereby eliminating future action leakage and improving learning efficiency.
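The relative-action and chunked-injection logic can be sketched in a few lines. The snippet below is a minimal illustration rather than the authors' implementation: the chunk size of 4 follows the description above, while the array shapes, the pose representation, and the function names are assumptions made for clarity.

```python
import numpy as np

CHUNK = 4  # each latent frame covers 4 timesteps, as described above


def rebaseline_actions(abs_poses: np.ndarray) -> np.ndarray:
    """Convert absolute poses (T, D) into relative actions by expressing every
    pose within a chunk relative to the pose at that chunk's first timestep."""
    rel = np.empty_like(abs_poses)
    for start in range(0, abs_poses.shape[0], CHUNK):
        base = abs_poses[start]  # pose at the start of this latent frame
        rel[start:start + CHUNK] = abs_poses[start:start + CHUNK] - base
    return rel


def chunk_actions(rel_actions: np.ndarray) -> np.ndarray:
    """Group relative actions into chunks of CHUNK consecutive actions, so that
    each latent frame is conditioned only on its own actions (no future leakage)."""
    T, D = rel_actions.shape
    assert T % CHUNK == 0, "pad or trim the trajectory to a multiple of CHUNK"
    return rel_actions.reshape(T // CHUNK, CHUNK * D)


# Toy usage: 16 timesteps of a hypothetical 7-DoF pose trajectory.
poses = np.cumsum(np.random.randn(16, 7) * 0.01, axis=0)
per_frame_actions = chunk_actions(rebaseline_actions(poses))
print(per_frame_actions.shape)  # (4, 28): one action vector per latent frame
```

Chunking this way keeps the conditioning causal: the action vector attached to a latent frame never contains information about actions taken after that frame.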

To enable pretraining on unlabeled human videos, the authors introduce a latent action model based on a spatiotemporal Transformer VAE. This model extracts compact, disentangled action representations from consecutive video frames $f^{t:t+1}$, producing a latent action $\hat{a}_t$ that the decoder uses, along with $f^t$, to reconstruct $f^{t+1}$. The training objective combines a reconstruction loss and KL regularization to enforce an information bottleneck, ensuring the latent encodes only the most critical motion. The authors show qualitatively that this latent action model successfully captures human actions and enables cross-embodiment generalization, allowing the same action representation to be used across diverse robotic platforms.
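A minimal sketch of that objective is shown below, assuming generic MLP encoder/decoder heads operating on precomputed frame features; the actual model uses spatiotemporal Transformer layers, and the latent dimension and KL weight here are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionVAE(nn.Module):
    """Toy stand-in for the latent action model: encode (f_t, f_{t+1}) into a
    small latent action, then reconstruct f_{t+1} from (f_t, action)."""

    def __init__(self, frame_dim: int = 512, action_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.GELU(), nn.Linear(256, 2 * action_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 256), nn.GELU(), nn.Linear(256, frame_dim)
        )

    def forward(self, f_t, f_next):
        mu, logvar = self.encoder(torch.cat([f_t, f_next], dim=-1)).chunk(2, dim=-1)
        action = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.decoder(torch.cat([f_t, action], dim=-1))
        recon_loss = F.mse_loss(recon, f_next)
        # KL toward a unit Gaussian acts as the information bottleneck described above.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + 0.01 * kl, action  # 0.01 is an illustrative KL weight


# Usage on random frame features:
model = LatentActionVAE()
loss, latent_action = model(torch.randn(8, 512), torch.randn(8, 512))
loss.backward()
```

The KL term is what implements the bottleneck: the decoder can only rely on whatever motion information survives compression into the small latent action.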

During pretraining, the authors condition each latent frame on chunked latent actions projected via a lightweight MLP, initialized with zero weights to preserve pretrained physics knowledge. They further enhance temporal coherence by augmenting the standard flow matching loss with a temporal consistency loss that penalizes mismatches between the differences of consecutive latent frames and the differences of their predicted velocities:

$$\mathcal{L}_{\mathrm{temporal}}(\theta) = \mathbb{E}\left[\sum_{i=1}^{K-1} \left\| \left(z^{i+1} - z^{i}\right) - \left(v^{i+1} - v^{i}\right) \right\|^{2}\right].$$

The final training objective is a weighted sum of the flow matching and temporal consistency losses:

$$\mathcal{L}_{\mathrm{final}}(\theta) = \mathcal{L}_{\mathrm{flow}}(\theta) + \lambda\,\mathcal{L}_{\mathrm{temporal}}(\theta),$$

with $\lambda = 0.1$ in practice.
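A schematic implementation of the combined objective is given below, assuming latent and velocity tensors of shape (batch, K, dim); the flow matching term is reduced to a plain velocity-regression MSE and omits the time sampling and conditioning details of the actual training loop.

```python
import torch
import torch.nn.functional as F


def temporal_consistency_loss(z: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Penalize mismatch between latent-frame differences and velocity differences,
    i.e. sum_i || (z^{i+1} - z^i) - (v^{i+1} - v^i) ||^2, averaged over the batch."""
    dz = z[:, 1:] - z[:, :-1]  # differences between consecutive latent frames
    dv = v[:, 1:] - v[:, :-1]  # differences between consecutive velocities
    return ((dz - dv) ** 2).sum(dim=-1).sum(dim=-1).mean()


def training_loss(pred_velocity, target_velocity, latents, lam: float = 0.1):
    """L_final = L_flow + lambda * L_temporal, with lambda = 0.1 as stated above."""
    flow = F.mse_loss(pred_velocity, target_velocity)  # flow matching stand-in
    temporal = temporal_consistency_loss(latents, pred_velocity)
    return flow + lam * temporal


# Toy shapes: batch of 2 rollouts, K = 8 latent frames, 64-dim latents.
z = torch.randn(2, 8, 64)
v_pred = torch.randn(2, 8, 64, requires_grad=True)
v_target = torch.randn(2, 8, 64)
training_loss(v_pred, v_target, z).backward()
```

The temporal term only compares differences between neighboring latent frames, so it regularizes frame-to-frame motion without constraining the absolute content of any single frame.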

For post-training on target robots, the authors reinitialize the first layer of the action MLP to match the robot’s action space and finetune the entire model. This stage enables adaptation to specific embodiments — such as GR-1, G1, AgiBot, and YAM — using only limited robot data, while preserving the generalization benefits of pretraining.
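In practice this adaptation only touches the input layer of the action MLP; the sketch below illustrates the idea, with the layer sizes and the 29-DoF target action space chosen purely as placeholders.

```python
import torch.nn as nn


def adapt_action_mlp(action_mlp: nn.Sequential, robot_action_dim: int) -> nn.Sequential:
    """Reinitialize only the first linear layer so its input size matches the target
    robot's action space; remaining layers keep their pretrained weights and are
    finetuned together with the rest of the world model."""
    old_first = action_mlp[0]
    assert isinstance(old_first, nn.Linear)
    action_mlp[0] = nn.Linear(robot_action_dim, old_first.out_features)
    return action_mlp


# Pretrained MLP took 16-dim latent actions; the target robot exposes 29-DoF actions.
mlp = nn.Sequential(nn.Linear(16, 256), nn.GELU(), nn.Linear(256, 1024))
mlp = adapt_action_mlp(mlp, robot_action_dim=29)
```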

Finally, to enable real-time applications like live teleoperation and model-based planning, the authors distill the foundation model into an autoregressive student model. This involves replacing bidirectional attention with causal attention and reducing the denoising steps from 50 to 4. The distillation proceeds in two stages: a warmup phase where the student regresses to teacher-generated ODE trajectories, and a distillation phase where the student generates autoregressively and is supervised via a KL-based distribution matching loss. To mitigate compounding error, the student is trained to generate longer rollouts than the teacher, with supervision applied via randomly sampled windows.
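The windowed supervision over long student rollouts can be illustrated schematically. Everything in the snippet below is a placeholder: the dummy denoiser, the stand-in teacher objective, and the horizon lengths are assumptions, and the real pipeline uses the ODE-regression warmup and KL-based distribution matching loss described above.

```python
import torch

TEACHER_HORIZON = 8   # latent frames the teacher supervises at once (illustrative)
STUDENT_ROLLOUT = 16  # student rolls out further to expose compounding error


def student_rollout(context: torch.Tensor, steps: int) -> torch.Tensor:
    """Placeholder for the causal-attention student: autoregressively produce
    `steps` latent frames, each refined with only 4 denoising steps."""
    frames = []
    frame = context
    for _ in range(steps):
        for _ in range(4):             # 4 denoising steps instead of 50
            frame = frame - 0.1 * frame  # dummy update standing in for the denoiser
        frames.append(frame)
    return torch.stack(frames, dim=1)  # (batch, steps, dim)


def windowed_distillation_loss(student_frames, teacher_score_fn):
    """Supervise a randomly sampled window of the long student rollout with the
    teacher, so every position in the rollout eventually receives feedback."""
    start = torch.randint(0, STUDENT_ROLLOUT - TEACHER_HORIZON + 1, (1,)).item()
    window = student_frames[:, start:start + TEACHER_HORIZON]
    return teacher_score_fn(window)


# Toy usage with a stand-in teacher objective (mean squared magnitude).
ctx = torch.randn(2, 64)
rollout = student_rollout(ctx, STUDENT_ROLLOUT)
loss = windowed_distillation_loss(rollout, lambda w: (w ** 2).mean())
```

Because the window start is sampled at random, later positions in the student's rollout, where compounding error is worst, also receive teacher feedback.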

Experiment

  • Latent action conditioning enables significantly better transfer from human videos compared to action-free pretraining, nearly matching ideal ground-truth action settings.
  • Incorporating diverse human datasets during pretraining improves generalization to novel physical interactions and counterfactual actions.
  • Architectural enhancements—including relative actions, chunked injection, and temporal consistency loss—substantially boost simulation accuracy and action controllability.
  • The distillation pipeline yields a lightweight, real-time model that maintains high fidelity over long-horizon rollouts while offering better context awareness and robustness to occlusions.
  • DreamDojo demonstrates strong out-of-distribution generalization, especially in edited or novel environments, validated through human preference evaluations.
  • Downstream applications show DreamDojo reliably evaluates policies, enables effective model-based planning with significant performance gains, and supports real-time teleoperation.

Results show that incorporating more diverse human video datasets during pretraining consistently improves performance across out-of-distribution scenarios and counterfactual actions. The DreamDojo-14B variant achieves the strongest overall scores, particularly in SSIM and LPIPS metrics, indicating better structural and perceptual fidelity in generated videos. Adding the DreamDojo-HV dataset to the pretraining mixture further boosts generalization, especially on novel interactions not seen in robot training data.

The authors use latent action conditioning to bridge the performance gap between action-free pretraining and ideal ground-truth action setups, achieving near-parity with retargeted and MANO-based action conditioning in simulation quality. Results show that pretraining with latent actions significantly improves transfer from human videos, particularly in out-of-distribution scenarios, while remaining scalable and practical without requiring specialized motion-capture hardware for action labeling.

The authors evaluate architectural and loss design choices by incrementally applying relative actions, chunked injection, and temporal consistency loss to a base model. Results show that each modification contributes to improved simulation quality on both expert and counterfactual trajectories, with the full combination achieving the best performance across all metrics. This confirms that precise action controllability and temporal coherence are critical for accurate dynamics prediction.


