DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

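Abstract

Being able to simulate the outcomes of actions in varied environments would transform the development of generalist agents at scale. However, modeling world dynamics, especially for dexterous robotic tasks, is challenging because of limited data coverage and scarce action labels. As a step toward this goal, we propose DreamDojo, a foundation world model that learns diverse interactions and dexterous control from 44,000 hours of egocentric human videos. Our data mixture is the largest video dataset to date for world model pretraining and covers everyday scenarios with a wide variety of objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, which strengthens the transfer of interaction knowledge from unlabeled videos. After post-training on a small amount of target robot data, DreamDojo shows a strong understanding of physics and precise action controllability. We also design a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications of generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on several challenging out-of-distribution (OOD) benchmarks confirms the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.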

One-sentence Summary

Researchers from NVIDIA, HKUST, UC Berkeley, and others propose DreamDojo, a foundation world model trained on 44k hours of egocentric human videos that uses latent actions to overcome label scarcity and enables real-time, physics-aware robotic simulation for teleoperation and planning in open-world tasks.

Key Contributions

  • DreamDojo is a foundation world model pretrained on 44k hours of egocentric human videos, the largest dataset to date for this task, enabling zero-shot generalization to unseen objects and environments by leveraging consistent physics across human and robot interactions.
  • To overcome the scarcity of action labels, it introduces continuous latent actions as unified proxy actions, allowing self-supervised learning of fine-grained controllability and physics from unlabeled videos, which significantly improves transfer to dexterous robot tasks after minimal post-training.
  • A novel distillation pipeline accelerates the model to 10.81 FPS at 640×480 resolution while enhancing long-horizon consistency, enabling real-time applications like teleoperation and model-based planning, validated on challenging out-of-distribution benchmarks.

Introduction

The authors leverage large-scale human video data to train DreamDojo, a foundation world model capable of simulating dexterous robotic tasks in open, unseen environments. Prior video world models struggle with high-dimensional robot actions and with limited, expert-only datasets, which restrict generalization and counterfactual reasoning. To overcome sparse action labels, the authors introduce continuous latent actions as unified proxies, enabling scalable, self-supervised learning of physics and controllability from 44k hours of egocentric human videos, the largest such dataset to date. Their distillation pipeline accelerates inference to 10.81 FPS while preserving visual quality and long-horizon consistency, unlocking real-time applications such as teleoperation and model-based planning.

Dataset

  • The authors use DreamDojo-HV, a 44,711-hour egocentric video dataset, to pretrain a world model capable of generalizing across objects, tasks, and environments. It is the largest human interaction dataset for this purpose to date.

  • The dataset combines three sources:

    • In-lab: Lab-collected tabletop interactions using Manus gloves and Vive trackers for precise hand pose capture; enables direct retargeting to GR-1 robot actions and includes novel objects/verbs.
    • EgoDex: 829 hours of public egocentric videos from Apple Vision Pro, with 3D hand/finger tracking and diverse household objects to expand object variety.
    • DreamDojo-HV (in-house): Crowdsourced videos covering loco-manipulation skills across household, industrial, retail, educational, and administrative settings; each clip includes task text annotations.
  • The full dataset includes 9,869 unique scenes, 6,015 unique tasks, and 43,237 unique objects — 15x longer, 96x more skills, and 2,000x more scenes than prior world model datasets.

  • For training, the authors mix these subsets. No explicit mixing ratios are given, but the authors emphasize that scale and diversity drive the performance gains. No cropping strategy is described; per-clip metadata consists of task text annotations plus scene, task, and object identifiers.

  • The dataset enables real-time future prediction with continuous actions, supports teleoperation, and allows online model-based planning without real-world deployment.

Method

The authors leverage a three-phase training pipeline to build DreamDojo, a world model capable of simulating robot interactions from human video data and adapting to target embodiments. The overall framework begins with pretraining on diverse egocentric human datasets, followed by post-training on robot-specific data, and concludes with a distillation stage to enable real-time autoregressive generation. Refer to the framework diagram for a high-level overview of this pipeline.

At the core of the architecture is the Cosmos-Predict2.5 model, a latent video diffusion model operating in the continuous latent space produced by the WAN2.2 tokenizer. The model conditions on text, frames, and actions via cross-attention and adaptive layer normalization, and is trained with a flow matching loss. To enhance action controllability, which is critical for robotic applications, the authors introduce two key architectural modifications. First, they transform absolute robot joint poses into relative actions by rebaselining each latent frame's input to the pose at its start (every 4 timesteps), which reduces modeling complexity and improves generalization. Second, to respect causality, they inject actions into latent frames in chunks of 4 consecutive actions rather than as a global condition, thereby eliminating future action leakage and improving learning efficiency.
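The relative-action transform described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `to_relative_chunks`, the array layout, and the use of plain subtraction (reasonable for joint angles, not for general orientations) are hypothetical and are not taken from the authors' code. Only the chunk size of 4 comes from the text.

```python
import numpy as np

CHUNK = 4  # actions per latent frame, as stated in the text


def to_relative_chunks(abs_poses: np.ndarray, chunk: int = CHUNK) -> np.ndarray:
    """Convert absolute poses of shape (T + 1, D) into relative action chunks.

    abs_poses[0] is the starting pose and abs_poses[1:] are the T target poses.
    Returns an array of shape (T // chunk, chunk, D) in which every chunk is
    expressed relative to the pose at the start of that chunk.
    """
    num_steps = abs_poses.shape[0] - 1
    if num_steps % chunk != 0:
        raise ValueError("number of action steps must be a multiple of the chunk size")
    num_chunks = num_steps // chunk
    # Pose at the start of each chunk: indices 0, chunk, 2 * chunk, ...
    base = abs_poses[0:num_steps:chunk]
    # Target poses grouped chunk by chunk.
    targets = abs_poses[1:].reshape(num_chunks, chunk, -1)
    return targets - base[:, None, :]


if __name__ == "__main__":
    # A 1-D "joint" moving one unit per step for 8 steps.
    poses = np.arange(9, dtype=float)[:, None]
    print(to_relative_chunks(poses)[..., 0])
```

Because each chunk is re-expressed from its own starting pose, two chunks that perform the same motion from different absolute positions produce identical inputs, which is one way to read the claim that relative actions reduce modeling complexity.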

To enable pretraining on unlabeled human videos, the authors introduce a latent action model based on a spatiotemporal Transformer VAE. This model extracts compact, disentangled action representations from consecutive video frames $f^{t:t+1}$, producing a latent action $\hat{a}_t$ that the decoder uses, together with $f^t$, to reconstruct $f^{t+1}$. The training objective combines reconstruction loss and KL regularization to enforce an information bottleneck, ensuring the latent encodes only the most critical motion. As shown in the figure below, this latent action model successfully captures human actions and enables cross-embodiment generalization, allowing the same action representation to be used across diverse robotic platforms.
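As a rough illustration of the bottleneck objective, the sketch below combines a mean-squared reconstruction error with the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. The choice of MSE, the standard normal prior, and the weight `beta` are assumptions made for this example; the text states only that reconstruction loss and KL regularization are combined.

```python
import numpy as np


def gaussian_kl(mu: np.ndarray, log_var: np.ndarray) -> float:
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), averaged over the batch."""
    kl_per_sample = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    return float(np.mean(kl_per_sample))


def latent_action_loss(pred_next, true_next, mu, log_var, beta=1e-2):
    """Reconstruction error on the next frame plus a weighted KL penalty.

    beta is a hypothetical weight; a larger value tightens the bottleneck
    and leaves less room for the latent action to carry appearance detail.
    """
    recon = float(np.mean((pred_next - true_next) ** 2))
    return recon + beta * gaussian_kl(mu, log_var)
```

The KL term is what pushes the latent action toward carrying only what the decoder cannot already infer from the current frame, which is the intuition behind "encodes only the most critical motion".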

During pretraining, the authors condition each latent frame on chunked latent actions projected via a lightweight MLP, initialized with zero weights to preserve pretrained physics knowledge. They further enhance temporal coherence by augmenting the standard flow matching loss with a temporal consistency loss that penalizes discrepancies in velocity differences between consecutive latent frames:

$$\mathcal{L}_{\mathrm{temporal}}(\theta) = \mathbb{E}\Big[\sum_{i=1}^{K-1} \big\| (z^{i+1} - z^{i}) - (v^{i+1} - v^{i}) \big\|^{2}\Big].$$

The final training objective is a weighted sum of the flow matching and temporal consistency losses:

$$\mathcal{L}_{\mathrm{final}}(\theta) = \mathcal{L}_{\mathrm{flow}}(\theta) + \lambda \, \mathcal{L}_{\mathrm{temporal}}(\theta),$$

with $\lambda = 0.1$ in practice.
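The combined objective can be sketched for a single sample as follows. The text does not define the symbols, so this example assumes that $z^i$ is the model's predicted velocity for latent frame $i$ and $v^i$ is the corresponding flow-matching target velocity. The function names are hypothetical, the flow matching term is written as a plain mean-squared error, and the expectation over the batch is omitted.

```python
import numpy as np


def flow_matching_loss(pred_v: np.ndarray, target_v: np.ndarray) -> float:
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((pred_v - target_v) ** 2))


def temporal_consistency_loss(pred_v: np.ndarray, target_v: np.ndarray) -> float:
    """Sum over consecutive latent frames of the squared norm of the
    mismatch between predicted and target frame-to-frame differences.

    pred_v and target_v have shape (K, ...), one entry per latent frame.
    """
    num_frames = pred_v.shape[0]
    d_pred = pred_v[1:] - pred_v[:-1]
    d_target = target_v[1:] - target_v[:-1]
    diff = (d_pred - d_target).reshape(num_frames - 1, -1)
    return float(np.sum(diff**2))


def final_loss(pred_v: np.ndarray, target_v: np.ndarray, lam: float = 0.1) -> float:
    """Weighted sum of the two terms, with lam = 0.1 as stated in the text."""
    return flow_matching_loss(pred_v, target_v) + lam * temporal_consistency_loss(pred_v, target_v)
```

Note that a prediction that is off by the same constant in every frame incurs no temporal penalty, because the frame-to-frame differences still match. The temporal term therefore targets flicker and drift between frames rather than per-frame error, which the flow matching term already covers.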

For post-training on target robots, the authors reinitialize the first layer of the action MLP to match the robot’s action space and finetune the entire model. This stage enables adaptation to specific embodiments — such as GR-1, G1, AgiBot, and YAM — using only limited robot data, while preserving the generalization benefits of pretraining.
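A toy version of the action MLP makes the two initialization steps concrete. Everything here is hypothetical: the class `ActionMLP`, the two-layer ReLU design, and all dimensions are invented for illustration. The text says the MLP is "initialized with zero weights"; this sketch assumes that only the output layer is zeroed, a common way to make a new conditioning branch start as a no-op while still allowing gradients to flow.

```python
import numpy as np


class ActionMLP:
    """Two-layer MLP mapping an action chunk to a conditioning vector."""

    def __init__(self, action_dim: int, hidden_dim: int, cond_dim: int, rng):
        self.w1 = rng.normal(0.0, 0.02, size=(action_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        # Zero-initialised output layer (assumption): the conditioning signal
        # is exactly zero at the start, so the pretrained backbone is unchanged.
        self.w2 = np.zeros((hidden_dim, cond_dim))
        self.b2 = np.zeros(cond_dim)

    def __call__(self, actions: np.ndarray) -> np.ndarray:
        hidden = np.maximum(actions @ self.w1 + self.b1, 0.0)  # ReLU
        return hidden @ self.w2 + self.b2

    def reinit_first_layer(self, new_action_dim: int, rng) -> None:
        """Swap the input layer for a new action space; keep the rest."""
        hidden_dim = self.w1.shape[1]
        self.w1 = rng.normal(0.0, 0.02, size=(new_action_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mlp = ActionMLP(action_dim=32, hidden_dim=16, cond_dim=8, rng=rng)
    print(np.abs(mlp(rng.normal(size=(5, 32)))).max())  # no-op at initialisation
    mlp.reinit_first_layer(new_action_dim=7, rng=rng)   # hypothetical robot action size
    print(mlp(rng.normal(size=(5, 7))).shape)
```

Replacing only the input layer keeps whatever the later layers learned during pretraining, while letting the model accept an action vector of a different size for each target robot.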

Finally, to enable real-time applications like live teleoperation and model-based planning, the authors distill the foundation model into an autoregressive student model. This involves replacing bidirectional attention with causal attention and reducing the denoising steps from 50 to 4. The distillation proceeds in two stages: a warmup phase where the student regresses to teacher-generated ODE trajectories, and a distillation phase where the student generates autoregressively and is supervised via a KL-based distribution matching loss. To mitigate compounding error, the student is trained to generate longer rollouts than the teacher, with supervision applied via randomly sampled windows.
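Two ingredients of the distillation stage are easy to show in isolation: the causal attention mask that replaces bidirectional attention, and the random window used to supervise part of a long student rollout. The sketch below is a simplified illustration with hypothetical function names. It works at frame granularity, whereas a real model would likely expand the mask to all tokens within each frame, and the rollout and window lengths are made-up values.

```python
import numpy as np


def causal_mask(num_frames: int) -> np.ndarray:
    """Boolean (num_frames, num_frames) mask.

    Entry [i, j] is True when frame i may attend to frame j, that is,
    when j is not in the future of i.
    """
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))


def sample_window(rollout_len: int, window_len: int, rng) -> slice:
    """Pick a random contiguous window of a long student rollout to supervise."""
    if window_len > rollout_len:
        raise ValueError("window cannot be longer than the rollout")
    start = int(rng.integers(0, rollout_len - window_len + 1))
    return slice(start, start + window_len)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(causal_mask(4).astype(int))
    # Hypothetical lengths: the student rolls out 20 latent frames and a
    # window of 8 frames is supervised.
    window = sample_window(rollout_len=20, window_len=8, rng=rng)
    print(window.start, window.stop)
```

Sampling windows from anywhere in the rollout means the student is also supervised on frames that follow its own earlier, possibly imperfect, outputs, which is how this design addresses compounding error.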

Experiment

  • Latent action conditioning enables significantly better transfer from human videos compared to action-free pretraining, nearly matching ideal ground-truth action settings.
  • Incorporating diverse human datasets during pretraining improves generalization to novel physical interactions and counterfactual actions.
  • Architectural enhancements—including relative actions, chunked injection, and temporal consistency loss—substantially boost simulation accuracy and action controllability.
  • The distillation pipeline yields a lightweight, real-time model that maintains high fidelity over long-horizon rollouts while offering better context awareness and robustness to occlusions.
  • DreamDojo demonstrates strong out-of-distribution generalization, especially in edited or novel environments, validated through human preference evaluations.
  • Downstream applications show DreamDojo reliably evaluates policies, enables effective model-based planning with significant performance gains, and supports real-time teleoperation.

Results show that incorporating more diverse human video datasets during pretraining consistently improves performance across out-of-distribution scenarios and counterfactual actions. The DreamDojo-14B variant achieves the strongest overall scores, particularly in SSIM and LPIPS metrics, indicating better structural and perceptual fidelity in generated videos. Adding the DreamDojo-HV dataset to the pretraining mixture further boosts generalization, especially on novel interactions not seen in robot training data.

The authors use latent action conditioning to bridge the performance gap between action-free pretraining and ideal ground-truth action setups, achieving near-parity with retargeted actions in simulation quality. Results show that latent actions significantly improve transfer from human videos while remaining scalable and practical without requiring specialized motion capture hardware.

The authors evaluate architectural and loss design choices by incrementally applying relative actions, chunked injection, and temporal consistency loss to a base model. Results show that each modification contributes to improved simulation quality on both expert and counterfactual trajectories, with the full combination achieving the best performance across all metrics. This confirms that precise action controllability and temporal coherence are critical for accurate dynamics prediction.

Against MANO-based conditioning, latent action conditioning likewise reaches near-parity in simulation quality, again without requiring specialized motion capture hardware.

The authors use latent action conditioning to bridge the gap between human video pretraining and robot action execution, achieving performance close to ideal setups that require precise motion capture. Results show that pretraining with latent actions significantly improves simulation quality across diverse evaluation benchmarks compared to action-free or no pretraining, particularly in out-of-distribution scenarios. This approach enables scalable and effective transfer of human interaction knowledge to robotic systems without relying on specialized hardware for action labeling.

