Command Palette
Search for a command to run...
코스모스 정책: 시각운동 제어 및 계획을 위한 비디오 모델의 미세 조정
코스모스 정책: 시각운동 제어 및 계획을 위한 비디오 모델의 미세 조정
초록
최근의 영상 생성 모델들은 시간에 따른 복잡한 물리적 상호작용과 장면의 변화를 뛰어난 능력으로 포착하고 있다. 이러한 모델의 공간-시간적 사전 지식을 활용하기 위해 로봇 분야의 연구들은 영상 모델을 정책 학습에 적응시켰지만, 행동 생성을 위해 복수의 후속 학습 단계와 새로운 아키텍처 구성 요소를 도입함으로써 복잡성을 증가시켰다. 본 연구에서는, 타겟 플랫폼에서 수집한 로봇 시연 데이터를 단일 단계의 후속 학습만으로 대규모 사전 학습된 영상 모델(Cosmos-Predict2)을 효과적인 로봇 정책으로 변환하는 간단한 접근법인 Cosmos Policy를 제안한다. 이 방법은 아키텍처의 수정 없이도 가능하다. Cosmos Policy는 영상 모델의 잠재 확산 과정 내에서 액션을 잠재 프레임으로 직접 생성하도록 학습하며, 사전 학습된 사전 지식과 핵심 학습 알고리즘을 활용하여 복잡한 행동 분포를 효과적으로 포착한다. 또한, Cosmos Policy는 미래 상태 이미지와 가치(예상 누적 보상)를 동일하게 잠재 프레임 형태로 생성하여, 테스트 시점에서 성공 확률이 높은 행동 경로를 계획할 수 있도록 한다. 평가 결과, Cosmos Policy는 LIBERO 및 RoboCasa 시뮬레이션 벤치마크에서 각각 98.5%, 67.1%의 평균 성공률로 최신 기술 수준을 달성했으며, 도전적인 실세계 양손 조작 작업에서 가장 높은 평균 점수를 기록하여, 처음부터 학습한 강력한 확산 정책, 영상 모델 기반 정책, 동일한 로봇 시연 데이터로 미세 조정된 최신 비전-언어-액션 모델들을 모두 능가했다. 더불어, 정책 전개 데이터를 제공받으면 Cosmos Policy는 경험을 통해 자신의 세계 모델과 가치 함수를 개선하고, 모델 기반 계획을 활용하여 도전적인 작업에서 더욱 높은 성공률을 달성할 수 있다. 본 연구의 코드, 모델, 학습 데이터는 https://research.nvidia.com/labs/dir/cosmos-policy/ 에 공개된다.
One-sentence Summary
Researchers from NVIDIA and Stanford introduce Cosmos Policy, a robot policy fine-tuned from Cosmos-Predict2-2B via single-stage training, generating actions, future states, and values as latent frames without architectural changes, achieving state-of-the-art results in simulation and real-world bimanual tasks.
Key Contributions
- Cosmos Policy adapts the pretrained video model Cosmos-Predict2 into a robot policy via a single-stage fine-tuning on robot demonstration data, requiring no architectural changes and directly encoding actions as latent frames within the model’s diffusion process to leverage its spatiotemporal priors.
- It jointly predicts robot actions, future state images, and expected cumulative rewards as latent frames, enabling test-time planning through best-of-N sampling that selects high-value action trajectories, improving task success rates in both simulation and real-world bimanual manipulation.
- Evaluated on LIBERO (98.5% avg success) and RoboCasa (67.1% avg success) benchmarks and real-world tasks (93.6% avg success), it outperforms diffusion policies, video-based policies, and fine-tuned vision-language-action models, with further gains from model-based planning and experience-driven refinement.
Introduction
The authors leverage pretrained video generation models—which capture spatiotemporal dynamics and implicit physics—to build robot policies without modifying the model’s architecture. Prior approaches either require multiple training stages with custom action modules or train unified models from scratch, missing the benefit of pretrained video priors. Cosmos Policy fine-tunes a single video model in one stage to jointly generate robot actions, future states, and value predictions—all encoded as latent frames—enabling both direct control and model-based planning via best-of-N sampling. It achieves state-of-the-art performance across simulation and real-world bimanual tasks, and further improves with experience-driven refinement of its world model and value function.
Dataset
-
The authors use three distinct datasets: LIBERO (simulation), RoboCasa (simulation), and ALOHA (real-world), each targeting different robotic manipulation challenges.
-
LIBERO includes four task suites (Spatial, Object, Goal, Long) with 500 demonstrations total (10 tasks × 50 demos). For policy training, only successful demos are used; world model and value function training use all demos, filtered or not.
-
RoboCasa features 24 kitchen tasks with 50 human teleoperated demos per task. The authors train exclusively on these 50 human demos per task to evaluate data efficiency, filtering failures for policy training but retaining all for world model/value function training.
-
ALOHA comprises four bimanual real-world tasks: “put X on plate” (80 demos), “fold shirt” (15 demos), “put candies in bowl” (45 demos), and “put candy in ziploc bag” (45 demos). Policies receive proprioception, three camera views, and task descriptions, outputting 50-timestep action chunks.
-
All evaluations use fixed initial states for fair comparison. LIBERO results average 6000 trials (3 seeds × 10 tasks × 50 episodes); RoboCasa averages 3600 trials (24 tasks × 5 scenes × 10 trials × 3 seeds); ALOHA uses 101 total trials across tasks.
-
The authors reduce ALOHA’s control frequency from 50 Hz to 25 Hz for efficiency. No cropping is mentioned; metadata includes task descriptions, camera images, and joint states. Training splits follow benchmark protocols, with no additional data mixing or augmentation described.
Method
The authors leverage the pretrained Cosmos-Predict2-2B-Video2World model, a latent video diffusion model, as the foundation for Cosmos Policy. This model is initialized with a diffusion transformer architecture that operates over continuous tokens encoded by the Wan2.1 spatiotemporal VAE tokenizer. The core training objective for the denoiser network Dθ is formulated as an EDM denoising score matching problem, where the network learns to recover a clean VAE-encoded image sequence x0 from a corrupted version x0+n, conditioned on a textual description c and the noise level σ. The network conditions on the text via cross-attention and on the noise level via adaptive layer normalization. The Wan2.1 tokenizer compresses a video sequence of size (1+T)×H×W×3 into a latent sequence of size (1+T′)×H′×W′×16, where T′=T/4, H′=H/8, and W′=W/8. The first frame is preserved without temporal compression to allow conditioning on a single input image, and a conditioning mask ensures this initial frame remains clean during training.
To adapt this video model for robotic control, the authors introduce a novel mechanism called latent frame injection, which allows the model to incorporate new modalities without architectural changes. This process involves inserting additional latent frames directly into the video model's latent diffusion sequence. For a robotic platform with multiple camera views and robot-specific data, the latent sequence is structured to include both image frames and non-image modalities. For instance, in a setup with two third-person cameras and a wrist-mounted camera, the sequence contains 11 latent frames: a blank placeholder, robot proprioception, wrist camera image, first third-person camera image, second third-person camera image, action chunk, future robot proprioception, future wrist camera image, future first third-person camera image, future second third-person camera image, and future state value. The new modalities—robot proprioception, action chunk, and state value—are encoded as latent frames by filling each H′×W′×C′ latent volume with normalized and duplicated copies of the respective modality data. This ordering of modalities in the sequence represents (s,a,s′,V(s′)), enabling autoregressive decoding of actions, future states, and future state values from left to right.
The model is trained on robot data to jointly learn a policy, a world model, and a value function. For each training step, a batch of (s,a,s′,V(s′)) tuples is sampled. Half of the batch is drawn from expert demonstrations to train the policy, which learns to predict the joint distribution p(a,s′,V(s′)∣s). The other half is sampled from a rollouts dataset and split to train the world model p(s′,V(s′)∣s,a) and the value function p(V(s′)∣s,a,s′). The conditioning scheme, which determines which parts of the latent sequence are used as input and which are the targets, dictates which function is being trained. The policy and world model training involve auxiliary targets, which improve performance. During initial training, the value function predictions are conditioned on the full latent prefix (s,a,s′). However, when fine-tuning on rollout data, the value generation can be conditioned on a subset of (s,a,s′) via input masking, allowing the value function to represent either the state value V(s′) or the state-action value Q(s,a).
The authors also modify the noise distribution used during training and inference to improve the accuracy of action predictions. The base model's log-normal noise distribution, which concentrates training weight at lower noise levels, is insufficient for the precise action generation required in robotics. To address this, they use a hybrid log-normal-uniform distribution that assigns greater weight to higher noise levels. This is implemented by sampling from the original log-normal distribution with probability 0.7 and from a uniform distribution over [1.0, 85.0] with probability 0.3. At inference time, they set a higher lower bound for the noise level, σmin=4, to avoid the low signal-to-noise ratio at very small noise levels, which empirically improves prediction accuracy for actions, future states, and values.
Experiment
- Cosmos Policy outperforms state-of-the-art imitation learning methods on LIBERO (98.5% success rate), RoboCasa (67.1% success rate with only 50 demonstrations), and real-world ALOHA tasks (highest overall score across four tasks), despite using fewer training samples than most comparators.
- Ablation studies confirm that auxiliary training objectives (joint prediction of actions, next states, and values) and video model priors are critical: removing them causes 1.5% and 3.9% absolute drops in success rate, respectively; training from scratch yields jerky, unsafe motions and 18.7-point lower score on “fold shirt.”
- With model-based planning using refined world models and V(s’) value functions, Cosmos Policy achieves a 12.5-point average score gain on challenging ALOHA tasks (“put candies in bowl” and “put candy in ziploc bag”), outperforming model-free Q(s,a) planning variants due to better leveraging of learned dynamics.
- Planning requires 4.9 seconds per action on 8 H100 GPUs, limiting real-time applicability; inference latency can be reduced to 0.16s per action chunk with only 0.5% success rate drop using 1 denoising step.
Results show that Cosmos Policy achieves the highest average success rate of 67.1% across 24 kitchen manipulation tasks in the RoboCasa benchmark, outperforming all compared methods while using significantly fewer training demonstrations (50 per task). The authors use this result to demonstrate that their approach establishes a new state of the art in multi-task manipulation performance with efficient data usage.

The authors use the table to evaluate the impact of auxiliary losses and pretrained models on Cosmos Policy's performance. Results show that removing auxiliary losses leads to a 1.5% drop in average success rate, while training without a pretrained model results in a 3.9% drop, indicating both components are important for optimal performance.

The authors evaluate Cosmos Policy with model-based planning on challenging ALOHA robot tasks, comparing model-based planning using a state value function V(s′) against model-free planning using a Q-value function Q(s,a). Results show that model-based planning with V(s′) achieves higher scores than both the base policy and the model-free variant, with an average score of 84.0 compared to 70.7 for model-free planning.

Results show that Cosmos Policy achieves the highest average score across all four ALOHA robot tasks, outperforming all compared methods in three of the four tasks. It attains the highest overall score of 93.6, significantly surpassing Diffusion Policy and OpenVLA-OFT+ in the "fold shirt" and "put candy in ziploc bag" tasks, while also achieving strong performance on the other two tasks.

Results show that Cosmos Policy achieves the highest average success rate of 98.5% across all tasks in the LIBERO benchmark, outperforming all compared methods, including Diffusion Policy, Dita, and UVA, with particularly strong performance in the Object and Goal SR metrics. The authors use this table to demonstrate that Cosmos Policy sets a new state-of-the-art in imitation learning for multi-task manipulation tasks.
