HyperAI

One-Click Deployment of the Cosmos World Foundation Model


Tutorial Introduction

The Cosmos world foundation model family was released by NVIDIA in 2025. Open to the physical AI developer community, it is an advanced set of models trained on millions of hours of driving and robotics video data.

The models are neural networks that predict and generate physics-aware videos of the future states of a virtual environment, helping developers build the next generation of robots and autonomous vehicles (AVs).

Like large language models, world foundation models (WFMs) are a fundamental class of models: they take inputs including text, images, video, and motion and use them to generate and simulate virtual worlds that accurately model the spatial relationships of objects in a scene and their physical interactions.

At CES 2025, NVIDIA unveiled the first batch of Cosmos world foundation models for physics-based simulation and synthetic data generation, along with advanced tokenizers, guardrails, an accelerated data processing and curation pipeline, and frameworks for model customization and optimization.

The Cosmos world foundation models are a set of open diffusion and autoregressive Transformer models for physics-aware video generation. They were trained on 9,000 trillion tokens drawn from 20 million hours of real-world human interaction, environment, industrial, robotics, and driving data. The models come in three sizes: Nano, optimized for real-time, low-latency inference and edge deployment; Super, high-performance baseline models; and Ultra, maximum quality and fidelity, suitable for distilling custom models.

The related blog post is "CES 2025 | NVIDIA Opens Cosmos World Foundation Model to Physical AI Developer Community".

This tutorial uses the "Cosmos-1.0-Diffusion-7B-Text2World" demo. Because the model is large, an A6000 GPU is required to launch it.
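Before starting, you can optionally confirm from a terminal that the GPU is visible inside the container; nvidia-smi should list the A6000 (a quick sanity check, and the exact output format depends on the driver version):

nvidia-smi  # should list one RTX A6000 with 48 GB of memory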

How to run (after starting the container, wait about 15 seconds for initialization, then perform the following steps)

1. After cloning and starting the container

Open workspace > Open terminal

2. Enter the following command to activate the environment

conda activate ./cosmos
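(Optional) To verify that the environment activated correctly, you can run a quick check. This assumes the bundled environment ships PyTorch with CUDA support, which Cosmos inference requires:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect a version string and True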

3. Enter the following command to switch to the Cosmos directory

cd Cosmos

4. Enter the following command to start the model gradio interface

PYTHONPATH=$(pwd) python cosmos1/models/diffusion/inference/gradio_text2world.py \
  --checkpoint_dir /input0 \
  --diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World \
  --offload_prompt_upsampler \
  --offload_text_encoder_model \
  --offload_guardrail_models \
  --video_save_name Cosmos-1.0-Diffusion-7B-Text2World

Once port 8080 appears, open the API address on the right to access the Gradio interface.
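If you prefer to check from the terminal first, you can probe the port before opening the page (a minimal check, assuming the Gradio server binds to localhost:8080 inside the container):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080  # 200 means the Gradio interface is up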

Generate Video

After entering the Gradio interface, type a prompt into "Enter your prompt" and click "Submit" to run inference. The generated video appears once inference finishes (see the reference time below).

(Reference time: generating one 5-second video on an A6000 takes about 30 minutes; the generated video length defaults to 5 s and cannot be changed.)
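The clip is also written to disk under the name passed via --video_save_name. To find it from the terminal, a sketch assuming the repository's default output folder, outputs/ (the launch command above does not override it):

ls -lh outputs/Cosmos-1.0-Diffusion-7B-Text2World.mp4  # hypothetical path based on the repo's defaults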

Discussion and Exchange

🖌️ If you come across a high-quality project, please leave us a message to recommend it! We have also set up a tutorial exchange group; friends are welcome to scan the QR code with the note [Tutorial Exchange] to join the group, discuss technical issues, and share application results↓