Fine-tune NVIDIA Cosmos Predict 2.5 for robot video with LoRA

NVIDIA has released a technical guide showing how to fine-tune the Cosmos Predict 2.5 world model efficiently for robot video generation using Low-Rank Adaptation (LoRA) and Weight-Decomposed Low-Rank Adaptation (DoRA). Cosmos Predict 2.5 is a large-scale model that generates physically plausible videos from text, images, or video clips. The base model offers general capabilities, but adapting it to a specific robotic domain such as manipulation requires targeted fine-tuning. Collecting real-world robot data is slow and expensive, which makes synthetic trajectory generation an attractive alternative. Full fine-tuning of the 2-billion-parameter model, however, is computationally prohibitive and risks catastrophic forgetting. LoRA and DoRA address this by injecting small trainable adapter modules into the frozen base model, which reduces memory requirements and lets adapters be swapped for different domains.

The guide outlines a workflow built on the diffusers and accelerate libraries that supports both single-GPU and multi-GPU training. It begins with data preparation: datasets are preprocessed into video and metadata formats similar to those used in GR00T Dreams projects. During training, the base weights of the video autoencoder, the text encoder, and the DiT diffusion transformer stay frozen; only the adapters, which are injected into the DiT's attention and feedforward layers, are trained. Users can choose standard LoRA or switch to DoRA, which decomposes each weight into a magnitude and a direction and may converge better, particularly at low ranks.

The training objective follows a rectified flow formulation: the model predicts the velocity needed to move from noise to clean data, conditioned on the first two frames of the input video. Training is launched with a provided shell script that accepts the learning rate, the LoRA rank, and the alpha scaling factor as parameters. Empirical results indicate that 100 epochs are sufficient for decent performance.
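
The adapter idea described above can be illustrated with a minimal NumPy sketch. It is not the guide's code: the layer sizes, the names `lora_weight` and `dora_weight`, and the initialisation are assumptions chosen for illustration, and the per-column norm follows the notation of the DoRA paper rather than any particular library.

```python
import numpy as np

def lora_weight(W, A, B, alpha):
    """Effective weight W + (alpha / r) * B @ A; W is the frozen base weight."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

def dora_weight(W, A, B, alpha, m):
    """DoRA sketch: unit-normalise each column of the LoRA-updated weight
    (the direction), then rescale it by a trainable magnitude m."""
    V = lora_weight(W, A, B, alpha)
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)
    return m * (V / col_norm)

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 48, 8, 16.0        # illustrative sizes only
W = rng.normal(size=(d_out, d_in))             # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01          # trainable, small random init
B = np.zeros((d_out, r))                       # trainable, zero init: adapter starts as a no-op
m = np.linalg.norm(W, axis=0, keepdims=True)   # trainable magnitude, initialised from W

x = rng.normal(size=(d_in,))
y_base = W @ x
y_lora = lora_weight(W, A, B, alpha) @ x
y_dora = dora_weight(W, A, B, alpha, m) @ x

# Trainable parameter counts for this one layer versus the full weight.
lora_params = A.size + B.size
dora_params = lora_params + m.size
full_params = W.size
```

With `B` initialised to zero, both adapted layers reproduce the base layer exactly, so fine-tuning starts from the pretrained behaviour while training only a fraction of the layer's parameters.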
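
The rectified flow objective can be sketched in the same spirit. The source does not give the exact time convention or how the two conditioning frames are handled, so this sketch assumes that `t = 0` is pure noise and `t = 1` is clean data, that conditioning frames are fed in clean, and that they are excluded from the loss. `predict_velocity` is a hypothetical stand-in for the DiT.

```python
import numpy as np

def interpolate(clean, noise, t):
    """Point on the straight line from noise (t=0) to clean data (t=1)."""
    return (1.0 - t) * noise + t * clean

def velocity_target(clean, noise):
    """Constant velocity along that line, pointing from noise to data."""
    return clean - noise

def rectified_flow_loss(predict_velocity, clean, noise, t, num_cond_frames=2):
    """Mean squared velocity error on the generated frames only.

    clean, noise: arrays of shape (frames, channels, height, width).
    predict_velocity: hypothetical callable (x_t, t) -> array shaped like clean.
    """
    x_t = interpolate(clean, noise, t)
    # Assumption: the conditioning frames are passed in without noise.
    x_t[:num_cond_frames] = clean[:num_cond_frames]
    pred = predict_velocity(x_t, t)
    sq_err = (pred - velocity_target(clean, noise)) ** 2
    # Assumption: only frames after the conditioning frames contribute.
    return float(sq_err[num_cond_frames:].mean())

rng = np.random.default_rng(0)
clean = rng.normal(size=(6, 4, 8, 8))   # toy video latent, 6 frames
noise = rng.normal(size=clean.shape)
t = 0.3

oracle = lambda x_t, t: clean - noise          # a perfect predictor
zero = lambda x_t, t: np.zeros_like(x_t)       # an untrained-like predictor
loss_oracle = rectified_flow_loss(oracle, clean, noise, t)
loss_zero = rectified_flow_loss(zero, clean, noise, t)
```

Because the path is a straight line, a single step of length `1 - t` along the true velocity takes `x_t` exactly to the clean data, which is what makes the velocity a convenient regression target.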
On a single NVIDIA H100, training takes approximately 17 hours; eight H100 GPUs reduce this to about 2.5 hours, a speedup of roughly 6.8x. Once training is complete, the saved adapter weights can be loaded alongside the base pipeline to generate synthetic robot trajectories. For inference, the adapters can be fused into the base model so that they add no extra computation.

The fine-tuned models are evaluated with three primary metrics: Sampson Error, which measures geometric consistency between frames and views; an LLM-as-a-Judge score for physical plausibility; and an instruction-following score. Qualitative analysis shows that the base model often hallucinates human hands or fails to follow instructions that specify which hand to use. Both LoRA and DoRA fine-tuning correct these issues, producing videos with stable motion and accurate object interactions. Quantitative results confirm that fine-tuning significantly lowers Sampson Error and improves both the physical plausibility and instruction-following scores.

Comparisons between LoRA and DoRA show that the two methods converge to similar performance at rank 32. A higher rank improves the model's ability to follow specific instructions, such as identifying the correct hand or object, but does not necessarily improve geometric consistency, which is likely captured by the frozen base weights. LoRA with rank 8 is sufficient for most adaptation needs and is recommended when memory is tightly constrained. DoRA may be preferable if training is unstable at low ranks or if a larger adapter budget is available. Overall, the approach offers a scalable, cost-effective way to generate high-quality synthetic robot data.
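
Fusing an adapter removes its inference cost because the low-rank update can be folded into the base weight once, ahead of time. A minimal NumPy sketch, with illustrative shapes and names that are not taken from the guide:

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, alpha = 32, 24, 8, 8.0   # illustrative sizes only
W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in))           # trained adapter factors
B = rng.normal(size=(d_out, r))
x = rng.normal(size=(d_in, 5))           # a small batch of inputs as columns
scale = alpha / r

def forward_unfused(x):
    """Base path plus a separate low-rank path: two extra matmuls per call."""
    return W @ x + scale * (B @ (A @ x))

# Fold the adapter into the weight once; inference is then a single matmul.
W_fused = W + scale * (B @ A)

def forward_fused(x):
    return W_fused @ x

max_diff = np.abs(forward_unfused(x) - forward_fused(x)).max()
```

The two paths agree up to floating-point rounding. In diffusers, pipelines with LoRA support expose this step as `fuse_lora()` after `load_lora_weights()`; the source does not say which calls the guide uses for Cosmos Predict 2.5.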
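
For readers unfamiliar with the Sampson Error, the standard textbook form can be sketched as follows. How the guide obtains the fundamental matrix and the point correspondences is not described in the source, so both are simply assumed as given here, and the function name is illustrative.

```python
import numpy as np

def sampson_error(F, x1, x2):
    """Squared Sampson distance of point matches under a fundamental matrix F.

    x1, x2: arrays of shape (N, 2) with matched pixel coordinates in two images.
    Returns an array of N values; 0 means the match satisfies the epipolar
    constraint exactly.
    """
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])               # homogeneous coordinates, (N, 3)
    p2 = np.hstack([x2, ones])
    Fp1 = p1 @ F.T                           # row i is F @ p1[i]
    Ftp2 = p2 @ F                            # row i is F.T @ p2[i]
    numer = np.sum(p2 * Fp1, axis=1) ** 2    # (p2^T F p1)^2
    denom = Fp1[:, 0] ** 2 + Fp1[:, 1] ** 2 + Ftp2[:, 0] ** 2 + Ftp2[:, 1] ** 2
    return numer / denom

# Toy case: pure horizontal camera translation, so matched points must
# lie on the same image row.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
x1 = np.array([[10.0, 20.0], [10.0, 20.0]])
x2 = np.array([[15.0, 20.0], [15.0, 22.0]])   # second match is 2 px off its row
errors = sampson_error(F, x1, x2)
```

In this toy case the first match has zero error and the second, which drifts two pixels off its epipolar line, has an error of 2.0. Lower values across a generated video indicate better geometric consistency.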
