LeRobotDataset:v3.0 Launches with Multi-Episode Storage and Streaming Support for Scalable Robotics Research

Today we announce the release of LeRobotDataset:v3.0, a major update to our standardized dataset format designed for robot learning. In previous versions, each episode was stored in a separate file, which created performance bottlenecks when scaling to millions of episodes. LeRobotDataset:v3.0 overcomes this by packing multiple episodes into single files and using relational metadata to retrieve episode-level information efficiently. The new format also introduces native support for streaming, so users can process massive datasets on the fly without downloading them entirely.

The update brings key improvements to scalability and usability. Tabular data, such as joint states and actions, is stored in Apache Parquet files, while visual data is concatenated into MP4 videos to reduce file system strain. Metadata, stored in JSON and Parquet files, now includes episode boundaries, task descriptions, frame rates, and normalization statistics, making it easier to search and index datasets on the Hugging Face Hub.

A major highlight is the ability to use any dataset in streaming mode via the StreamingLeRobotDataset class. This gives direct access to data on the Hugging Face Hub without local storage, which lowers the barrier to entry for large-scale robot learning.

To help users transition, we provide a one-line tool to convert existing v2.1 datasets to the new v3.0 format:

python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=

The new format is already available in the latest pre-release version of lerobot (v0.3.x), which can be installed via pip from GitHub. This version supports recording new datasets with real robots such as the SO-101 arm and storing them directly on the Hugging Face Hub. LeRobotDataset:v3.0 integrates with PyTorch, allowing users to load data with a single line of code.
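
The multi-episode layout described above can be illustrated with a small, self-contained sketch. It is not the lerobot implementation: plain Python lists stand in for Parquet or MP4 files, and the names EpisodeMeta, pack_episodes, and load_episode are hypothetical. It only shows how recording episode boundaries as metadata lets a single episode be read back from a shared file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeMeta:
    """Hypothetical metadata row: where one episode lives inside a shared file."""
    episode_index: int
    file_index: int
    start: int  # first row of the episode (inclusive)
    stop: int   # one past the last row of the episode (exclusive)


def pack_episodes(episodes, max_rows_per_file):
    """Concatenate episodes into shared files (lists here) and record their boundaries."""
    files = [[]]
    metadata = []
    for episode_index, frames in enumerate(episodes):
        # Open a new file if this episode would overflow the current one.
        # In this sketch an episode is never split across two files.
        if files[-1] and len(files[-1]) + len(frames) > max_rows_per_file:
            files.append([])
        start = len(files[-1])
        files[-1].extend(frames)
        metadata.append(
            EpisodeMeta(episode_index, len(files) - 1, start, start + len(frames))
        )
    return files, metadata


def load_episode(files, metadata, episode_index):
    """Look up the episode's boundaries, then slice only its rows out of the shared file."""
    meta = metadata[episode_index]
    return files[meta.file_index][meta.start:meta.stop]


# Three toy episodes of different lengths; each integer stands in for one frame.
episodes = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
files, metadata = pack_episodes(episodes, max_rows_per_file=5)
print(len(files), load_episode(files, metadata, 1))
```

With these toy inputs, three episodes end up in two files instead of three, and any episode can still be recovered from its metadata row alone.
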
The dataset supports advanced features such as temporal windowing through the delta_timestamps argument, which gives models access to past and future frames relative to a given observation. When the dataset is wrapped in torch.utils.data.DataLoader, samples are batched into tensors automatically, simplifying training workflows for reinforcement learning and behavioral cloning.

This release marks a significant step toward democratizing access to large-scale robotics data. With efficient storage, fast retrieval, and streaming, LeRobotDataset:v3.0 lets researchers and developers train on millions of episodes without the burden of local storage or complex data management.

We invite the community to try the new format, share feedback on GitHub or in our Discord server, and help shape the future of open robotics research. Special thanks to the yaak.ai team for their support and collaboration during development. We’re excited to continue building this ecosystem together.
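
As a supplement to the delta_timestamps description above, the following minimal sketch shows one way time offsets in seconds could be turned into frame indices around an observation. The function names to_delta_indices and window_indices are hypothetical, and clamping at episode edges with a padding flag is an assumption of this sketch, not a statement of how lerobot handles boundaries.

```python
def to_delta_indices(delta_timestamps, fps):
    """Convert time offsets in seconds into whole-frame offsets at the given frame rate."""
    return [round(dt * fps) for dt in delta_timestamps]


def window_indices(frame_index, delta_indices, ep_start, ep_stop):
    """Return the frame indices to fetch, plus flags marking out-of-episode requests.

    Assumption of this sketch: requests that fall outside [ep_start, ep_stop)
    are clamped to the nearest valid frame and flagged as padding.
    """
    indices, is_pad = [], []
    for delta in delta_indices:
        target = frame_index + delta
        clamped = min(max(target, ep_start), ep_stop - 1)
        indices.append(clamped)
        is_pad.append(clamped != target)
    return indices, is_pad


# One past frame, the current frame, and one future frame, 0.1 s apart at 30 fps.
deltas = to_delta_indices([-0.1, 0.0, 0.1], fps=30)

# Near the start of a 10-frame episode the past frame does not exist.
early = window_indices(1, deltas, ep_start=0, ep_stop=10)

# Near the end of the same episode the future frame does not exist.
late = window_indices(8, deltas, ep_start=0, ep_stop=10)
print(deltas, early, late)
```

The padding flags let a training loop mask out frames that were filled in rather than observed, so windows near episode edges do not leak data from neighboring episodes stored in the same file.
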
