HyperAI

A Milestone for General-Purpose Robots! MIT Proposes PoCo, a Policy Composition Framework That Tackles Heterogeneous Data Sources and Enables Flexible Multi-Task Robot Execution

9 months ago
Information
zhaorui

Eighteen humanoid robots served as "greeters", waving to guests in unison. This striking scene at the 2024 World Artificial Intelligence Conference gave visitors a direct sense of how quickly robotics has advanced this year.

Image source: Jiazi Light Year

The world's first programmable robot, "Unimate", traces back to a patent filed in 1954 and went into operation on a General Motors assembly line in 1961. More than half a century later, robots have evolved from bulky industrial machines into smarter, more flexible human assistants. Advances in artificial intelligence, especially breakthroughs in natural language processing and computer vision, have put robotics on a fast track. With huge computing power and massive data, general robot policies can be trained with simple algorithms such as behavioral cloning, and the potential of future robots is gradually being unlocked.

The Unimate robot. Image source: Baidu Encyclopedia

However, most current robot learning pipelines are trained for one specific task, which leaves them unable to cope with new situations or perform different tasks. In addition, robot training data comes mainly from simulation, human demonstration, and robot teleoperation, and these sources are highly heterogeneous. A single machine learning model struggles to integrate data from so many sources, so training robots with general policies has remained a major challenge.

In response, MIT researchers proposed PoCo (Policy Composition), a framework for composing robot policies. It uses the probabilistic composition of diffusion models to combine data from different domains and modalities, and it develops task-level, behavior-level, and domain-level composition methods for building complex robot policies. This addresses both data heterogeneity and task diversity in robot tool-use tasks. The research has been published on arXiv under the title "PoCo: Policy Composition from and for Heterogeneous Robot Learning".

Research highlights:
* Without any retraining, the PoCo framework can flexibly compose policies trained on data from different domains

* PoCo performs well on tool-use tasks in both simulation and the real world, and it generalizes better across environments than methods trained on a single domain

Paper address:
https://arxiv.org/abs/2402.02511

The open source project "awesome-ai4s" brings together more than 100 AI4S paper interpretations and provides massive data sets and tools:
https://github.com/hyperai/awesome-ai4s

Three datasets covering human and robot data, real and simulated data

The datasets used in this study come from three sources: human demonstration videos, real robot data, and simulation data.

Human Demonstration Video Dataset

Human demonstration videos can be collected in the wild with uncalibrated cameras. Up to 200 trajectories were collected in total.

Dataset processing pipeline: human demonstration videos collected in the wild with uncalibrated RGB-D cameras are converted into labeled trajectories

Real Robot Dataset

A wrist-mounted camera and an overhead camera provide local and global views of the scene, and a GelSight Svelte Hand collects the tool pose, the tool shape, and tactile information when the tool contacts an object. Between 50 and 100 demonstration trajectories were collected for each task.

Simulation Dataset

The simulation dataset follows Fleet-Tools, where expert demonstrations are generated through keypoint trajectory optimization. About 50,000 simulated data points were collected in total. During training, the researchers applied data augmentation to both the point cloud data and the action data, and they saved fixed simulation scenes for testing.

In addition, to make the model more robust, the researchers added point-wise noise and random dropping to the 512 tool and 512 object point clouds obtained from depth images and masks.
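The augmentation described above can be illustrated with a minimal sketch. The function name `augment_point_cloud`, the noise scale `noise_std`, and the drop probability `drop_prob` are assumptions chosen for illustration; the article does not give the values the researchers used.

```python
import random

def augment_point_cloud(points, noise_std=0.005, drop_prob=0.1, rng=None):
    """Add Gaussian noise to every coordinate, then randomly drop points.

    noise_std and drop_prob are illustrative values, not taken from the paper.
    """
    if not points:
        return []
    rng = rng or random.Random()
    # Point-wise noise: perturb each coordinate independently.
    noisy = [tuple(c + rng.gauss(0.0, noise_std) for c in p) for p in points]
    # Random dropping: keep each point with probability 1 - drop_prob.
    kept = [p for p in noisy if rng.random() >= drop_prob]
    # Keep at least one point so downstream code never sees an empty cloud.
    return kept if kept else [noisy[0]]

# A synthetic cloud of 512 random 3-D points stands in for a tool point cloud.
source_rng = random.Random(0)
tool_cloud = [tuple(source_rng.random() for _ in range(3)) for _ in range(512)]
augmented = augment_point_cloud(tool_cloud, rng=random.Random(1))
```

Because points are dropped at random, the augmented cloud has a variable size; a real pipeline would typically resample it back to a fixed number of points before feeding it to the policy.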

Composing policies through a product of probability distributions

In policy composition, the researchers are given trajectory information encoded by two probability distributions, $p_{D_M}(\cdot \mid c, T)$ and $p_{D'_{M'}}(\cdot \mid c', T')$, and they combine the information from both directly by sampling from the product distribution at inference time:

$$p_{\text{product}}(\tau) \propto p_{D_M}(\tau \mid c, T)\, p_{D'_{M'}}(\tau \mid c', T')$$

Here $\tau$ denotes a trajectory. $p_{\text{product}}$ assigns high likelihood to the trajectories that are likely under both distributions, so it effectively encodes the information of both.
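A toy example may make the product idea concrete. The sketch below assumes two one-dimensional Gaussian densities as stand-ins for the two policy distributions (real policies are diffusion models over trajectories). It shows that the normalized product is again a Gaussian with a precision-weighted mean, and that the gradient of its log-density equals the sum of the two individual gradients. All names and numbers are illustrative.

```python
def gaussian_score(x, mean, std):
    """Gradient of log N(x; mean, std^2) with respect to x."""
    return -(x - mean) / (std ** 2)

def product_of_gaussians(mean_a, std_a, mean_b, std_b):
    """Mean and std of the normalized product of two 1-D Gaussian densities."""
    prec_a, prec_b = 1.0 / std_a ** 2, 1.0 / std_b ** 2
    prec = prec_a + prec_b                      # precisions add
    mean = (prec_a * mean_a + prec_b * mean_b) / prec
    return mean, prec ** -0.5

# Toy "policies": one prefers actions near 0.0, the other near 2.0.
mean_p, std_p = product_of_gaussians(0.0, 1.0, 2.0, 0.5)

x = 0.7
composed = gaussian_score(x, 0.0, 1.0) + gaussian_score(x, 2.0, 0.5)
direct = gaussian_score(x, mean_p, std_p)
```

The product concentrates between the two modes and closer to the sharper distribution, which is why summing gradients at inference time is enough to sample from it without retraining either model.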

Policy composition with PoCo

The PoCo framework proposed by the researchers brings together information across behaviors, tasks, modalities, and domains. It requires no retraining: information is combined in a modular way at prediction time, and tool-use tasks can be generalized by drawing on information from multiple domains.

Policy composition diagram

PoCo assumes that the diffusion outputs of all models live in the same space, meaning that the action dimension and the action horizon are identical. At test time, PoCo composes the gradient predictions of the individual models. The method can compose policies from different domains, for example policies trained on different modalities such as images, point clouds, and tactile images. It can also compose policies for different tasks, and through behavior composition it can add cost functions that encode desired behaviors.
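The shared-output-space assumption can be sketched in a few lines. The helper below, `compose_predictions`, is a hypothetical name, and the weighted sum (which corresponds to a weighted product of distributions) with equal weights is an illustrative choice; the article does not state how the models are weighted. The helper refuses to combine predictions whose horizon or action dimension differ.

```python
def compose_predictions(predictions, weights):
    """Weighted sum of per-model predictions over one shared trajectory space.

    Each prediction is a list of `horizon` steps, each step a list of
    `action_dim` floats. Shapes must match across all models.
    """
    horizon, action_dim = len(predictions[0]), len(predictions[0][0])
    for pred in predictions:
        if len(pred) != horizon or any(len(step) != action_dim for step in pred):
            raise ValueError("all models must share horizon and action dimension")
    return [
        [sum(w * pred[t][d] for w, pred in zip(weights, predictions))
         for d in range(action_dim)]
        for t in range(horizon)
    ]

# Toy outputs of an image-based and a point-cloud-based policy
# (horizon 2, action dimension 2).
image_pred = [[1.0, 0.0], [0.0, 1.0]]
pointcloud_pred = [[0.0, 2.0], [2.0, 0.0]]
composed = compose_predictions([image_pred, pointcloud_pred], [0.5, 0.5])
```

In a full sampler this composed prediction would replace the single-model prediction at every denoising step.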

In this regard, the researchers provided three examples: Task-level Composition, Behavior-level Composition, and Domain-level Composition to illustrate how PoCo can improve policy performance.

Task-level Composition

Task-level composition puts extra weight on trajectories that are likely to complete task T, which improves the quality of the final synthesized trajectory. No separate policy has to be trained for each task; instead, a single general policy is trained that can achieve multiple task goals.

Behavior-level Composition

Behavior-level composition combines information from the task distribution and a cost objective, so that the synthesized trajectory both completes the task and optimizes the specified cost.

Domain-level Composition

Domain-level composition draws on information captured from different sensor modalities and domains, which is very useful for complementing data collected in separate domains. A typical case is when real robot data is expensive to collect but more accurate, while simulated demonstration data is cheaper to collect but less accurate. For data from different modalities within the same domain, feature concatenation can be used instead to simplify processing.
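As a rough illustration of sampling from a composition of two domain policies, the sketch below runs unadjusted Langevin dynamics on the summed gradients of two toy one-dimensional Gaussian "policies". This is a stand-in technique chosen for brevity, not the diffusion sampler used in the paper, and the names `sim_policy` and `human_policy` as well as all numeric values are assumptions.

```python
import math
import random

def gaussian_score(x, mean, std):
    """Gradient of log N(x; mean, std^2) with respect to x."""
    return -(x - mean) / (std ** 2)

def langevin_sample_mean(score_fns, steps=20000, step_size=0.01, seed=0):
    """Run unadjusted Langevin dynamics on the summed scores; return the chain mean."""
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(steps):
        score = sum(fn(x) for fn in score_fns)   # composition = sum of gradients
        x += 0.5 * step_size * score + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
        total += x
    return total / steps

# Toy domain policies over a single action value (illustrative numbers).
sim_policy = lambda a: gaussian_score(a, 0.0, 1.0)
human_policy = lambda a: gaussian_score(a, 2.0, 0.5)

composed_mean = langevin_sample_mean([sim_policy, human_policy])
```

The chain settles around the precision-weighted mean of the two toy policies (1.6 for these numbers), so the composed sampler favors actions that both domains consider plausible.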

Visualizing tool-use tasks and evaluating the three types of policy composition

Visualization of the tool-use tasks

For training, the researchers used a temporal U-Net with a denoising diffusion probabilistic model (DDPM) and 100 diffusion steps; at test time, they used a denoising diffusion implicit model (DDIM) with 32 steps. To compose diffusion models across different domains D and tasks T, they used the same action space for all models and applied a fixed normalization to the robot's action bounds.

The researchers evaluated PoCo on robot tasks that use common tools (wrench, hammer, spatula, and knife). A task counts as successful once a given threshold is reached; for example, the hammering task succeeds when the pin has been driven in.
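Two of the ingredients above, fixed normalization of the action bounds and running fewer steps at test time than during training, can be sketched as follows. The bounds, the mapping to [-1, 1], and the evenly spaced step schedule are assumptions for illustration; the article does not specify them.

```python
def normalize(action, low, high):
    """Map each action dimension from [low, high] to [-1, 1] using fixed bounds."""
    return [2.0 * (a - lo) / (hi - lo) - 1.0 for a, lo, hi in zip(action, low, high)]

def unnormalize(norm, low, high):
    """Inverse of normalize: map [-1, 1] back to [low, high]."""
    return [lo + (n + 1.0) * (hi - lo) / 2.0 for n, lo, hi in zip(norm, low, high)]

def inference_timesteps(num_train_steps=100, num_inference_steps=32):
    """Evenly spaced, strictly decreasing subset of the training timesteps."""
    stride = num_train_steps / num_inference_steps
    return [int(i * stride) for i in range(num_inference_steps)][::-1]

# Hypothetical bounds shared by every model so that outputs are comparable.
LOW, HIGH = [-1.0, 0.0], [1.0, 1.0]
normalized = normalize([0.0, 0.5], LOW, HIGH)
restored = unnormalize(normalized, LOW, HIGH)
steps = inference_timesteps()
```

Sharing one set of bounds across all models is what lets their predictions be added together: a value of 0.3 in one model's output must mean the same physical action as 0.3 in another's.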

Behavior-level composition improves desired behavioral objectives

The researchers composed behaviors such as smoothness and workspace constraints at test-time inference, with the composition weight fixed at $\gamma_c = 0.1$.

Effect of behavior composition. Composing costs probabilistically improves the metric of each cost objective

As the table above shows, test-time behavior-level composition improves desired behavioral objectives such as smoothness and workspace constraints.
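A minimal sketch of behavior-level composition with a smoothness cost follows, assuming a one-dimensional trajectory and a cost equal to the sum of squared differences between consecutive actions. The cost definition and the function names are illustrative; only the weight of 0.1 comes from the text. The composed gradient corresponds to a distribution proportional to the task distribution times exp(-weight * cost).

```python
def smoothness_cost(traj):
    """Sum of squared differences between consecutive actions."""
    return sum((b - a) ** 2 for a, b in zip(traj, traj[1:]))

def smoothness_grad(traj):
    """Analytic gradient of smoothness_cost with respect to each action."""
    grad = [0.0] * len(traj)
    for i in range(len(traj) - 1):
        diff = traj[i + 1] - traj[i]
        grad[i] -= 2.0 * diff
        grad[i + 1] += 2.0 * diff
    return grad

def compose_with_cost(task_score, traj, gamma_c=0.1):
    """Task gradient minus the weighted cost gradient (log-density form)."""
    return [s - gamma_c * g for s, g in zip(task_score, smoothness_grad(traj))]

traj = [0.0, 1.0, 0.0]                    # a jerky toy trajectory
task_score = [0.0, 0.0, 0.0]              # placeholder for the policy's gradient
composed = compose_with_cost(task_score, traj)
smoother = [x + c for x, c in zip(traj, composed)]
```

Following the composed gradient pulls the middle action toward its neighbors, so the cost drops while the task gradient (zero in this toy case) is left free to keep the trajectory on task.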

Task-level composition performs best in multi-task policy evaluation

When the task weight α = 0, the task-level composed policy reduces to the unconditioned multi-task policy; when α = 1, it reduces to the standard task-conditioned policy; when 0 < α < 1, it interpolates between the task-conditioned and unconditioned policies; and when α > 1, it yields trajectories that are tied more strongly to the task condition.

In simulated multi-task policy evaluation, task-level composition performed best

The figure above shows that composing the task-conditioned and unconditioned multi-task tool-use policies outperforms both the unconditioned and the task-conditioned multi-task diffusion policy used on its own.
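The role of α can be sketched with a simple interpolation rule that matches the four cases described above: the composed prediction starts from the unconditioned one and moves toward the task-conditioned one by a factor α. The function name and the toy vectors are illustrative assumptions.

```python
def task_level_compose(uncond, cond, alpha):
    """Blend unconditioned and task-conditioned predictions with task weight alpha.

    alpha = 0 returns the unconditioned prediction, alpha = 1 the conditioned
    one, values in between interpolate, and alpha > 1 extrapolates past the
    conditioned prediction.
    """
    return [u + alpha * (c - u) for u, c in zip(uncond, cond)]

uncond_pred = [0.0, 1.0]   # toy output of the unconditioned multi-task policy
cond_pred = [1.0, 3.0]     # toy output of the task-conditioned policy

at_zero = task_level_compose(uncond_pred, cond_pred, 0.0)
at_one = task_level_compose(uncond_pred, cond_pred, 1.0)
at_half = task_level_compose(uncond_pred, cond_pred, 0.5)
at_two = task_level_compose(uncond_pred, cond_pred, 2.0)
```

With α = 2 the result overshoots the conditioned prediction in the direction that distinguishes the task, which is how a weight above 1 emphasizes task-specific trajectories.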

Human plus simulated data: domain-level composition performs better

The researchers trained separate policy models on the simulation dataset ($\theta_{\text{sim}}$), the human dataset ($\theta_{\text{human}}$), and the robot dataset ($\theta_{\text{robot}}$), and evaluated domain-level composition in a simulated setting.

Using the simulation policy to support cross-domain composition, evaluated in simulation

Because $\theta_{\text{sim}}$ has no gap between its training and test domains, it performs well and reaches a 92% success rate. For weaker domains such as human data, the researchers composed the policy with the better-performing $\theta_{\text{sim}}$, which improved performance significantly.

The composed policy outperforms its individual components and generalizes better

The researchers applied PoCo to robot tool-use tasks, composing data from different domains and tasks to improve generalization. The four tasks were tightening a screw with a wrench, hammering a nail with a hammer, scooping a pancake out of a pan with a spatula, and cutting plasticine with a knife.

By composing policies trained on simulation, human, and real-robot data, PoCo generalizes across multiple distractors (row 1), different object and tool poses (row 2), and new object and tool instances (row 3).

Quantitative results of domain composition. Compared with its individual component policies, the composed policy significantly improves the average task success rate

As the table above shows, the policies trained on human data and on real-robot data each perform poorly in different scenarios (compared with simulation), yet their composition (Human+Real) can exceed each individual component.

Policy performance on different tool-use tasks. Task-level composition improves overall performance on tool-use tasks

In real-world experiments, the researchers evaluated policy performance on four different tool-use tasks and found that task-level composition improves overall performance. As the table above shows, the multi-task policy performs almost on par with the policies conditioned on the specific tasks $T_{\text{spatula}}$ and $T_{\text{hammer}}$, and all of them are fairly stable on fine-grained motions. In addition, the composition hyperparameters must stay within a certain range for the composed policy to remain effective and stable.

The best form for generality: the rise of humanoid robots

General-purpose robots have flourished over the past two years, and the industry currently seems to favor building them in humanoid form. Why must general-purpose robots be humanoid? Chen Zhe, managing director of 5Y Capital, said, "Because only humanoid robots can adapt to different interaction scenarios in the human living environment!" If robots are going to help humans with their work, it makes sense for them to imitate humans and learn in a human form.

Tesla set an industry benchmark when it unveiled its general-purpose humanoid robot Optimus in September 2022. Although it could not even walk steadily at first, it was already a complete humanoid prototype capable of the basic dexterous work that humans do. Tesla expected Optimus to gain more impressive capabilities as its software and hardware iterated, and that has proved to be the case.


At the 2024 World Artificial Intelligence Conference, Tesla presented the latest progress on Optimus: its upright walking speed has increased by 30%, and its ten fingers have gained perception and touch, so it can gently hold fragile eggs and steadily carry heavy boxes. Optimus has reportedly been trialed in Tesla factories, for example learning battery sorting by imitating human operations with visual neural networks and the FSD chip. More than 1,000 humanoid robots are expected to be helping with production tasks in Tesla factories next year.

Shanghai Fourier Intelligent Technology Co., Ltd., a general-purpose robotics company founded in 2015, also brought its humanoid robot GR-1 to the conference. Since its launch in 2023, GR-1 has been among the first to reach mass production and delivery, with upgrades in environmental perception, simulation models, motion control optimization, and other areas.

In addition, in March of this year, NVIDIA announced a humanoid robot project called GR00T at its annual GTC developer conference. By understanding natural language and imitating movements observed in human behavior, robots can quickly learn coordination, dexterity, and other skills for navigating, adapting to, and interacting with the real world.

As science and technology continue to advance, there is reason to believe that humanoid robots may become a bridge between humans and machines and between the present and the future, leading toward a smarter and better society.
