
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov
Abstract

Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models become larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.
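To make the key idea concrete, below is a minimal sketch of noise-conditioned expert routing as described in the abstract: the router selects sparse feed-forward experts based only on the diffusion noise level, which is why expert assignments can be precomputed and cached per noise level at inference. All class and variable names (NoiseConditionedMoE, noise_emb, etc.) are hypothetical illustrations, not the authors' implementation, and the dimensions are arbitrary.

```python
# Hypothetical sketch of noise-conditioned mixture-of-experts routing.
# Not the MoDE reference implementation; names and sizes are assumptions.
import torch
import torch.nn as nn


class NoiseConditionedMoE(nn.Module):
    def __init__(self, dim=256, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One feed-forward expert per slot; only top_k are active per forward pass.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # Router depends only on the embedded noise level, not on the action tokens,
        # so the chosen experts can be cached once per denoising step.
        self.router = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_experts)
        )

    def forward(self, x, noise_emb):
        # x: (batch, seq, dim) action tokens; noise_emb: (batch, dim) noise-level embedding.
        logits = self.router(noise_emb)                              # (batch, num_experts)
        weights, idx = torch.topk(logits.softmax(-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for b in range(x.shape[0]):
            for w, e in zip(weights[b], idx[b]):
                out[b] = out[b] + w * self.experts[e](x[b])          # sparse weighted sum
        return out


# Usage example
moe = NoiseConditionedMoE()
tokens = torch.randn(2, 10, 256)
noise_emb = torch.randn(2, 256)
print(moe(tokens, noise_emb).shape)  # torch.Size([2, 10, 256])
```

Because the routing decision in this sketch depends only on the noise level, the per-level expert selection can be computed once and reused for every token and every rollout at that denoising step, which is the intuition behind the reported inference-cost reduction via expert caching.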
