Free-T2M: Frequency Enhanced Text-to-Motion Diffusion Model With Consistency Loss

Rapid progress in text-to-motion generation has been largely driven by diffusion models. However, existing methods focus solely on temporal modeling, thereby overlooking frequency-domain analysis. We identify two key phases in motion denoising: the semantic planning stage and the fine-grained improving stage. To address these phases effectively, we propose the Frequency enhanced text-to-motion diffusion model (Free-T2M), which incorporates stage-specific consistency losses that enhance the robustness of static features and improve fine-grained accuracy. Extensive experiments demonstrate the effectiveness of our method. Specifically, on StableMoFusion, our method reduces the FID from 0.189 to 0.051, establishing new SOTA performance within the diffusion architecture. These findings highlight the importance of incorporating frequency-domain insights into text-to-motion generation for more precise and robust results.