Controllable generation of 3D human motions has become an important topic as the world embraces digital transformation. Existing works, though making promising progress with the advent of diffusion models, rely heavily on meticulously captured and annotated (e.g., with text) high-quality motion corpora, a resource-intensive endeavor in the real world. This motivates our proposed MotionMix, a simple yet effective weakly-supervised diffusion model that leverages both noisy annotated and clean unannotated motion sequences. Specifically, we separate the denoising objective into two stages: obtaining conditional rough motion approximations in the initial \(T-T^*\) steps by learning on annotated noisy motions, followed by unconditional refinement of these preliminary motions during the last \(T^*\) steps using unannotated motions. Notably, despite learning from two sources of imperfect data simultaneously, our model does not compromise motion generation quality compared to fully supervised approaches trained on gold data. Extensive experiments on several benchmarks demonstrate that MotionMix, as a versatile framework, consistently achieves state-of-the-art performance on text-to-motion, action-to-motion, and music-to-dance tasks.
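To make the two-stage split concrete, below is a minimal sampling-loop sketch in Python. It is an illustration only, not the paper's actual implementation: the model.denoise_step interface, the motion tensor shape, and the value of T_star are hypothetical stand-ins.

import torch

def motionmix_sample(model, cond, T=1000, T_star=200, motion_shape=(196, 263)):
    """Sketch of MotionMix's two-stage denoising (hypothetical API).

    Steps T..T*+1: conditional denoising guided by the annotation `cond`,
    learned from noisy annotated motions, yields a rough approximation.
    Steps T*..1: unconditional refinement, learned from clean unannotated
    motions, polishes the rough approximation into a clean sequence.
    """
    x = torch.randn(1, *motion_shape)  # start from pure Gaussian noise
    for t in reversed(range(1, T + 1)):
        if t > T_star:
            # Stage 1: conditional rough motion approximation
            x = model.denoise_step(x, t, cond=cond)
        else:
            # Stage 2: unconditional refinement (condition dropped)
            x = model.denoise_step(x, t, cond=None)
    return x

The key design choice this illustrates is that the condition only steers the early, high-noise steps, so annotation quality matters less there, while the final low-noise steps are refined by a model trained purely on clean, unannotated motions.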
MotionMix pioneers a new training paradigm for conditional human motion generation by training on both noisy annotated and clean unannotated data. Under this weakly-supervised setting, it matches or even outperforms its fully supervised variants, demonstrating versatility across text-to-motion, action-to-motion, and music-to-dance tasks. MotionMix adapts to various benchmarks and diffusion architectures, is resilient to noise as validated through ablation studies, and offers a potent solution to data scarcity.
We present the outputs of EDGE (MotionMix), trained with imperfect data sources, compared with its baseline backbone EDGE, trained with gold data.
Training with real data (AIST++ and AMASS) further improves MotionMix, producing dances with less foot skating.
Please unmute to hear the music.
We present the outputs of MDM (MotionMix), trained with imperfect data sources, compared with its baseline backbone MDM, trained with gold data.
@misc{hoang2024motionmix,
  title={MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation},
  author={Nhat M. Hoang and Kehong Gong and Chuan Guo and Michael Bi Mi},
  year={2024},
  eprint={2401.11115},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}