The official PyTorch implementation of the paper "MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model".
Please visit our webpage for more details.
📢 17/Jun/24 - First release: pretrained models, training and test code.
- Custom Speech Tutorial
- Train autoencoder for FGD
conda create -n mmofusion python=3.7
conda activate mmofusion
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
pip install -r requirements.txt
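As a quick sanity check (ours, not part of the repo's scripts), you can verify that the CUDA build of PyTorch installed correctly:

```python
# Minimal sanity check (not part of the repo): confirms the CUDA 11.1 build of
# PyTorch from the install command above is working on this machine.
import torch

print(torch.__version__)          # expected: 1.10.1+cu111
print(torch.cuda.is_available())  # should print True if your driver matches the cu111 wheels
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```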
Download the BEAT dataset and choose the English data v0.2.1.
We preprocess the data following DiffuseStyleGesture; thanks for their great work!
Download the audio preprocessing model WavLM-Large and the text preprocessing model crawl-300d-2M.
cd ./process/
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ None None "v0" "step1" "cuda:0"
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ "/your/weights/WavLM-Large.pt" "/your/weights/crawl-300d-2M.vec" "v0" "step3" "cuda:0"
The processed data will be saved in /path/to/BEAT/processed/. Before converting it into an H5 file, you can split the data into train/val/test following our setting with the script data_split_30.ipynb. After that, you will get the H5 file BEAT_v0_train.h5 by running:
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ None None "v0" "step4" "cuda:0"
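To verify the conversion, here is a minimal sketch that lists the contents of the generated H5 file with h5py; the internal dataset layout is an assumption, since it depends on what the preprocessing script writes:

```python
# Sketch only (not part of the repo): prints every dataset in the generated
# H5 file, whatever group/dataset names step4 produced.
import h5py

with h5py.File("/path/to/BEAT/processed/BEAT_v0_train.h5", "r") as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```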
Then compute the mean and std, which will be saved in ./process/, by running:
python calculate_gesture_statistics.py --dataset BEAT --version "v0"
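If you want to double-check the statistics, here is a small sketch that loads them with NumPy; the exact .npy file names under ./process/ are assumptions and may differ from what calculate_gesture_statistics.py actually writes:

```python
# Sketch only: the file names below are hypothetical; adjust them to match
# what calculate_gesture_statistics.py saves in ./process/.
import numpy as np

mean = np.load("./process/gesture_BEAT_mean_v0.npy")  # hypothetical name
std = np.load("./process/gesture_BEAT_std_v0.npy")    # hypothetical name
print("mean:", mean.shape, "std:", std.shape)
print("near-zero std dims:", int((std < 1e-6).sum()))
```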
Download our pretrained models for motion generation with the upper body and the whole body.
You can also find the pretrained autoencoder model last_600000.bin, which we trained on data from 30 speakers.
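Assuming the released .bin files are standard PyTorch checkpoints, a quick sketch to confirm a download loads correctly:

```python
# Sketch only: assumes last_600000.bin is a regular torch checkpoint
# (a dict of tensors / state dicts); just lists its top-level keys.
import torch

ckpt = torch.load("last_600000.bin", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```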
Edit model_path and e_path in the config to load the pretrained models for testing, and tst_path to load the processed test data.
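Assuming the configs are plain YAML and PyYAML is available, a minimal sketch for checking the edited paths before sampling; the key names model_path, e_path, and tst_path come from the instructions above:

```python
# Sketch only: verifies the paths edited in the YAML config actually exist.
import os
import yaml

with open("./configs/mmofusion.yml") as f:
    cfg = yaml.safe_load(f)

for key in ("model_path", "e_path", "tst_path"):
    path = cfg.get(key)
    print(f"{key}: {path} (exists: {os.path.exists(str(path))})")
```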
for upper body
python sample_linear.py --config=./configs/mmofusion.yml --gpu 0
for whole body
python sample_linear.py --config=./configs/mmofusion_whole.yml --gpu 0
You can also modify the guidance weight guidance_param, since we use classifier-free guidance during training.
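For reference, classifier-free guidance mixes a conditional and an unconditional prediction at sampling time, so guidance_param controls how strongly the speech condition steers the generated motion. Below is a generic sketch of the standard formulation, not the repo's exact code:

```python
# Generic classifier-free guidance combination; a sketch of the standard
# formula, not MMoFusion's exact implementation.
def guided_prediction(model, x_t, t, cond, guidance_param):
    eps_cond = model(x_t, t, cond)    # prediction with the speech/style condition
    eps_uncond = model(x_t, t, None)  # assumes the model accepts a null condition
    # guidance_param = 1.0 recovers the purely conditional output;
    # larger values push the sample further toward the condition.
    return eps_uncond + guidance_param * (eps_cond - eps_uncond)
```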
Edit h5file in the config to load the H5 file BEAT_v0_train.h5.
for upper body
python train.py --config=./configs/mmofusion.yml --gpu 0
for whole body
...
If you find this repo useful for your research, please consider citing our paper:
@misc{wang2024mmofusion,
title={MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model},
author={Sen Wang and Jiangning Zhang and Weijian Cao and Xiaobin Hu and Moran Li and Xiaozhong Ji and Xin Tan and Mengtian Li and Zhifeng Xie and Chengjie Wang and Lizhuang Ma},
year={2024},
eprint={2403.02905},
archivePrefix={arXiv},
primaryClass={cs.MM}
}