The official PyTorch implementation of the paper "MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model".
Please visit our webpage for more details.
📢 17/Jun/24 - First release: pretrained models, training and test code.
- Custom Speech Tutorial
- Train autoencoder for FGD
conda create -n mmofusion python=3.7
conda activate mmofusion
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
pip install -r requirements.txt
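As a quick sanity check (ours, not part of the repo's scripts), you can verify that the CUDA build of PyTorch installed correctly:

```python
# Minimal sanity check (not part of the repo): confirms the CUDA 11.1 build of
# PyTorch from the install command above is working on this machine.
import torch

print(torch.__version__)          # expected: 1.10.1+cu111
print(torch.cuda.is_available())  # should print True if your driver matches the cu111 wheels
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```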
Download the BEAT dataset and choose the English data v0.2.1.
We preprocess the data following DiffuseStyleGesture; thanks for their great work!
Download the audio preprocessing model WavLM-Large and the text preprocessing model crawl-300d-2M.
cd ./process/
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ None None "v0" "step1" "cuda:0"
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ "/your/weights/WavLM-Large.pt" "/your/weights/crawl-300d-2M.vec" "v0" "step3" "cuda:0"
The processed data will be saved in /path/to/BEAT/processed/. Before converting it into an H5 file, you can split the data into train/val/test following our setting with the script data_split_30.ipynb. After that, you will get the H5 file BEAT_v0_train.h5 by running:
python process_BEAT_bvh.py /your/BEAT/path/ /path/to/BEAT/processed/ None None "v0" "step4" "cuda:0"
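To verify the conversion, here is a minimal sketch that lists the contents of the generated H5 file with h5py; the internal dataset layout is an assumption, since it depends on what the preprocessing script writes:

```python
# Sketch only (not part of the repo): prints every dataset in the generated
# H5 file, whatever group/dataset names step4 produced.
import h5py

with h5py.File("/path/to/BEAT/processed/BEAT_v0_train.h5", "r") as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```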
Then compute the mean and std, which will be saved in ./process/, by running:
python calculate_gesture_statistics.py --dataset BEAT --version "v0"
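If you want to double-check the statistics, here is a small sketch that loads them with NumPy; the exact .npy file names under ./process/ are assumptions and may differ from what calculate_gesture_statistics.py actually writes:

```python
# Sketch only: the file names below are hypothetical; adjust them to match
# what calculate_gesture_statistics.py saves in ./process/.
import numpy as np

mean = np.load("./process/gesture_BEAT_mean_v0.npy")  # hypothetical name
std = np.load("./process/gesture_BEAT_std_v0.npy")    # hypothetical name
print("mean:", mean.shape, "std:", std.shape)
print("near-zero std dims:", int((std < 1e-6).sum()))
```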
Download our pretrained models for motion generation with the upper body and the whole body.
You can also find the pretrained autoencoder model last_600000.bin, which we trained on data from 30 speakers.
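Assuming the released .bin files are standard PyTorch checkpoints, a quick sketch to confirm a download loads correctly:

```python
# Sketch only: assumes last_600000.bin is a regular torch checkpoint
# (a dict of tensors / state dicts); just lists its top-level keys.
import torch

ckpt = torch.load("last_600000.bin", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```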
Edit model_path and e_path in the config to load the pretrained models for testing, and tst_path to load the processed test data.
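Assuming the configs are plain YAML and PyYAML is available, a minimal sketch for checking the edited paths before sampling; the key names model_path, e_path, and tst_path come from the instructions above:

```python
# Sketch only: verifies the paths edited in the YAML config actually exist.
import os
import yaml

with open("./configs/mmofusion.yml") as f:
    cfg = yaml.safe_load(f)

for key in ("model_path", "e_path", "tst_path"):
    path = cfg.get(key)
    print(f"{key}: {path} (exists: {os.path.exists(str(path))})")
```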
for upper body
python sample_linear.py --config=./configs/mmofusion.yml --gpu 0
for whole body
python sample_linear.py --config=./configs/mmofusion_whole.yml --gpu 0
You can also modify the guidance weight guidance_param, since we use classifier-free guidance during training.
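For reference, classifier-free guidance mixes a conditional and an unconditional prediction at sampling time, so guidance_param controls how strongly the speech condition steers the generated motion. Below is a generic sketch of the standard formulation, not the repo's exact code:

```python
# Generic classifier-free guidance combination; a sketch of the standard
# formula, not MMoFusion's exact implementation.
def guided_prediction(model, x_t, t, cond, guidance_param):
    eps_cond = model(x_t, t, cond)    # prediction with the speech/style condition
    eps_uncond = model(x_t, t, None)  # assumes the model accepts a null condition
    # guidance_param = 1.0 recovers the purely conditional output;
    # larger values push the sample further toward the condition.
    return eps_uncond + guidance_param * (eps_cond - eps_uncond)
```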
Edit h5file in the config to load the H5 file BEAT_v0_train.h5.
for upper body
python train.py --config=./configs/mmofusion.yml --gpu 0
for whole body
...
If you find this repo useful for your research, please consider citing our paper:
@misc{wang2024mmofusion,
title={MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model},
author={Sen Wang and Jiangning Zhang and Weijian Cao and Xiaobin Hu and Moran Li and Xiaozhong Ji and Xin Tan and Mengtian Li and Zhifeng Xie and Chengjie Wang and Lizhuang Ma},
year={2024},
eprint={2403.02905},
archivePrefix={arXiv},
primaryClass={cs.MM}
}