BMVC'23 Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading
This is the official implementation of the BMVC 2023 paper "Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading", a speaker-adaptive lip-reading method.
.
├── config.py # Configuration file for training parameters and data paths
├── cvtransforms.py # Computer vision transformations for data augmentation
├── dataloader # Data loading utilities
│ ├── dataset_op.py # Dataset used when speaker ID information is needed
│ ├── dataset_pl.py # Dataset implemented with PyTorch Lightning
├── img
│ ├── overview.jpg
│ └── overview.pdf
├── label_sorted.txt # Sorted word labels of LRW
├── models # Model architectures
│ ├── model_enhance.py # Feature enhancement model
│ ├── model_ensemble.py # Ensemble model
│ ├── model_r2plus1d.py # Baseline lip-reading model
│ ├── model_SD.py # Speaker verification module
├── README.md # Project documentation
├── requirements.txt # Project dependencies
├── scripts # Data preparation scripts
│ └── prepare_lrw.py # Script for preparing the LRW dataset
├── train_baseline.py # Training script for the baseline model
├── train_enhance.py # Training script for the feature enhancement model
├── train_ensemble.py # Training script for the ensemble model
└── train_SD.py # Training script for the speaker verification module
- Download the LRW dataset.
- Run scripts/prepare_lrw.py to generate the training samples of LRW:
python scripts/prepare_lrw.py
The mouth videos, labels, and word boundary information will be saved in .pkl format. We pack each image sequence as JPEG into the .pkl files and decode it with PyTurboJPEG. Please remember to set path in config.py to the location where you have downloaded and pre-processed the dataset.
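For reference, here is a minimal sketch of how such a .pkl sample might be read back with PyTurboJPEG; the file name and the dictionary keys ('video', 'label') are illustrative assumptions, not the exact field names produced by prepare_lrw.py:

```python
import pickle
import numpy as np
from turbojpeg import TurboJPEG  # PyTurboJPEG

jpeg = TurboJPEG()

# Load one pre-processed sample (path and key names are assumptions for illustration).
with open('ABOUT_00001.pkl', 'rb') as f:
    sample = pickle.load(f)

# Each frame is stored as a JPEG-encoded byte string; decode the sequence back to arrays.
frames = [jpeg.decode(buf) for buf in sample['video']]
video = np.stack(frames)  # (T, H, W, C)
print(video.shape, sample.get('label'))
```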
LRW-ID is a re-partitioning of the LRW data. You need to download the corresponding splits from LRW-ID. Please remember to set split_path in config.py to the location where you have downloaded them.
Set up environment
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113
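After installation, a quick sanity check that the CUDA-enabled PyTorch build was picked up (not part of the original scripts):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"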
We propose a 3-step training strategy for our speaker-adaptive model, focusing on a separate task at each step. Please refer to the 'Training Details' subsection in the 'More Detailed Experiments' section of the supplementary materials.
Firstly, we train the speaker verification module (the left branch in the overview figure) with $L^{ID}_{triple}$ and the lip-reading module (the right branch) with $L^{VSR}_{CE}$ separately.
Then, we introduce the feature enhancement module together with the learned speaker verification module and the lip-reading module to continue the training process.
Finally, we freeze the feature enhancement module and the speaker verification module, and introduce the suppression module to continue training until convergence.
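As a rough illustration of the loss terms above, here is a minimal PyTorch sketch; the margin, the 256-dimensional embeddings, and the batch shapes are assumptions for illustration and do not reproduce the exact implementation in models/:

```python
import torch
import torch.nn as nn

# L^{VSR}_{CE}: word-level cross-entropy for the lip-reading branch.
ce_loss = nn.CrossEntropyLoss()
# L^{ID}_{triple}: triplet margin loss for the speaker verification branch
# (anchor/positive from the same speaker, negative from a different speaker).
triplet_loss = nn.TripletMarginLoss(margin=1.0)

# Toy tensors just to show the shapes involved (LRW has 500 word classes).
word_logits = torch.randn(4, 500)
word_labels = torch.randint(0, 500, (4,))
anchor, positive, negative = (torch.randn(4, 256) for _ in range(3))

l_vsr = ce_loss(word_logits, word_labels)        # trains the lip-reading branch
l_id = triplet_loss(anchor, positive, negative)  # trains the speaker branch
```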
Train the speaker verification module (step 1). Please set up config.py, e.g.
path = '/data1/lrw_roi_80_116_175_211_npy_gray_pkl_jpeg/'
split_path = '/home/luosongtao/code/LRW_ID-main/Splits/'
random_seed = 251
batch_size = 130
gpus = 2
base_lr = 2e-4 * batch_size/32.0
num_workers = 8
max_epoch = 40
resume_path = None
reg = 0.5
precision = 16
verison = 0
alpha = 0.1
and train with the command:
python train_SD.py
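The config fields above correspond to standard PyTorch Lightning training options. A hedged sketch of how they might feed a Trainer (argument names follow Lightning 1.x and may differ from the actual training scripts):

```python
import pytorch_lightning as pl
import config  # the repository's config.py

# Map the config values onto a Lightning 1.x Trainer (illustrative sketch only).
trainer = pl.Trainer(
    gpus=config.gpus,                           # e.g. 2
    precision=config.precision,                 # e.g. 16 for mixed precision
    max_epochs=config.max_epoch,                # e.g. 40 for the SD module
    resume_from_checkpoint=config.resume_path,  # None to start from scratch
)
# trainer.fit(model, datamodule)  # model/datamodule are defined by the training script
```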
Train the baseline lip-reading model (step 1). Please set up config.py, e.g.
path = '/data1/lrw_roi_80_116_175_211_npy_gray_pkl_jpeg/'
split_path = '/home/luosongtao/code/LRW_ID-main/Splits/'
random_seed = 251
batch_size = 130
gpus = 2
base_lr = 2e-4 * batch_size/32.0
num_workers = 8
max_epoch = 10
resume_path = None
reg = 0.5
precision = 16
verison = 0
alpha = 0.1
and train with the command:
python train_baseline.py
Train the feature enhancement model (step 2), which loads the step-1 speaker verification and baseline checkpoints. Please set up config.py, e.g.
path = '/data1/lrw_roi_80_116_175_211_npy_gray_pkl_jpeg/'
split_path = '/home/luosongtao/code/LRW_ID-main/Splits/'
random_seed = 251
batch_size = 130
gpus = 2
base_lr = 2e-4 * batch_size/32.0
num_workers = 8
max_epoch = 10
resume_path = None
reg = 0.5
precision = 16
verison = 0
alpha = 0.1
and train with the command:
python train_enhance.py --SD_model_path /home/luosongtao/code/LSHUC/SD_logs/crop_flip_cl_251/version_63/checkpoints/checkpoints-epoch=09-val_loss=0.01.ckpt --baseline_model_path /home/luosongtao/code/LSHUC/Baseline_logs/crop_flip_cl_251/version_1/checkpoints/checkpoints-epoch=37-val_loss=0.57-val_wer=0.13.ckpt
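The two checkpoint paths are the outputs of step 1. A minimal sketch of how such Lightning checkpoints can be restored before continuing training; the class names SpeakerModel and BaselineModel are hypothetical placeholders for whatever LightningModules the repository defines in models/:

```python
import argparse
# Hypothetical class names, used only to illustrate restoring the step-1 checkpoints.
from models.model_SD import SpeakerModel
from models.model_r2plus1d import BaselineModel

parser = argparse.ArgumentParser()
parser.add_argument('--SD_model_path', type=str, required=True)
parser.add_argument('--baseline_model_path', type=str, required=True)
args = parser.parse_args()

# PyTorch Lightning restores weights and hyperparameters from a .ckpt file.
sd_model = SpeakerModel.load_from_checkpoint(args.SD_model_path)
baseline = BaselineModel.load_from_checkpoint(args.baseline_model_path)
```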
Train the ensemble model with the suppression module (step 3), which loads the step-2 enhancement checkpoint. Please set up config.py, e.g.
path = '/data1/lrw_roi_80_116_175_211_npy_gray_pkl_jpeg/'
split_path = '/home/luosongtao/code/LRW_ID-main/Splits/'
random_seed = 251
batch_size = 130
gpus = 2
base_lr = 2e-4 * batch_size/32.0
num_workers = 8
max_epoch = 10
resume_path = None
reg = 0.5
precision = 16
verison = 0
alpha = 0.1
and train with the command:
python train_ensemble.py --enhance_model_path /home/luosongtao/code/LSHUC/Enhance_logs/crop_flip_cl_251/version_1/checkpoints/checkpoints-epoch=09-val_loss=0.55-val_wer=0.12.ckpt
Method | $L^{Enh}_{triple}$ | $L^{Sup}_{triple}$ | $L^{VSR}_{CE}$ | Acc (%) |
---|---|---|---|---|
Baseline | - | - | ✓ | 87.25 |
Ours | ❌ | ❌ | ✓ | 87.73 |
Ours | ✓ | ❌ | ✓ | 87.74 |
Ours | ❌ | ✓ | ✓ | 87.75 |
Ours | ✓ | ✓ | ✓ | 87.91 |
@inproceedings{Luo_2023_BMVC,
author = {Songtao Luo and Shuang Yang and Shiguang Shan and Xilin Chen},
title = {Learning Separable Hidden Unit Contributions for Speaker-Adaptive Visual Speech Recognition},
booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
publisher = {BMVA},
year = {2023},
url = {https://bmvc2022.mpi-inf.mpg.de/BMVC2023/0146.pdf}
}