
Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs

Paper: https://arxiv.org/abs/2309.08146

Abstract: With the huge technological advances introduced by deep learning in audio & speech processing, many novel synthetic speech techniques have achieved incredibly realistic results. As these methods generate realistic fake human voices, they can be used in malicious acts such as people imitation, fake news spreading, spoofing, media manipulation, etc. Hence, the ability to detect synthetic or natural speech has become an urgent necessity. Moreover, being able to tell which algorithm has been used to generate a synthetic speech track can be of preeminent importance to track down the culprit. In this paper, a novel strategy is proposed to attribute a synthetic speech track to the generator that was used to synthesize it. The proposed detector transforms the audio into a log-mel spectrogram, extracts features using a CNN, and classifies it between five known and unknown algorithms, utilizing semi-supervision and ensembling to improve its robustness and generalizability significantly. The proposed detector is validated on two evaluation datasets consisting of a total of 18,000 weakly perturbed (Eval 1) & 10,000 strongly perturbed (Eval 2) synthetic speech tracks. The proposed method outperforms other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1, in the IEEE SP Cup challenge at ICASSP 2022.
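
As a rough illustration of the front end the abstract describes, the sketch below computes a log-mel spectrogram with librosa. It is illustrative only; the parameter values (n_mels, n_fft, hop_length) are assumptions, not the repo's exact settings.

```python
# Hedged sketch of the log-mel front end described in the abstract.
# Parameter values are illustrative assumptions, not the repo's config.
import librosa
import numpy as np

def logmel(path, sr=16000, n_mels=128, n_fft=2048, hop_length=512):
    y, _ = librosa.load(path, sr=sr)  # decode and resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, n_fft=n_fft, hop_length=hop_length)
    return librosa.power_to_db(mel, ref=np.max)  # log scale (dB)

spec = logmel("sample.wav")
print(spec.shape)  # (n_mels, n_frames), fed to a CNN as a 2-D "image"
```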


Results on IEEE SP Cup 2022

Scores of the top 3 teams on the leaderboard of the IEEE Signal Processing Cup 2022.

Strongly Perturbed (Eval 2):

| Method / Metric    | Acc  | Prc  | Rec  | F1   |
|--------------------|------|------|------|------|
| Std. Proc.         | 0.48 | 0.62 | 0.48 | 0.48 |
| Team IITH          | 0.49 | 0.51 | 0.49 | 0.49 |
| Synthesizer (Ours) | 0.61 | 0.71 | 0.61 | 0.63 |

Weakly Perturbed (Eval 1):

| Method / Metric    | Acc  | Prc  | Rec  | F1   |
|--------------------|------|------|------|------|
| Std. Proc.         | 0.97 | 0.97 | 0.96 | 0.97 |
| Team IITH          | 0.96 | 0.96 | 0.95 | 0.96 |
| Synthesizer (Ours) | 0.98 | 0.99 | 0.97 | 0.98 |

How to Run?

Important Notes
  • All audio files must be in .wav format.
  • The sample rate must be 16,000 Hz.
  • For training, batch_size is tuned for 8 x V100 GPUs. If the models are trained on a different device, batch_size needs to be adjusted accordingly using the --batch argument.
  • learning_rate depends on batch_size, so if batch_size is altered, learning_rate needs to be retuned accordingly; a common heuristic is linear scaling (see the sketch after this list).
  • The total number of epochs was determined using cross-validation on the provided training data. If the training data is changed, the epoch count needs to be re-tuned using cross-validation, setting --all-data=0 in train.py.
  • While training, an internet connection is required to download the ImageNet weights for the CNN backbones.
  • To reproduce the results, it is recommended to run the code on the same device configuration.
  • For inference, batch_size is tuned for 8 x V100 GPUs. For any other device, batch_size may need to be modified by changing the following code in predict.py:
```python
# CONFIGURE BATCHSIZE
mx_dim = np.sqrt(np.prod(dim))
if mx_dim >= 768 or any(i in model_name for i in ['convnext', 'ECA_NFNetL2']):
    CFG.batch_size = CFG.replicas * 16
elif mx_dim >= 640 or any(i in model_name for i in ['EfficientNet', 'RegNet', 'ResNetRS50', 'ResNest50']):
    CFG.batch_size = CFG.replicas * 32
else:
    CFG.batch_size = CFG.replicas * 64
```
  • For any queries, please contact awsaf49@gmail.com.
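
Regarding the learning_rate note above: a minimal sketch of the linear-scaling heuristic, assuming a reference batch size of 64 and a reference LR of 1e-3 (both assumptions, not values read from the repo's configs):

```python
# Hedged sketch: linear learning-rate scaling (Goyal et al., 2017).
# BASE_BATCH / BASE_LR are illustrative assumptions; train.py's actual
# defaults and schedule may differ.
BASE_BATCH = 64   # batch size the defaults are assumed to be tuned for
BASE_LR = 1e-3    # assumed reference learning rate

def scaled_lr(batch_size, base_batch=BASE_BATCH, base_lr=BASE_LR):
    """Scale the learning rate linearly with the batch size."""
    return base_lr * batch_size / base_batch

print(scaled_lr(16))  # 0.00025 -> pass via train.py's learning-rate option
```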

Notebooks

To demonstrate training and inference, two notebooks are provided. It is recommended to go through them after reading this README.

  • Inference: to generate predictions on the eval data directly from the provided checkpoints, without any training, see the sp2022-infer-gpu notebook at notebooks/sp2022-infer-gpu.ipynb.
  • Training: to train the models and then run inference with the newly trained weights, see the sp2022-train-gpu notebook at notebooks/sp2022-train-gpu.ipynb.

0. Requirements

Hardware
  • GPU (model or N/A): 8x NVIDIA Tesla V100
  • Memory (GB): 8 x 32GB
  • OS: Amazon Linux
  • CUDA Version : 11.0
  • Driver Version : 450.119.04
  • CPU RAM : 128 GiB
  • DISK : 2 TB

Library

Install the necessary dependencies with the following command:

!pip install -r requirements.txt
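
Before training, it can help to confirm that the deep-learning stack sees the GPUs. A quick check, assuming a TensorFlow environment (which the .h5 checkpoints and replica-based batch sizing suggest; this is an inference, not a documented fact):

```python
# Hedged environment check: assumes a TensorFlow stack.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow {tf.__version__}, {len(gpus)} GPU(s) visible")
for gpu in gpus:
    print(" ", gpu.name)
```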

1. Data Preparation

  • Step 1: The competition data must be in the ./data/ folder, in exactly the same format as it was provided. The SP Cup dataset can be accessed from Kaggle via the link below,

  • Step 2: The external datasets need to be downloaded from the following links and placed in the ./data/ folder:

    1. LJSpeech: link (~2GB)
    2. VCTK: link (~3GB)
    3. LibriSpeech: link (~15GB)
    4. Synthetic: link (~5GB)

Note: All the datasets were pre-processed to have the same sample_rate = 16k and file_format = .wav.
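
If you bring your own audio, it has to match that preprocessing. A minimal conversion sketch, assuming librosa and soundfile are installed (they may not be in requirements.txt):

```python
# Hedged sketch: convert an arbitrary audio file to 16 kHz mono .wav,
# matching the sample_rate / file_format the pipeline expects.
import librosa
import soundfile as sf

def to_16k_wav(src, dst, sr=16000):
    y, _ = librosa.load(src, sr=sr, mono=True)  # decode + resample
    sf.write(dst, y, sr)                        # write PCM .wav

to_16k_wav("clip.flac", "clip.wav")
```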

Data Path Format

Datasets are expected to follow the structure below. To use a custom directory, PATHS.json needs to be modified.

Path Structure
├── data
│   ├── sp22-synthetic-dataset
│   ├── librispeech-small-dataset
│   ├── ljspeech-sr16k-dataset
│   ├── vctk-sr16k-dataset
│   ├── spcup_2022_training_part1
│   │   └── spcup_2022_training_part1
│   ├── spcup_2022_unseen
│   │   └── spcup_2022_unseen
│   ├── spcup_2022_eval_part1
│   │   └── spcup_2022_eval_part1
│   ├── spcup_2022_eval_part2
│   │   └── spcup_2022_eval_part2

2. Supervised Training

The competition & external data and their associated labels are used for supervised training. All external data is treated as the Unknown algorithm class, as the sketch below illustrates.
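
A hedged illustration of that labeling scheme; the class id and helper function are hypothetical, not the repo's actual encoding:

```python
# Hypothetical illustration: competition tracks keep their generator
# label, while all external real-speech tracks share one "unknown" class.
# The id below is an assumption (5 known generators assumed at ids 0..4).
UNKNOWN_CLASS = 5

def label_for(track_path, competition_label=None):
    """Competition audio keeps its label; external audio -> unknown."""
    return competition_label if competition_label is not None else UNKNOWN_CLASS

print(label_for("ljspeech/LJ001-0001.wav"))  # 5 (unknown)
print(label_for("spcup/track_42.wav", 2))    # 2 (known generator)
```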

Part-1

To train models for the eval_part1 data, run the following commands:

Code
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/supervised/part1\
--model=EfficientNetB0\
--batch=64\
--epochs=11
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/supervised/part1\
--model=ResNet50D\
--batch=64\
--epochs=9
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/supervised/part1\
--model=ResNetRS50\
--batch=32\
--epochs=13
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/supervised/part1\
--model=ResNest50\
--batch=32\
--epochs=21
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/supervised/part1\
--model=RegNetZD8\
--batch=64\
--epochs=8
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/supervised/part1\
--model=EfficientNetV2S\
--pretrain=imagenet21k\
--batch=32\
--epochs=25

Part-2

To train models for the eval_part2 data, run the following commands:

Code
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/supervised/part2\
--model=ECA_NFNetL2\
--batch=16\
--epochs=12
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/supervised/part2\
--model=convnext_base_in22k\
--batch=32\
--epochs=14
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/supervised/part2\
--model=ResNetRS152\
--batch=32\
--epochs=11
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/supervised/part2\
--model=convnext_large_in22k\
--batch=16\
--epochs=15
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/supervised/part2\
--model=RegNetZD8\
--batch=32\
--epochs=5
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/supervised/part2\
--model=EfficientNetB0\
--batch=64\
--epochs=13
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/supervised/part2\
--model=EfficientNetV2M\
--pretrain=imagenet21k\
--batch=32\
--epochs=15

3. Generate Semi-Supervised Labels

Run the following command to generate semi-supervised labels for both the eval_part1 & eval_part2 data. The labels will be saved to output/supervised/pseudo/pred.csv.

Code
!python generate_pseudo.py\
--part1-model-dir=output/supervised/part1\
--part1-infer-path=data/spcup_2022_eval_part1/spcup_2022_eval_part1\
--part2-model-dir=output/supervised/part2\
--part2-infer-path=data/spcup_2022_eval_part2/spcup_2022_eval_part2\
--output=output/supervised/pseudo/pred.csv
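
To sanity-check the generated pseudo-labels before the next stage, the CSV can be inspected with pandas. Its column names are not documented here, so this sketch uses only positional access:

```python
# Hedged sketch: quick look at the generated pseudo-label file.
import pandas as pd

pseudo = pd.read_csv("output/supervised/pseudo/pred.csv")
print(pseudo.shape)                       # rows = eval tracks
print(pseudo.head())
print(pseudo.iloc[:, -1].value_counts())  # class balance of pseudo labels
```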

4. Semi-Supervised Training

In this stage, the competition & external data are used together with the eval_part1 & eval_part2 data. For the eval data, the semi-supervised labels generated in the previous stage are used.

Part-1

To train models for the eval_part1 data, run the following commands:

Code
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/semi-supervised/part1\
--model=EfficientNetB0\
--batch=64\
--epochs=15\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/semi-supervised/part1\
--model=ResNet50D\
--batch=64\
--epochs=18\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/semi-supervised/part1\
--model=ResNetRS50\
--batch=32\
--epochs=17\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/semi-supervised/part1\
--model=ResNest50\
--batch=32\
--epochs=16\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/semi-supervised/part1\
--model=RegNetZD8\
--batch=64\
--epochs=12\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part1.yaml\
--output-dir=output/semi-supervised/part1\
--model=EfficientNetV2S\
--pretrain=imagenet21k\
--batch=64\
--epochs=7\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv

Part-2

To train models for the eval_part2 data, run the following commands:

Code
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/semi-supervised/part2\
--model=ECA_NFNetL2\
--batch=16\
--epochs=11\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/semi-supervised/part2\
--model=convnext_base_in22k\
--batch=16\
--epochs=6\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/semi-supervised/part2\
--model=ResNetRS152\
--batch=32\
--epochs=16\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/semi-supervised/part2\
--model=convnext_large_in22k\
--batch=16\
--epochs=10\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/semi-supervised/part2\
--model=RegNetZD8\
--batch=32\
--epochs=8\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/semi-supervised/part2\
--model=EfficientNetB0\
--batch=32\
--epochs=10\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv
!python3 train.py\
--cfg ./configs/sp22-part2.yaml\
--output-dir=output/semi-supervised/part2\
--model=EfficientNetV2M\
--pretrain=imagenet21k\
--batch=32\
--epochs=25\
--pseudo 1\
--pseudo_csv=output/supervised/pseudo/pred.csv

5. Prediction with Trained Models

To predict on the eval_data using the newly trained models, use the following commands.

Part-1

To generate predictions for the eval_part1 data using the newly trained checkpoints, run the following command:

Code
!python predict.py\
--cfg ./configs/sp22-part1.yaml\
--model-dir=output/semi-supervised/part1\
--infer-path=data/spcup_2022_eval_part1/spcup_2022_eval_part1\
--output=output/result/pred_part1.csv

Part-2

To generate predictions for the eval_part2 data using the newly trained checkpoints, run the following command:

Code
!python predict.py\
--cfg ./configs/sp22-part2.yaml\
--model-dir=output/semi-supervised/part2\
--infer-path=data/spcup_2022_eval_part2/spcup_2022_eval_part2\
--output=output/result/pred_part2.csv
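
predict.py ensembles all checkpoints found under --model-dir (six models for part 1, seven for part 2). Conceptually, such an ensemble averages the per-model class probabilities before taking the argmax; a hedged sketch of that idea, not the repo's exact code:

```python
# Hedged sketch of probability-averaging ensembling. The real pipeline
# in predict.py also handles per-model input sizes and batch sizes.
import numpy as np

def ensemble(prob_list):
    """prob_list: one (n_tracks, n_classes) softmax array per model."""
    mean_probs = np.mean(prob_list, axis=0)  # average over models
    return mean_probs.argmax(axis=1)         # final class per track

probs_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.5, 0.4]])
probs_b = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
print(ensemble([probs_a, probs_b]))          # [0 2]
```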

6. Prediction with Pre-Trained Checkpoints

To generate predictions on the eval_data directly using the pre-trained checkpoints, first download the checkpoints from the following links:

  • part-1: link (~1GB)
  • part-2: link (~6GB)

Extract the .zip files and place the part1 files in the ./checkpoints/part1 folder and the part2 files in the ./checkpoints/part2 folder (a small extraction sketch follows the listings). The final file structure will look like this:

Part-1 Structure (6 Models)
./checkpoints/part1
├── EfficientNetB0-128x384
│   └── ckpt
│       └── model.h5
├── EfficientNetV2S-128x384
│   └── ckpt
│       └── model.h5
├── RegNetZD8-128x384
│   └── ckpt
│       └── model.h5
├── ResNest50-128x384
│   └── ckpt
│       └── model.h5
├── ResNet50D-128x384
│   └── ckpt
│       └── model.h5
└── ResNetRS50-128x384
    └── ckpt
        └── model.h5
Part-2 Structure (7 Models)
./checkpoints/part2
├── ECA_NFNetL2-256x512
│   └── ckpt
│       └── model.h5
├── EfficientNetB0-256x512
│   └── ckpt
│       └── model.h5
├── EfficientNetV2M-256x512
│   └── ckpt
│       └── model.h5
├── RegNetZD8-256x512
│   └── ckpt
│       └── model.h5
├── ResNetRS152-256x512
│   └── ckpt
│       └── model.h5
├── convnext_base_in22k-256x512
│   └── ckpt
│       └── model.h5
└── convnext_large_in22k-256x512
    └── ckpt
        └── model.h5

Then use the following commands to generate predictions for the eval_data.

Part-1

To generate predictions for the eval_part1 data using the provided checkpoints, run the following command:

Code
!python predict.py\
--cfg ./configs/sp22-part1.yaml\
--model-dir=checkpoints/part1\
--infer-path=data/spcup_2022_eval_part1/spcup_2022_eval_part1\
--output=output/result/pred_part1.csv

Part-2

To generate predictions for the eval_part2 data using the provided checkpoints, run the following command:

Code
!python predict.py\
--cfg ./configs/sp22-part2.yaml\
--model-dir=checkpoints/part2\
--infer-path=data/spcup_2022_eval_part2/spcup_2022_eval_part2\
--output=output/result/pred_part2.csv

Citation

@article{rahman2023syn,
  title={Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs},
  author={Rahman, Md Awsafur and Paul, Bishmoy and Sarker, Najibul Haque and Hakim, Zaber Ibn Abdul and Fattah, Shaikh Anowarul and Saquib, Mohammad},
  journal={arXiv preprint arXiv:2309.08146},
  year={2023}
}

Acknowledgement

The authors thank the IEEE Signal Processing Society, ISPL at Politecnico di Milano (Italy), and MISL at Drexel University (USA) for hosting the IEEE SP Cup at ICASSP 2022, which inspired this work.
