SpeechT5-Non-English-TTS

Fine-tune SpeechT5 for non-English text-to-speech tasks, implemented in PyTorch.

[Figure: SpeechT5 framework]

This repository contains code and resources for fine-tuning (or training) a SpeechT5 model for text-to-speech in a non-English language. The project uses Hugging Face's transformers library and speechbrain to load the necessary models and tools. The rest of the code, including data preprocessing and the training and evaluation functions, is implemented directly in PyTorch, so feel free to change whatever you need to train your model efficiently.


Project Overview

The main objective of this project is to fine-tune the SpeechT5 model for text-to-speech on a non-English language. The steps include:

  1. Setting up the environment.
  2. Loading necessary tools (tokenizer and feature extractor) and models (SpeechT5 itself, a model to generate X-vector speaker embeddings, and the vocoder).
  3. Most importantly, adding the unique characters of the target language to the tokenizer and modifying the model's input embedding matrix accordingly.
  4. Loading and preprocessing your data.
  5. Training and evaluating the model.
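
Step 3 can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `find_missing_chars` and `extend_tokenizer_and_model` are hypothetical helper names, and the sketch assumes the tokenizer and model expose the generic transformers methods `add_tokens` and `resize_token_embeddings`. Skipping the space character is also an assumption, on the expectation that the tokenizer already handles word boundaries.

```python
def find_missing_chars(texts, vocab):
    """Return the sorted characters that occur in `texts` but not in `vocab`."""
    dataset_chars = set()
    for text in texts:
        dataset_chars.update(text)
    # Assumption: spaces are already covered by the tokenizer, so skip them.
    dataset_chars.discard(" ")
    return sorted(dataset_chars - set(vocab))


def extend_tokenizer_and_model(tokenizer, model, new_chars):
    """Add `new_chars` as tokens, then resize the input embeddings to match.

    Assumes `tokenizer.add_tokens` returns the number of tokens actually
    added and that `len(tokenizer)` reflects the enlarged vocabulary.
    """
    num_added = tokenizer.add_tokens(new_chars)
    if num_added > 0:
        # New embedding rows are freshly initialized and learned in fine-tuning.
        model.resize_token_embeddings(len(tokenizer))
    return num_added
```

With a loaded tokenizer and model, usage might look like `extend_tokenizer_and_model(tokenizer, model, find_missing_chars(train_texts, tokenizer.get_vocab()))`, where `train_texts` is a placeholder for the transcripts in your dataset. Normalizing the text before this step (for example, lowercasing or stripping punctuation you do not want spoken) keeps the number of added tokens small.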

Generated Samples

Here are some generated samples from the model that I trained on the Persian Common Voice dataset.

Sample 1

1.mp4

Sample 2

2.mp4

Sample 3

3.mp4

Sample 4

4.mp4

Sample 5

5.mp4

References

This code draws on the Hugging Face Audio Course:
https://huggingface.co/learn/audio-course/en/chapter6/fine-tuning