Do you have any ideas or code changes to guide the generated speech style? For example, having the appropriate emotion if we are reading a news story about a tragedy.
I understand there are a few Tacotron projects that achieve this, but their methods often lead to degraded voice quality (in my opinion).
One crazy idea that is easy to try, but probably won't work, is to train on a new dataset and embed the emotion into the generated sequence encoding.
We tried something similar to gain more control over the synthesised speech in one of our follow-up works, which has already been submitted.
The key idea is that you can have an additional encoder process the extra information (e.g. an emotion label) into some controllable space. The concatenation of this additional encoder's output with the text encoder's output then forms the input sequence of states to the neural HMM. To provide stronger conditioning, we also concatenated the additional information into the output net of the system. This worked and gave us control over the feature space.
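A minimal PyTorch sketch of that conditioning scheme, assuming a discrete emotion label as the additional information. The module and dimension choices here (`ConditionedEncoder`, `emo_dim`, `n_emotions`) are hypothetical illustrations, not the actual code from the submitted work:

```python
import torch
import torch.nn as nn

class ConditionedEncoder(nn.Module):
    """Concatenates an additional (emotion) encoding onto text encoder states."""

    def __init__(self, emo_dim=16, n_emotions=5):
        super().__init__()
        # Additional encoder: here simply an embedding table that maps a
        # discrete emotion label into a small controllable space.
        self.emotion_embedding = nn.Embedding(n_emotions, emo_dim)

    def forward(self, text_states, emotion_id):
        # text_states: (batch, seq_len, text_dim) from the text encoder
        # emotion_id:  (batch,) integer emotion labels
        emo = self.emotion_embedding(emotion_id)              # (batch, emo_dim)
        emo = emo.unsqueeze(1).expand(-1, text_states.size(1), -1)
        # Concatenate along the feature axis; the result would serve as the
        # input sequence of states to the neural HMM.
        return torch.cat([text_states, emo], dim=-1)          # (batch, seq_len, text_dim + emo_dim)

states = ConditionedEncoder()(torch.randn(2, 10, 256), torch.tensor([0, 3]))
print(states.shape)  # torch.Size([2, 10, 272])
```

For the stronger conditioning mentioned above, the same `emo` vector would additionally be concatenated to the input of the output net.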
Hope this helps! Feel free to ping in case you have further questions :D
Also, PS: in the coming days we are releasing an upgraded model that performs better than the baseline systems, including Neural-HMM, Tacotron 2 with Postnet, and Glow-TTS, in terms of clarity and naturalness.
We have released OverFlow. With OverFlow, not only do we get better naturalness and more accurate pronunciations, but we also show speaker adaptation in a low-resource setting by simply fine-tuning the model on a much smaller dataset.
Hopefully, it will be useful for your use case as well.
This is really impressive work!