[Mentioned in Session] Interesting Voice Cloning and Speech Synthesis Papers

Will · 17 March 2021 21:31

I mentioned some interesting voice cloning/synthesis research papers/demos during the session.

I struggled to find the paper initially, so here are some scribblings from meandering through my notes and searches on the way to it. The paper I mentioned is the one where UK accents appear to be close to each other: voices generated from a Scottish-accented speaker came out with English accents in some of the results.

This one has some similarities, but is definitely not the one I’m thinking of:

Real Time Voice Cloning

Looking through some of the papers, there’s a chance I’ve conflated the “distance” between generated voices with another paper, but hopefully I can find the one I’m thinking of as it was quite impressive & contained multiple accents.

This is a rundown of some papers from 2019: https://heartbeat.fritz.ai/a-2019-guide-to-speech-synthesis-with-deep-learning-630afcafb9dd
Speech Synthesis | Papers With Code
DeepVoice (Baidu) version comparisons: https://multi-speaker-clarinet-demo.github.io/
Resemblyzer: resemble-ai/Resemblyzer on GitHub, a Python package to analyze and compare voices with deep learning
Realtime Voice Cloning (shows clustering of embeddings): CorentinJ/Real-Time-Voice-Cloning on GitHub, "Clone a voice in 5 seconds to generate arbitrary speech in real-time"; video: Real-Time Voice Cloning Toolbox on YouTube
Style Transfer: dipjyoti92/TTS-Style-Transfer on GitHub, the official PyTorch implementation of TTS Style Transfer

Found it!
It’s from the Tacotron team at Google:
https://google.github.io/tacotron/publications/speaker_adaptation/

Notice how the speech generated from the embedding of the speaker named “VCTK p260” (whose original voice has a Scottish accent) has both English and Scottish accents across the generated voices, but no US accents.
I’m now not sure whether there was a plot or analysis from the authors, or whether I just reasoned that the “voice style/identity” embedding must have shorter distances between Scottish>English vs. Scottish>American to get that result.
That team has some interesting stuff: Audio samples from "Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning"
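
To make the "shorter distance" reasoning about the p260 embedding concrete, here is a minimal sketch of how one could check it, assuming you already have speaker embeddings as plain lists of floats (for example from a speaker encoder such as Resemblyzer). The function names, the accent grouping and the toy vectors are all my own made-up illustrations, not anything from the Tacotron paper, and the numbers say nothing about real accents.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank_groups_by_distance(query, groups):
    """Return (group name, distance to group centroid) pairs, closest first."""
    ranked = [
        (name, cosine_distance(query, centroid(vectors)))
        for name, vectors in groups.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1])


# Hypothetical 3-d "embeddings", invented purely so the sketch runs.
# Real speaker embeddings would have hundreds of dimensions.
toy_groups = {
    "English": [[0.9, 0.3, 0.1], [0.8, 0.4, 0.0]],
    "American": [[0.1, 0.2, 0.9], [0.0, 0.3, 0.8]],
}
toy_scottish_speaker = [0.7, 0.6, 0.1]

ranking = rank_groups_by_distance(toy_scottish_speaker, toy_groups)
for name, distance in ranking:
    print(f"{name}: {distance:.3f}")
```

With real embeddings, a consistently smaller Scottish-to-English distance than Scottish-to-American would support the reasoning above. It would not prove it, since the embedding also encodes things other than accent (pitch, recording conditions and so on).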