

  1. Neural Voice Cloning with a Few Samples Sercan O. Arik, Jitong Chen, Kainan Peng*, Wei Ping, Yanqi Zhou

  3. Motivations • Text-to-speech (TTS) models can be conditioned on text and speaker identity. • Text: linguistic information, content of the generated speech. • Speaker identity: speaker information (accent, pitch, speech rate…). • Limitations: • Can only generate speech for observed speakers during training. • Require lots of speech samples per speaker (e.g., Deep Voice 2).
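
Conditioning on text plus a learned speaker embedding can be pictured as an embedding lookup whose result is attached to every text frame. A minimal stand-in sketch (all sizes and tables here are hypothetical, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 training speakers, 128-dim speaker embeddings,
# 30-character vocabulary, 64-dim character embeddings.
n_speakers, emb_dim = 4, 128
vocab, txt_dim = 30, 64

speaker_table = rng.normal(size=(n_speakers, emb_dim))  # one learned vector per speaker
char_table = rng.normal(size=(vocab, txt_dim))          # character embeddings

def condition(text_ids, speaker_id):
    """Concatenate the speaker embedding onto every text frame."""
    txt = char_table[text_ids]                     # (T, txt_dim)
    spk = speaker_table[speaker_id]                # (emb_dim,)
    spk_tiled = np.tile(spk, (len(text_ids), 1))   # (T, emb_dim)
    return np.concatenate([txt, spk_tiled], axis=1)

feats = condition([3, 7, 1], speaker_id=2)
print(feats.shape)  # (3, 192)
```

Because the speaker identity enters only through this table, a model trained this way cannot, by itself, produce a voice that has no row in the table — which is the limitation the next slides address.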

  5. Voice Cloning • Voice cloning: synthesize the voices of new speakers from a few speech samples (few-shot generative model). • Applications: personalized speech interfaces, content creation, assistive technology… • Challenges: • Generalization: learn the voice of a new speaker. • Efficiency: extract the speaker characteristics from a few speech samples. • Computational cost: cloning with low latency and small footprint. • Two approaches: • Speaker adaptation. • Speaker encoding.

  7. Speaker Adaptation • Fine-tune a pre-trained multi-speaker model for a new speaker. • Training data: a few text and audio pairs. • Two options for speaker adaptation: Fine-tune the whole model Fine-tune the speaker embedding only
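
Embedding-only adaptation can be pictured as gradient descent in which everything except the new speaker's embedding is frozen. A toy illustration with a linear stand-in for the pre-trained TTS model (shapes and learning rate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen, pre-trained multi-speaker weights (stand-in for the whole TTS network).
W = rng.normal(size=(8, 4))
# Stand-in target features extracted from the new speaker's few audio samples.
target = rng.normal(size=8)
# The new speaker's embedding: the ONLY trainable parameter in embedding-only mode.
e = np.zeros(4)

def loss(e):
    return float(np.mean((W @ e - target) ** 2))

before = loss(e)
for _ in range(500):
    # Gradient flows only into e; W is never updated.
    grad = 2 * W.T @ (W @ e - target) / len(target)
    e -= 0.05 * grad
assert loss(e) < before
```

In whole-model adaptation, W would be updated too: faster to fit but with millions of parameters stored per speaker, as the next slide quantifies.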

  8. Speaker Adaptation Analysis

     Speaker adaptation approach    Embedding-only   Whole-model
     Cloning time                   8 h              5 min
     # of parameters per speaker    128              25 million

  10. Speaker Encoding • Directly predict a new speaker embedding for a multi-speaker model. • Train a speaker encoder with audio and speaker embedding pairs. • Cloning time: a few seconds, more favorable for low-resource deployment.
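
One way to picture a speaker encoder is a network that maps each audio sample to a fixed-size vector, with the predictions averaged across the available cloning samples. A stand-in sketch (the projection here is a random matrix where the paper trains a network; 80 mel bins and a 128-dim embedding are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "trained" projection from 80 mel bins to a 128-dim embedding.
P = rng.normal(size=(128, 80))

def encode_sample(mel):
    """Stand-in encoder: mean-pool mel frames over time, then project."""
    return P @ mel.mean(axis=0)

# Five cloning samples of varying length, each (frames, 80).
samples = [rng.normal(size=(rng.integers(50, 100), 80)) for _ in range(5)]

# One prediction per sample, averaged into a single speaker embedding.
embedding = np.mean([encode_sample(m) for m in samples], axis=0)
print(embedding.shape)  # (128,)
```

Since this is a single forward pass rather than an optimization loop, cloning takes seconds instead of minutes or hours.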

  11. Results
      • Vocoder: classical Griffin-Lim algorithm.
      • Demo website: http://audiodemos.github.io

      Mean Opinion Score (MOS)   Adaptation         Adaptation      Speaker
                                 (embedding-only)   (whole-model)   Encoding
      Naturalness (5-scale)      2.67               3.16            2.99
      Similarity (4-scale)       2.95               3.16            2.85
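
Griffin-Lim recovers a waveform from a magnitude spectrogram alone by alternating between the STFT domain (where the known magnitudes are re-imposed) and the time domain. A compact NumPy sketch with toy frame sizes (production code would typically use `librosa.griffinlim` or a neural vocoder):

```python
import numpy as np

def stft(x, n=256, hop=64):
    w = np.hanning(n)
    frames = [x[i:i + n] * w for i in range(0, len(x) - n + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(X, n=256, hop=64):
    w = np.hanning(n)
    out = np.zeros(hop * (len(X) - 1) + n)
    norm = np.zeros_like(out)
    for k, F in enumerate(X):
        out[k * hop:k * hop + n] += np.fft.irfft(F, n) * w
        norm[k * hop:k * hop + n] += w ** 2
    return out / np.maximum(norm, 1e-8)  # overlap-add with window normalization

def griffin_lim(mag, iters=50):
    # Start from random phase, then iteratively project: time domain -> STFT
    # domain, keep the estimated phase, re-impose the known magnitudes.
    phase = np.exp(2j * np.pi * np.random.default_rng(0).random(mag.shape))
    for _ in range(iters):
        x = istft(mag * phase)
        phase = np.exp(1j * np.angle(stft(x)))
    return istft(mag * phase)

# Recover a 440 Hz tone (16 kHz, 1 s) from its magnitude spectrogram alone.
t = np.arange(16000) / 16000
x = np.sin(2 * np.pi * 440 * t)
mag = np.abs(stft(x))
y = griffin_lim(mag)
```

Griffin-Lim needs no training, which keeps the pipeline simple, but its phase estimates cap achievable naturalness — one reason the MOS numbers above sit around 3.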

  12. Voice Morphing via Embedding Manipulation • BritishMale + AveragedFemale - AveragedMale = BritishFemale • BritishMale + AveragedAmerican - AveragedBritish = AmericanMale
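
The morphing on this slide is plain vector arithmetic in embedding space. A toy illustration with random stand-in embeddings (real ones would come from speaker adaptation or the speaker encoder; the 128-dim size is an assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical speaker embeddings; "avg_*" denotes an average over a group.
emb = {name: rng.normal(size=128) for name in
       ["british_male", "avg_female", "avg_male", "avg_american", "avg_british"]}

# BritishMale + AveragedFemale - AveragedMale -> gender-swapped British voice.
british_female = emb["british_male"] + emb["avg_female"] - emb["avg_male"]

# BritishMale + AveragedAmerican - AveragedBritish -> accent-swapped male voice.
american_male = emb["british_male"] + emb["avg_american"] - emb["avg_british"]
```

Feeding either result back into the multi-speaker model as the speaker embedding synthesizes the morphed voice, suggesting that gender and accent occupy roughly linear directions in the learned embedding space.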

  13. Thank you! Come to our poster and listen to the samples! Today, Session B, #91

