

SLIDE 1

Neural Voice Cloning with a Few Samples

Sercan O. Arik, Jitong Chen, Kainan Peng*, Wei Ping, Yanqi Zhou

SLIDE 2

Motivations

  • Text-to-speech (TTS) models can be conditioned on text and speaker identity.
  • Text: linguistic information, content of the generated speech.
  • Speaker identity: speaker information (accent, pitch, speech rate…).
SLIDE 3

Motivations

  • Limitations:
  • Can only generate speech for observed speakers during training.
  • Require lots of speech samples per speaker (e.g., Deep Voice 2).
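The speaker-identity conditioning above can be sketched as an embedding lookup. The snippet below is a toy illustration (the names, dimensions, and values are made up, not the paper's model) of why such a model cannot generate speech for speakers unseen during training:

```python
SPEAKERS = ["spk0", "spk1"]
EMB_DIM = 4

# One trainable embedding vector per speaker observed during training.
speaker_embeddings = {s: [0.1 * i] * EMB_DIM for i, s in enumerate(SPEAKERS)}

def synthesize(text_features, speaker_id):
    """Condition generation on text features plus the speaker's embedding."""
    emb = speaker_embeddings[speaker_id]  # lookup fails for unseen speakers
    # stand-in for the real decoder: just mix the two conditioning signals
    return [t + e for t, e in zip(text_features, emb)]

print(synthesize([1.0, 1.0, 1.0, 1.0], "spk1"))   # -> [1.1, 1.1, 1.1, 1.1]
# synthesize(..., "new_speaker") would raise KeyError: the model has no
# embedding for a speaker it never saw -- the limitation voice cloning targets.
```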
SLIDE 4

Voice Cloning

  • Voice cloning: synthesize the voices of new speakers from a few speech samples (a few-shot generative model).
  • Applications: personalized speech interfaces, content creation, assistive technology…

SLIDE 5

Voice Cloning


  • Challenges:
  • Generalization: learn the voice of a new speaker.
  • Efficiency: extract the speaker characteristics from a few speech samples.
  • Computational cost: cloning with low latency and small footprint.
  • Two approaches:
  • Speaker adaptation.
  • Speaker encoding.
SLIDE 6

Speaker Adaptation

  • Fine-tune a pre-trained multi-speaker model for a new speaker.
  • Training data: a few text and audio pairs.
SLIDE 7
Speaker Adaptation

  • Two options for speaker adaptation:
  • Fine-tune the whole model.
  • Fine-tune the speaker embedding only.
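The two adaptation options can be sketched as a single toy SGD step in which only a chosen subset of parameter groups receives updates. Everything below (`sgd_step`, the parameter names, the numbers) is an illustrative assumption, not the paper's training code:

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """One gradient step that updates only the groups named in `trainable`."""
    return {
        name: ([v - lr * g for v, g in zip(vals, grads[name])]
               if name in trainable else vals)
        for name, vals in params.items()
    }

params = {
    "speaker_embedding": [0.0, 0.0],  # one small vector per new speaker
    "decoder_weights":   [1.0, 1.0],  # stands in for the ~25M other params
}
grads = {"speaker_embedding": [1.0, -1.0], "decoder_weights": [0.5, 0.5]}

# Embedding-only adaptation: everything but the embedding stays frozen.
emb_only = sgd_step(params, grads, trainable={"speaker_embedding"})
# Whole-model adaptation: every parameter group is fine-tuned.
whole = sgd_step(params, grads, trainable=set(params))

print(emb_only["decoder_weights"])  # frozen -> [1.0, 1.0]
print(whole["decoder_weights"])     # updated -> [0.95, 0.95]
```

Freezing all but the embedding keeps the per-speaker footprint tiny (here 2 values, 128 in the slide's table), while whole-model fine-tuning touches every parameter.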
SLIDE 8

Speaker Adaptation Analysis

Speaker Adaptation              Embedding-only    Whole-model
Cloning time                    8 h               5 min
# of parameters per speaker     128               25 million

SLIDE 9

Speaker Encoding

  • Directly predict a new speaker embedding for a multi-speaker model.
  • Train a speaker encoder with audio and speaker embedding pairs.
SLIDE 10

Speaker Encoding

  • Cloning time: a few seconds, more favorable for low-resource deployment.
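A minimal sketch of the idea, assuming a far simpler encoder than the paper's (the mean-pooling plus linear projection below is a hypothetical stand-in, not the paper's attention-based architecture): pool per-frame audio features over time, then project the pooled vector to a speaker embedding of the multi-speaker model's size.

```python
def mean_pool(frames):
    """Average per-frame audio features over time -> fixed-size vector."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]

def speaker_encoder(frames, weights):
    """Linear projection of the pooled features to a speaker embedding."""
    pooled = mean_pool(frames)
    return [sum(w * x for w, x in zip(row, pooled)) for row in weights]

# 3 frames of 2-dim features, projected to a 2-dim embedding.
frames = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity projection for illustration
print(speaker_encoder(frames, weights))  # -> [2.0, 2.0]
```

Because cloning is a single forward pass rather than an optimization loop, it takes seconds instead of minutes or hours.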
SLIDE 11

Results

  • Vocoder: classical Griffin-Lim algorithm.
  • Demo website: http://audiodemos.github.io

Mean Opinion Score (MOS)     Speaker Adaptation                Speaker Encoding
                             Embedding-only    Whole-model
Naturalness (5-scale)        2.67              3.16            2.99
Similarity (4-scale)         2.95              3.16            2.85

SLIDE 12

Voice Morphing via Embedding Manipulation

  • BritishMale + AveragedFemale - AveragedMale = BritishFemale
  • BritishMale + AveragedAmerican - AveragedBritish = AmericanMale
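The embedding arithmetic above can be sketched directly as element-wise vector operations. The vectors below are made-up 3-dimensional toy values (real speaker embeddings are learned, e.g. 128-dimensional):

```python
def morph(a, b, c):
    """Element-wise a + b - c, e.g. BritishMale + AvgFemale - AvgMale."""
    return [x + y - z for x, y, z in zip(a, b, c)]

british_male = [0.5, 0.25, 1.0]   # illustrative values only
avg_female   = [0.125, 0.75, 0.25]
avg_male     = [0.125, 0.25, 0.25]

# Adding the female-minus-male direction flips gender, keeps the accent.
british_female = morph(british_male, avg_female, avg_male)
print(british_female)  # -> [0.5, 0.75, 1.0]
```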
SLIDE 13

Thank you!

Come visit our poster and listen to the samples! Today: Session B, #91.