DNN Based TTS Systems - PowerPoint PPT Presentation


SLIDE 1

DNN Based TTS Systems

SLIDE 2

TTS Architecture: Traditional Pipeline

  • Typical statistical parametric TTS contains several modules, including:
  • a text frontend extracting various linguistic features
  • a duration model
  • an acoustic feature prediction model
  • and a complex signal-processing-based vocoder
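The four modules above can be sketched as a chain of functions, each consuming the previous one's output. This is a minimal, hypothetical sketch: the function names, feature sizes, and frame shift are illustrative stand-ins, not taken from any specific system.

```python
# Hypothetical sketch of the traditional statistical parametric TTS pipeline.
# All names and numbers are illustrative; real systems use trained models.

def text_frontend(text):
    # Extract linguistic features (here: a trivial character-level token list).
    return list(text.lower())

def duration_model(features):
    # Predict a duration, in frames, for each linguistic unit.
    return [5 for _ in features]

def acoustic_model(features, durations):
    # Predict acoustic feature frames (e.g. 80-dim mel features); zeros as stand-ins.
    n_frames = sum(durations)
    return [[0.0] * 80 for _ in range(n_frames)]

def vocoder(acoustic_frames, frame_shift=200):
    # A signal-processing vocoder turns acoustic frames into waveform samples.
    return [0.0] * (len(acoustic_frames) * frame_shift)

def synthesize(text):
    feats = text_frontend(text)
    durs = duration_model(feats)
    frames = acoustic_model(feats, durs)
    return vocoder(frames)

wav = synthesize("hello")
print(len(wav))  # 5 chars * 5 frames * 200 samples = 5000
```

The point of the sketch is the error-compounding structure: each stage consumes the previous stage's (possibly wrong) output.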
SLIDE 3

End-to-End Approach

  • There are many advantages of an integrated end-to-end TTS system that can be trained on <text, audio> pairs:
  • First, such a system alleviates the need for laborious feature engineering, which may involve heuristics and brittle design choices.
  • Second, it more easily allows rich conditioning on various attributes, such as speaker or language, or on high-level features like sentiment.
  • Similarly, adaptation to new data might also be easier.
  • Finally, a single model is likely to be more robust than a multi-stage model where each component's errors can compound.
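"End-to-end" means a single model optimized with one loss directly on <text, audio> pairs, instead of separately trained stages. The toy below makes that concrete under heavy simplification: a single linear map stands in for the whole network, and a bag-of-characters vector stands in for a learned text encoder. Everything here is an illustrative assumption.

```python
import numpy as np

# Toy "end-to-end" model: one linear map from a text encoding straight to
# audio frames, trained with a single MSE loss on <text, audio> pairs.
# Purely illustrative; real systems use seq2seq networks, not one matrix.

rng = np.random.default_rng(0)
vocab, n_frames = 8, 4

def encode(text):
    # Bag-of-characters encoding; a stand-in for a learned text encoder.
    v = np.zeros(vocab)
    for ch in text:
        v[ord(ch) % vocab] += 1.0
    return v

W = rng.normal(scale=0.1, size=(vocab, n_frames))
pairs = [(t, rng.normal(size=n_frames)) for t in ["ab", "cd", "ee", "fg"]]

for _ in range(200):  # plain gradient descent on the single end-to-end loss
    for text, audio in pairs:
        x = encode(text)
        grad = np.outer(x, x @ W - audio)  # d(MSE)/dW for this pair
        W -= 0.01 * grad

loss = np.mean([(encode(t) @ W - a) ** 2 for t, a in pairs])
```

The contrast with the pipeline view is that there is no intermediate target (durations, acoustic features) to label or engineer: only the final pairing supervises the model.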

SLIDE 4

Challenges

  • TTS is a large-scale inverse problem: a highly compressed source (text) is "decompressed" into audio. Since the same text can correspond to different pronunciations or speaking styles, this is a particularly difficult learning task for an end-to-end model: it must cope with large variations at the signal level for a given input.
  • Unlike end-to-end speech recognition or machine translation, TTS outputs are continuous, and output sequences are usually much longer than those of the input. These attributes cause prediction errors to accumulate quickly.
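A quick back-of-the-envelope calculation shows the length mismatch. The speaking rate, sample rate, and hop size below are assumed typical values, not figures from any paper.

```python
# Rough arithmetic (assumed typical values) for why TTS is "decompression":
# the output sequence is orders of magnitude longer than the input.

chars_per_sec = 15        # assumed speaking rate, characters per second
sample_rate = 24000       # audio samples per second
frame_shift_ms = 12.5     # a common mel-spectrogram hop size

samples_per_char = sample_rate / chars_per_sec
frames_per_sec = 1000 / frame_shift_ms
print(samples_per_char)   # 1600.0 audio samples per input character
print(frames_per_sec)     # 80.0 spectrogram frames per second
```

Predicting spectrogram frames (80 per second) instead of raw samples (24000 per second) is one common way models shorten the output sequence they must generate autoregressively.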

SLIDE 5

Tacotron

  • An end-to-end generative TTS model based on the sequence-to-sequence (seq2seq) with attention paradigm.
  • Tacotron takes characters as input and outputs a raw spectrogram.
  • It does not require phoneme-level alignment, so it can easily scale to large amounts of acoustic data with transcripts.
  • With a simple waveform synthesis technique, Tacotron achieves a 3.82 mean opinion score (MOS) on a US English evaluation set, outperforming a production parametric system in naturalness.
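The "simple waveform synthesis technique" Tacotron used is Griffin-Lim: iteratively estimating a phase that is consistent with a given magnitude spectrogram. A minimal sketch using SciPy's STFT, with illustrative parameter values:

```python
import numpy as np
from scipy.signal import stft, istft

# Minimal Griffin-Lim sketch: recover a waveform from a magnitude
# spectrogram by alternating ISTFT and STFT, keeping the magnitude fixed
# and updating only the phase. Window parameters are illustrative.

def griffin_lim(mag, n_iter=32, nperseg=512, noverlap=384, fs=16000):
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, wav = istft(mag * phase, fs=fs, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(wav, fs=fs, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(spec[:, : mag.shape[1]]))
    _, wav = istft(mag * phase, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return wav

# Round trip: take the magnitude of a real signal, then recover a waveform.
t = np.linspace(0, 1, 16000, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)
_, _, S = stft(x, fs=16000, nperseg=512, noverlap=384)
y = griffin_lim(np.abs(S))
```

Griffin-Lim is cheap but produces audible artifacts, which is why later systems (next slide) replace it with a neural vocoder.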

SLIDE 6

Tacotron II

  • Entirely neural
  • Uses WaveNet as the vocoder
  • Achieves a MOS of 4.53, comparable to the MOS of 4.58 for professionally recorded speech.

SLIDE 7

Tacotron II Architecture

SLIDE 8

Samples

Sample sentences, each synthesized by both Tacotron II and Tacotron I (the slide's audio players are not recoverable in text):

  • Generative adversarial network or variational auto-encoder.
  • He has read the whole thing. / He reads books.
  • Thisss isrealy awhsome. (intentionally misspelled input)
  • This is your personal assistant, Google Home. / This is your personal assistant Google Home.
  • The buses aren't the problem, they actually provide a solution. / The buses aren't the PROBLEM, they actually provide a SOLUTION.
  • The quick brown fox jumps over the lazy dog. / Does the quick brown fox jump over the lazy dog?

SLIDE 9

End-to-End Tacotron II Samples (Persian)

Persian sample sentences, in English translation:

  • Hello, I am Hamed Shadbash.
  • Iran is bordered to the north by the Republic of Azerbaijan, Armenia, and Turkmenistan, to the east by Afghanistan and Pakistan, and to the west by Turkey and Iraq; it also borders the Caspian Sea to the north and the Gulf to the south.
  • Apparently all the problems have been solved, and until Caesar's arrival only one official announcement remains.

SLIDE 10

Vocoders

  • WaveNet
  • Parallel WaveNet
  • WaveGlow
  • MelGAN
  • WaveRNN
  • LPCNet
  • Etc.
SLIDE 11

WaveNet

  • Fully convolutional, autoregressive model
  • Fast to train but slow at inference time
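WaveNet's key ingredient is the stack of causal convolutions whose dilation doubles per layer, so the receptive field grows exponentially with depth while each layer stays cheap. A numpy sketch (kernel size 2, weights and layer count illustrative):

```python
import numpy as np

# Sketch of WaveNet's core idea: causal dilated convolutions. y[t] depends
# only on x[t] and x[t - dilation], never on future samples.

def causal_dilated_conv(x, w, dilation):
    padded = np.concatenate([np.zeros(dilation), x])
    return w[0] * padded[: len(x)] + w[1] * x  # w[0]*x[t-d] + w[1]*x[t]

def receptive_field(n_layers, kernel_size=2):
    # Dilations 1, 2, 4, ... double each layer, so the field grows as 2^L.
    return sum((kernel_size - 1) * 2 ** l for l in range(n_layers)) + 1

print(receptive_field(10))  # 1024 samples from only 10 layers
```

Causality is also why inference is slow: each output sample must be generated one at a time, conditioned on the previously generated samples.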
SLIDE 12

WaveGlow

  • WaveGlow combines insights from Glow and WaveNet.
  • Produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU.
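The flow structure WaveGlow inherits from Glow rests on affine coupling layers, which are exactly invertible no matter what network parameterizes them; this is what lets the model be trained by maximum likelihood and sampled in one parallel pass. A minimal numpy sketch with a toy stand-in network:

```python
import numpy as np

# Sketch of an affine coupling layer (Glow/WaveGlow): half the channels pass
# through unchanged and parameterize an invertible affine transform of the
# other half, so the layer can be run exactly in reverse.

def coupling_forward(x, net):
    xa, xb = np.split(x, 2)
    log_s, t = net(xa)
    return np.concatenate([xa, xb * np.exp(log_s) + t])

def coupling_inverse(z, net):
    za, zb = np.split(z, 2)
    log_s, t = net(za)
    return np.concatenate([za, (zb - t) * np.exp(-log_s)])

# Toy stand-in "network": invertibility of the layer does not require
# inverting the network itself, so any function of xa works here.
def toy_net(xa):
    return np.tanh(xa), xa ** 2

x = np.array([0.5, -1.0, 2.0, 3.0])
z = coupling_forward(x, toy_net)
x_rec = coupling_inverse(z, toy_net)
```

At synthesis time the flow is run in the inverse direction, mapping Gaussian noise (conditioned on the mel spectrogram) to audio in parallel, which is where the 500 kHz throughput comes from.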

SLIDE 13

Tacotron 2 + WaveGlow Samples

SLIDE 14

MelGAN

  • A non-autoregressive, feed-forward convolutional architecture that performs audio waveform generation in a GAN setup.
  • MelGAN is substantially faster than other mel-spectrogram inversion alternatives. In particular, it is 10 times faster than the fastest available model to date without considerable degradation in audio quality.
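"Non-autoregressive" means the generator upsamples the whole mel-spectrogram sequence to the waveform in a single forward pass, with no sample-by-sample loop. The sketch below uses repeat-upsampling as a stand-in for MelGAN's learned transposed convolutions; the 8*8*2*2 = 256 total upsampling factor matches a typical mel hop size, but all values here are assumptions.

```python
import numpy as np

# Sketch of non-autoregressive vocoding: turn T mel frames into T*256 audio
# samples in one feed-forward pass. np.repeat stands in for a learned
# transposed convolution with the same stride.

def upsample(frames, factor):
    return np.repeat(frames, factor, axis=-1)

frames = np.random.default_rng(0).normal(size=100)  # 100 toy mel frames
x = frames
for factor in (8, 8, 2, 2):   # cumulative upsampling factor: 256
    x = upsample(x, factor)
print(x.shape)  # (25600,) == 100 frames * 256 samples per frame
```

Because every output sample is computed independently of the others, the whole utterance parallelizes across a GPU, which is the source of the speedup over autoregressive vocoders like WaveNet.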

SLIDE 15

MelGAN Generator

SLIDE 16

MelGAN Discriminator

SLIDE 17

Losses
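The slide's loss figures are not recoverable. As a reference point, the MelGAN paper trains with a hinge GAN objective over multiple discriminator scales plus a feature-matching loss; the sketch below is reconstructed from memory and the notation is assumed, not copied from the slide.

```latex
% Hinge loss for each multi-scale discriminator D_k
% (x: real audio, s: input mel spectrogram, G: generator)
\min_{D_k}\;
  \mathbb{E}_{x}\big[\max(0,\; 1 - D_k(x))\big]
+ \mathbb{E}_{s}\big[\max(0,\; 1 + D_k(G(s)))\big]

% Generator: adversarial term plus feature matching over discriminator
% layers i, where N_i is the number of units in layer i
\min_{G}\;
  \mathbb{E}_{s}\Big[\sum_k -D_k(G(s))\Big]
+ \lambda \sum_k \mathcal{L}_{\mathrm{FM}}(G, D_k),
\qquad
\mathcal{L}_{\mathrm{FM}}(G, D_k) =
  \mathbb{E}\sum_i \frac{1}{N_i}
  \big\lVert D_k^{(i)}(x) - D_k^{(i)}(G(s)) \big\rVert_1
```

The feature-matching term compares real and generated audio in the discriminators' intermediate feature spaces rather than in the raw waveform, which stabilizes training without requiring a hand-designed reconstruction loss.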

SLIDE 18
SLIDE 19

Tacotron 2 + MelGAN Samples

SLIDE 20

Questions?