DNN Based TTS Systems
TTS Architecture: Traditional Pipeline
- A typical statistical parametric TTS system contains several modules,
including:
- a text frontend extracting various linguistic features
- a duration model
- an acoustic feature prediction model
- and a complex signal-processing-based vocoder
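The data flow between these modules can be sketched as follows. Every function body here is a toy placeholder (an assumption for illustration), not any real toolkit's API; only the shape of the pipeline reflects the list above.

```python
# Illustrative sketch of a statistical parametric TTS pipeline.
# All stage implementations are placeholders; only the data flow
# between the four modules is the point.

def text_frontend(text):
    # Real frontends do normalization, G2P, POS tagging, etc.
    return [{"phone": ch, "stress": 0} for ch in text.lower() if ch.isalpha()]

def duration_model(linguistic_features):
    # Predict frames per phone; a constant stands in for a trained model.
    return [5 for _ in linguistic_features]

def acoustic_model(linguistic_features, durations):
    # Predict one acoustic feature vector per frame (toy zeros here).
    n_frames = sum(durations)
    return [[0.0] * 60 for _ in range(n_frames)]

def vocoder(acoustic_features, hop=80):
    # A real signal-processing vocoder synthesizes a waveform;
    # here we just allocate `hop` samples per frame.
    return [0.0] * (len(acoustic_features) * hop)

text = "Hello world"
feats = text_frontend(text)
durs = duration_model(feats)
acoustic = acoustic_model(feats, durs)
audio = vocoder(acoustic)
```

Note how each stage consumes the previous stage's output: an error in the frontend or duration model propagates through every later module, which is exactly the compounding-error problem end-to-end systems try to avoid.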
End-to-End Approach
- There are many advantages of an integrated end-to-end TTS system
that can be trained on <text, audio> pairs:
- First, such a system alleviates the need for laborious feature engineering, which
may involve heuristics and brittle design choices.
- Second, it more easily allows for rich conditioning on various attributes, such as
speaker or language, or high-level features like sentiment.
- Similarly, adaptation to new data might also be easier.
- Finally, a single model is likely to be more robust than a multi-stage model
where each component’s errors can compound.
Challenges
- TTS is a large-scale inverse problem: a highly compressed source (text)
is “decompressed” into audio. Since the same text can correspond to different pronunciations or speaking styles, this is a particularly difficult learning task for an end-to-end model: it must cope with large variations at the signal level for a given input.
- Unlike end-to-end speech recognition or machine translation, TTS
outputs are continuous, and output sequences are usually much longer
than those of the input. These attributes cause prediction errors to accumulate quickly.
Tacotron
- An end-to-end generative TTS model based on the sequence-to-sequence (seq2seq) with attention paradigm.
- Tacotron takes characters as input and outputs a raw spectrogram.
- It does not require phoneme-level alignment, so it can easily scale to
using large amounts of acoustic data with transcripts.
- With a simple waveform synthesis technique, Tacotron produces a 3.82
mean opinion score (MOS) on a US English eval set, outperforming a production parametric system in terms of naturalness.
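The "simple waveform synthesis technique" is Griffin-Lim phase reconstruction: given only the predicted magnitude spectrogram, it iteratively re-estimates a consistent phase. A minimal SciPy sketch (window/FFT parameters are illustrative, and the magnitude is assumed to come from `scipy.signal.stft` with the same settings):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_fft=512, n_iter=30, fs=16000):
    """Reconstruct a waveform from an STFT magnitude by iteratively
    re-estimating phase (Griffin & Lim, 1984). Assumes `magnitude`
    was produced by scipy.signal.stft with matching parameters."""
    rng = np.random.default_rng(0)
    # Start from random phase.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Inverse STFT with the current phase estimate...
        _, x = istft(magnitude * phase, fs=fs, nperseg=n_fft)
        # ...then keep only the phase of the re-analysis.
        _, _, Z = stft(x, fs=fs, nperseg=n_fft)
        phase = np.exp(1j * np.angle(Z))
    _, x = istft(magnitude * phase, fs=fs, nperseg=n_fft)
    return x
```

Each iteration enforces the target magnitude, then projects back onto the set of spectrograms that correspond to a real waveform; the audible artifacts of this procedure are one motivation for the neural vocoders discussed later.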
Tacotron II
- Entirely neural
- Uses WaveNet as vocoder
- Achieves a MOS of 4.53, comparable to a MOS of 4.58 for professionally
recorded speech.
Tacotron II Architecture
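The original architecture diagram is not reproduced here. The key interface between its two stages is the mel spectrogram: the seq2seq network predicts mel frames, and the WaveNet vocoder conditions on them. A NumPy sketch of a mel filterbank follows; the filter count, FFT size, and sampling rate are illustrative, not necessarily the paper's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, fs=22050, fmin=0.0, fmax=8000.0):
    # Triangular filters evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

fb = mel_filterbank()
# A mel frame is just fb @ |STFT frame|: (80, 513) @ (513,) -> (80,)
```

Predicting this compact, perceptually motivated representation (instead of a full linear spectrogram or raw audio) is what lets the seq2seq stage stay simple while the vocoder handles fine waveform detail.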
Samples
Sample sentences (Tacotron II vs. Tacotron I audio, not reproduced here):
- Generative adversarial network or variational auto-encoder.
- He has read the whole thing.
- He reads books.
- Thisss isrealy awhsome.
- This is your personal assistant, Google Home. / This is your personal assistant Google Home.
- The buses aren't the problem, they actually provide a solution. / The buses aren't the PROBLEM, they actually provide a SOLUTION.
- The quick brown fox jumps over the lazy dog. / Does the quick brown fox jump over the lazy dog?
End-to-End Tacotron II Samples (Persian)
(English translation of the Persian samples; the extracted text was character-reversed:)
- Hello, I am Hamed Shadbash.
- Iran neighbors Turkmenistan, Armenia, and the Republic of Azerbaijan to the north, Afghanistan and Pakistan to the east, and Turkey and Iraq to the west; it borders the Caspian Sea to the north and the Persian Gulf to the south.
- Apparently all the problems have been solved, and only an official announcement remains.
Vocoders
- WaveNet
- Parallel WaveNet
- WaveGlow
- MelGAN
- WaveRNN
- LPCNet
- Etc.
WaveNet
- Fully convolutional and autoregressive
- Fast to train but slow at inference time
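The train/inference asymmetry comes from autoregression: training evaluates the causal dilated convolutions over all timesteps in parallel (teacher forcing), but generation must produce one sample at a time, each conditioned on the previous outputs. The dilation schedule below (1, 2, ..., 512, repeated) follows the WaveNet paper; kernel size 2 and three repeat cycles are assumptions for this sketch.

```python
# Receptive field of a stack of causal dilated convolutions:
# each layer extends the context by (kernel_size - 1) * dilation samples.

def receptive_field(dilations, kernel_size=2):
    return 1 + sum((kernel_size - 1) * d for d in dilations)

stack = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
dilations = stack * 3                 # three stacked cycles (assumed)
rf = receptive_field(dilations)
print(rf)                             # 3070 samples of context
print(f"{rf / 16000:.3f} s at 16 kHz")
```

So every single generated sample needs a full forward pass over thousands of samples of context, and at 16,000 samples per second of audio this sequential loop dominates inference time.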
WaveGlow
- WaveGlow combines insights from Glow and WaveNet.
- Produces audio samples at a rate of more than 500 kHz on an NVIDIA
V100 GPU.
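The quoted throughput implies a large real-time factor; the 22,050 Hz output sampling rate below is an assumption for this back-of-the-envelope estimate.

```python
# Rough real-time factor implied by the quoted WaveGlow throughput.
samples_per_second = 500_000   # generation rate from the slide (>500 kHz)
output_rate = 22_050           # audio sampling rate (assumed)
rtf = samples_per_second / output_rate
print(f"about {rtf:.0f}x faster than real time")
```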
Tacotron 2 + WaveGlow Samples
MelGAN
- Non-autoregressive, feed-forward convolutional architecture that
performs audio waveform generation in a GAN setup.
- MelGAN is substantially faster than other mel-spectrogram inversion
alternatives; in particular, it is 10 times faster than the fastest
model available to date, without considerable degradation in audio quality.
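The speed comes from generating the whole waveform in one forward pass: the MelGAN generator is a stack of transposed convolutions that upsample mel frames directly to samples, with no sample-by-sample loop. The 8x, 8x, 2x, 2x schedule matches the paper (for a 256-sample hop); the frame count below is illustrative.

```python
# How the MelGAN generator's transposed-conv stack maps mel frames
# to waveform samples: each layer multiplies the time axis by its ratio.

upsample_ratios = [8, 8, 2, 2]   # 8 * 8 * 2 * 2 = 256 = hop size

def output_samples(n_mel_frames, ratios=upsample_ratios):
    n = n_mel_frames
    for r in ratios:
        n *= r
    return n

frames = 100                     # ~1.16 s of mels at a 256-sample hop
print(output_samples(frames))    # 25600 samples
```

Because every output sample is computed in parallel, throughput is bounded by convolution cost rather than by an autoregressive loop, which is why feed-forward vocoders like MelGAN are so much faster than WaveNet at inference.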