DNN Based TTS Systems
TTS Architecture: Traditional Pipeline
- A typical statistical parametric TTS system contains several modules,
including:
- a text frontend extracting various linguistic features
- a duration model
- an acoustic feature prediction model
- and a complex signal-processing-based vocoder
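The data flow between these modules can be sketched as follows. Every function body here is a toy placeholder (an assumption for illustration), not any real toolkit's API; only the shape of the pipeline reflects the list above.

```python
# Illustrative sketch of a statistical parametric TTS pipeline.
# All stage implementations are placeholders; only the data flow
# between the four modules is the point.

def text_frontend(text):
    # Real frontends do normalization, G2P, POS tagging, etc.
    return [{"phone": ch, "stress": 0} for ch in text.lower() if ch.isalpha()]

def duration_model(linguistic_features):
    # Predict frames per phone; a constant stands in for a trained model.
    return [5 for _ in linguistic_features]

def acoustic_model(linguistic_features, durations):
    # Predict one acoustic feature vector per frame (toy zeros here).
    n_frames = sum(durations)
    return [[0.0] * 60 for _ in range(n_frames)]

def vocoder(acoustic_features, hop=80):
    # A real signal-processing vocoder synthesizes a waveform;
    # here we just allocate `hop` samples per frame.
    return [0.0] * (len(acoustic_features) * hop)

text = "Hello world"
feats = text_frontend(text)
durs = duration_model(feats)
acoustic = acoustic_model(feats, durs)
audio = vocoder(acoustic)
```

Note how each stage consumes the previous stage's output: an error in the frontend or duration model propagates through every later module, which is exactly the compounding-error problem end-to-end systems try to avoid.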
End-to-End Approach
- There are many advantages of an integrated end-to-end TTS system
that can be trained on <text, audio> pairs:
- First, such a system alleviates the need for laborious feature engineering, which
may involve heuristics and brittle design choices.
- Second, it more easily allows for rich conditioning on various attributes, such as
speaker or language, or high-level features like sentiment.
- Similarly, adaptation to new data might also be easier.
- Finally, a single model is likely to be more robust than a multi-stage model
where each component’s errors can compound.
Challenges
- TTS is a large-scale inverse problem: a highly compressed source (text)
is “decompressed” into audio. Since the same text can correspond to different pronunciations or speaking styles, this is a particularly difficult learning task for an end-to-end model: it must cope with large variations at the signal level for a given input.
- Unlike end-to-end speech recognition or machine translation, TTS
outputs are continuous, and output sequences are usually much longer
than those of the input. These attributes cause prediction errors to accumulate quickly.
Tacotron
- An end-to-end generative TTS model based on the sequence-to-sequence (seq2seq) with attention paradigm.
- Tacotron takes characters as input and outputs a raw spectrogram.
- It does not require phoneme-level alignment, so it can easily scale to
using large amounts of acoustic data with transcripts.
- With a simple waveform synthesis technique, Tacotron produces a 3.82
mean opinion score (MOS) on a US English eval set, outperforming a production parametric system in terms of naturalness.
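The "simple waveform synthesis technique" is Griffin-Lim phase reconstruction: given only the predicted magnitude spectrogram, it iteratively re-estimates a consistent phase. A minimal SciPy sketch (window/FFT parameters are illustrative, and the magnitude is assumed to come from `scipy.signal.stft` with the same settings):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_fft=512, n_iter=30, fs=16000):
    """Reconstruct a waveform from an STFT magnitude by iteratively
    re-estimating phase (Griffin & Lim, 1984). Assumes `magnitude`
    was produced by scipy.signal.stft with matching parameters."""
    rng = np.random.default_rng(0)
    # Start from random phase.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Inverse STFT with the current phase estimate...
        _, x = istft(magnitude * phase, fs=fs, nperseg=n_fft)
        # ...then keep only the phase of the re-analysis.
        _, _, Z = stft(x, fs=fs, nperseg=n_fft)
        phase = np.exp(1j * np.angle(Z))
    _, x = istft(magnitude * phase, fs=fs, nperseg=n_fft)
    return x
```

Each iteration enforces the target magnitude, then projects back onto the set of spectrograms that correspond to a real waveform; the audible artifacts of this procedure are one motivation for the neural vocoders discussed later.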
Tacotron II
- Entirely neural
- Uses WaveNet as vocoder
- Achieves a MOS of 4.53, comparable to a MOS of 4.58 for professionally
recorded speech.
Tacotron II Architecture
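The original architecture diagram is not reproduced here. The key interface between its two stages is the mel spectrogram: the seq2seq network predicts mel frames, and the WaveNet vocoder conditions on them. A NumPy sketch of a mel filterbank follows; the filter count, FFT size, and sampling rate are illustrative, not necessarily the paper's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, fs=22050, fmin=0.0, fmax=8000.0):
    # Triangular filters evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

fb = mel_filterbank()
# A mel frame is just fb @ |STFT frame|: (80, 513) @ (513,) -> (80,)
```

Predicting this compact, perceptually motivated representation (instead of a full linear spectrogram or raw audio) is what lets the seq2seq stage stay simple while the vocoder handles fine waveform detail.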
Samples
Sample sentences (Tacotron II vs. Tacotron I audio, not reproduced here):
- Generative adversarial network or variational auto-encoder.
- He has read the whole thing.
- He reads books.
- Thisss isrealy awhsome.
- This is your personal assistant, Google Home. / This is your personal assistant Google Home.
- The buses aren't the problem, they actually provide a solution. / The buses aren't the PROBLEM, they actually provide a SOLUTION.
- The quick brown fox jumps over the lazy dog. / Does the quick brown fox jump over the lazy dog?
End-to-End Tacotron II Samples (Persian)
(English translation of the Persian samples; the extracted text was character-reversed:)
- Hello, I am Hamed Shadbash.
- Iran neighbors Turkmenistan, Armenia, and the Republic of Azerbaijan to the north, Afghanistan and Pakistan to the east, and Turkey and Iraq to the west; it borders the Caspian Sea to the north and the Persian Gulf to the south.
- Apparently all the problems have been solved, and only an official announcement remains.
Vocoders
- WaveNet
- Parallel WaveNet
- WaveGlow
- MelGAN
- WaveRNN
- LPCNet
- Etc.
WaveNet
- Fully convolutional and autoregressive
- Fast to train but slow at inference time
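The train/inference asymmetry comes from autoregression: training evaluates the causal dilated convolutions over all timesteps in parallel (teacher forcing), but generation must produce one sample at a time, each conditioned on the previous outputs. The dilation schedule below (1, 2, ..., 512, repeated) follows the WaveNet paper; kernel size 2 and three repeat cycles are assumptions for this sketch.

```python
# Receptive field of a stack of causal dilated convolutions:
# each layer extends the context by (kernel_size - 1) * dilation samples.

def receptive_field(dilations, kernel_size=2):
    return 1 + sum((kernel_size - 1) * d for d in dilations)

stack = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
dilations = stack * 3                 # three stacked cycles (assumed)
rf = receptive_field(dilations)
print(rf)                             # 3070 samples of context
print(f"{rf / 16000:.3f} s at 16 kHz")
```

So every single generated sample needs a full forward pass over thousands of samples of context, and at 16,000 samples per second of audio this sequential loop dominates inference time.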
WaveGlow
- WaveGlow combines insights from Glow and WaveNet.
- Produces audio samples at a rate of more than 500 kHz on an NVIDIA
V100 GPU.
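The quoted throughput implies a large real-time factor; the 22,050 Hz output sampling rate below is an assumption for this back-of-the-envelope estimate.

```python
# Rough real-time factor implied by the quoted WaveGlow throughput.
samples_per_second = 500_000   # generation rate from the slide (>500 kHz)
output_rate = 22_050           # audio sampling rate (assumed)
rtf = samples_per_second / output_rate
print(f"about {rtf:.0f}x faster than real time")
```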
Tacotron 2 + WaveGlow Samples
MelGAN
- Non-autoregressive, feed-forward convolutional architecture that
performs audio waveform generation in a GAN setup.
- MelGAN is substantially faster than other mel-spectrogram inversion
alternatives; in particular, it is 10 times faster than the fastest
model available to date, without considerable degradation in audio quality.
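The speed comes from generating the whole waveform in one forward pass: the MelGAN generator is a stack of transposed convolutions that upsample mel frames directly to samples, with no sample-by-sample loop. The 8x, 8x, 2x, 2x schedule matches the paper (for a 256-sample hop); the frame count below is illustrative.

```python
# How the MelGAN generator's transposed-conv stack maps mel frames
# to waveform samples: each layer multiplies the time axis by its ratio.

upsample_ratios = [8, 8, 2, 2]   # 8 * 8 * 2 * 2 = 256 = hop size

def output_samples(n_mel_frames, ratios=upsample_ratios):
    n = n_mel_frames
    for r in ratios:
        n *= r
    return n

frames = 100                     # ~1.16 s of mels at a 256-sample hop
print(output_samples(frames))    # 25600 samples
```

Because every output sample is computed in parallel, throughput is bounded by convolution cost rather than by an autoregressive loop, which is why feed-forward vocoders like MelGAN are so much faster than WaveNet at inference.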