Tacotron: End-to-End TTS Tacotron [Wang 2017]: Neural Vocoder - PowerPoint PPT Presentation

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous
Google AI

Tacotron: End-to-End TTS
• Tacotron [Wang 2017]:
  • Convert spectrogram to samples using the Griffin-Lim algorithm.
  • End-to-end TTS sounds pretty good.
• Tacotron 2 [Shen 2017]:
  • Convert spectrogram to samples using WaveNet.
  • End-to-end TTS can sound really good.
• Is TTS Solved?
[Figure: Tacotron architecture. Labels: Character/Phone Embeddings, Transcript Encoder, Transcript Embedding, Attention Mechanism, <GO> frame, Pre-net, Attention RNN, Decoder RNN, mel spectrogram with r=3, Neural Vocoder.]

Prosody in Speech
• What’s prosody?
  • Intonation, rhythm, pitch, stress, loudness.
  • Conveys emotion, emphasis, and additional meaning.
• Examples:
  • The cat sat on the mat.
  • End-to-end TTS sounds pretty good.
• Our working definition (subtractive):
  • Prosody isn’t:
    • What’s being said.
    • Who’s saying it.
    • Where it’s being said.
  • Prosody is:
    • How it’s said.

Prosody Transfer
• Various ways to control prosody in TTS:
  • Prosody annotations (e.g., ToBI)
  • Linguistic features (pitch, energy, duration)
  • Prosody transfer (“Say it like this”)
• Desired features of prosody transfer:
  • Pitch-relative transfer (output is within the target speaker’s natural pitch range).
  • Robust to text transformations (one reference can serve many sentences, which makes it scalable).
  • Meaningful embedding space (for sampling or control via other systems).

End-to-End Prosody Transfer
• Prosody Embeddings are computed using a Reference Encoder.
• Speaker embeddings are used for multi-speaker models.
• Both are broadcast-concatenated to the transcript embeddings.
• Reference and target speaker are the same during training (but can be different during inference).
[Figure: Tacotron with prosody and speaker conditioning. Labels: Reference Spectrogram Slices, Reference Encoder, Prosody Embedding, Speaker Embeddings, Embedding Lookup, Speaker Embedding, Character/Phone Embeddings, Transcript Encoder, Transcript Embedding, Attention, Attention Context, <GO> frame, Pre-net, Attention RNN, Decoder RNN, mel spectrogram with r=3, Neural Vocoder.]
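
The broadcast-concatenation step can be illustrated with a minimal sketch. It assumes plain Python lists in place of tensors, and the function name `broadcast_concat`, the toy dimensions, and the concatenation order are all illustrative assumptions rather than details taken from the slides.

```python
def broadcast_concat(transcript, prosody, speaker):
    """Append the same prosody and speaker vectors to every transcript step.

    transcript: list of per-step embedding vectors (one per character/phone).
    prosody, speaker: single fixed-length vectors for the whole utterance.
    The order (transcript, prosody, speaker) is an assumption for illustration.
    """
    return [list(step) + list(prosody) + list(speaker) for step in transcript]


# Toy example: 3 transcript steps of width 2, a 1-d prosody vector,
# and a 2-d speaker vector (all sizes are made up).
transcript = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
prosody = [0.9]
speaker = [1.0, 0.0]

combined = broadcast_concat(transcript, prosody, speaker)
for step in combined:
    print(step)
```

Every step keeps its own transcript features but carries identical prosody and speaker features, so the attention mechanism sees the conditioning at each position.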

Prosody Encoder
• Input: mel spectrogram
• Strided 2D convolutions
  • (Make sure they’re padding invariant)
• RNN aggregation (GRU)
  • Summarize conv features into a single vector.
• Fully connected + activation (tanh)
  • Project vector to desired dimensionality.
[Figure: encoder stack, bottom to top. Labels: reference spectrogram slices, 6-Layer Strided Conv2D w/ BatchNorm, 128-unit GRU, Final GRU State, Activation.]
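
To make the data flow through the encoder concrete, the sketch below tracks only tensor shapes. The slide gives the layer count (6) and the GRU size (128); the stride of 2 on both axes, "SAME"-style padding, 80 mel channels, and 128 filters in the last conv layer are assumptions made for illustration, as are the function names.

```python
import math


def strided_length(length, stride=2, num_layers=6):
    """Length along one axis after num_layers strided convs with SAME padding."""
    for _ in range(num_layers):
        length = math.ceil(length / stride)
    return length


def reference_encoder_shapes(num_frames, num_mels=80, last_filters=128,
                             gru_units=128, embedding_dim=128):
    """Return the (assumed) shapes at each stage of the reference encoder."""
    conv_time = strided_length(num_frames)   # downsampled time axis
    conv_freq = strided_length(num_mels)     # downsampled frequency axis
    return {
        "conv_output": (conv_time, conv_freq, last_filters),
        # Frequency and channel axes are flattened before the GRU.
        "gru_input": (conv_time, conv_freq * last_filters),
        # Only the final GRU state is kept: one vector per utterance.
        "gru_final_state": (gru_units,),
        # Fully connected layer + tanh projects to the embedding size.
        "prosody_embedding": (embedding_dim,),
    }


for name, shape in reference_encoder_shapes(num_frames=400).items():
    print(name, shape)
```

Whatever the reference length, the output is a single fixed-size vector, which is what allows it to be broadcast over the transcript.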

Experiment Setup
• Datasets:
  • Single-speaker audiobook, 147 hours, emotive speech (Blizzard Challenge)
  • Multi-speaker voice assistant, 296 hours, 44 English speakers (Proprietary)
• (Some) Training details:
  • Train for at least 200k steps with batch size 256 and Adam optimizer (3-4 days).

Evaluation Metrics
• How well does the prosody embedding capture prosodic variation?
  • Compare synthesized audio with reference audio.
• Quantitative metrics:
  • Mel Cepstral Distortion (MCD13): Sum squared differences over first 13 MFCCs.
  • F0 Frame Error (FFE): Percentage of frames with either a >20% pitch error or a voicing decision error.
• Subjective evaluation:
  • Anchored side-by-side prosody similarity comparisons on a scale of -3 to 3.
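
The two quantitative metrics can be sketched as follows. This is a hedged illustration, not the authors' evaluation code: it assumes the reference and synthesized sequences are already frame-aligned and equal in length, that an F0 of 0 marks an unvoiced frame, and that MCD is taken as the per-frame Euclidean distance over the first K coefficients averaged over frames (an unscaled form; other scalings exist). The function names are hypothetical.

```python
import math


def mcd(ref_mfccs, syn_mfccs, k=13):
    """Mean per-frame Euclidean distance over the first k MFCCs."""
    if not ref_mfccs or len(ref_mfccs) != len(syn_mfccs):
        raise ValueError("sequences must be non-empty and frame-aligned")
    total = 0.0
    for ref, syn in zip(ref_mfccs, syn_mfccs):
        total += math.sqrt(sum((r - s) ** 2 for r, s in zip(ref[:k], syn[:k])))
    return total / len(ref_mfccs)


def ffe(ref_f0, syn_f0, tolerance=0.2):
    """Fraction of frames with a voicing error or a >tolerance pitch error."""
    if not ref_f0 or len(ref_f0) != len(syn_f0):
        raise ValueError("sequences must be non-empty and frame-aligned")
    errors = 0
    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced, syn_voiced = ref > 0, syn > 0
        if ref_voiced != syn_voiced:
            errors += 1  # voicing decision error
        elif ref_voiced and abs(syn - ref) > tolerance * ref:
            errors += 1  # both voiced, pitch off by more than 20%
    return errors / len(ref_f0)


# Toy F0 tracks in Hz: frame 0 has a 30% pitch error, frame 2 a voicing error.
print(ffe([100.0, 0.0, 200.0, 150.0], [130.0, 0.0, 0.0, 155.0]))
print(mcd([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]], k=2))
```

`ffe` returns a fraction; multiply by 100 to report it as a percentage.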

Evaluation Results
• The tanh-128 model uses a 128-dimensional prosody embedding.

Audio Examples
(Audio example columns: Text, Reference, Baseline, Prosody Embedding)
• Single-speaker model: Reference from unseen speaker
  • Text: The past, the present, and the future walk into a bar. It was tense. | Reference: Aus F | Baseline: Les | Prosody Embedding: Les
• Multi-speaker model: Reference from seen speaker
  • Text: Is that Utah travel agency? | Reference: Aus F | Baseline: US F, GB F, Ind F | Prosody Embedding: US F, GB F, Ind F
  • Text: Only one was deployed, while they need a hundred teams. | Reference: Ind F | Baseline: US F, Aus F, GB F | Prosody Embedding: US F, Aus F, GB F
• Multi-speaker model: Reference from unseen speaker
  • Text: It will be good for both of you. | Reference: Les | Baseline: Aus F, GB F, US M | Prosody Embedding: Aus F, GB F, US M
  • Text: I've swallowed a pollywog. | Reference: Les | Baseline: Aus F, GB F, US M | Prosody Embedding: Aus F, GB F, US M
More audio examples available at: https://google.github.io/tacotron/publications/end_to_end_prosody_transfer/

Is Speaker Identity Preserved?
• Simple speaker classifier is 99% accurate on ground truth and baseline output.
• But for the prosody model, it only chooses the target speaker 20% of the time.
  • (Chooses the reference speaker 61% of the time.)
• Speaker identity is entangled with prosody in a complicated way.
• Preserving a target speaker’s pitch range is a more concrete goal.
[Audio examples: Reference, Baseline, Transfer; rows: Female-male, Male-female.]

Robustness to Text Transformations
(Audio example columns: Text, Reference, Baseline, Prosody Embedding)
• Reference: “I can now,” said the Leopard.
  Perturbed: “I can now,” said the Porcupine.
• Reference: For the first time in her life she had been danced tired.
  Perturbed: For the last time in his life he had been handily embarrassed.
• Reference: Second --Her family was very ancient and noble.
  Perturbed: First --Her family was very sarcastic and horrible.
• Reference: Never again shall Eleanor Lavish be a friend of mine.
  Perturbed: Never again shall Bartholomew Bigglesby be a son of mine.
• Reference: Alice was not much surprised at this, she was getting so used to queer things happening.
  Perturbed: Eric was not much surprised at this, he was getting so used to TensorFlow breaking.

More Audio Examples!!?
• Come check out our poster (#43) for more.
• A final fun example!
  • There are no examples of singing in the single-speaker training data.
  • What if the reference contains singing?
• Text: Sweet dreams are made of these. Friendly Assistants who work hard to please.
[Audio examples: Reference, Baseline, Prosody Embedding.]
More audio examples available at: https://google.github.io/tacotron/publications/end_to_end_prosody_transfer/

Summary
• Prosody is a very important aspect of speech.
• Prosody transfer is a natural interface for prosody control.
• End-to-end prosody transfer works well and is robust to text transformations.
• Pitch-relative prosody transfer is a goal for future work.
• Stick around for the Style Tokens talk next!

References
• [1] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards End-to-End Speech Synthesis,” arXiv.org, vol. cs.CL, 29-Mar-2017.
• [2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” arXiv.org, vol. cs.CL, 15-Dec-2017.
• [3] A. Graves, “Generating Sequences With Recurrent Neural Networks,” arXiv.org, 04-Aug-2013.
• [4] Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis,” arXiv.org, vol. cs.CL, 23-Mar-2018.

Extra Slides

Tacotron Configuration
• Transcript Encoder:
  • Phoneme inputs
  • CBHG [Wang 2017]
• Attention Mechanism:
  • GMM [Graves 2013]
• Sample Generation:
  • Griffin-Lim or WaveNet
[Figure: Tacotron architecture. Labels: Character/Phone Embeddings, Transcript Encoder, Transcript Embedding, Attention Mechanism, <GO> frame, Pre-net, Attention RNN, Decoder RNN, mel spectrogram with r=3, Neural Vocoder.]
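
For readers unfamiliar with the GMM attention cited as [Graves 2013], one decoder step can be sketched as below. It follows the formulation in that paper: mixture weights and widths are exponentiated, and each component's location moves forward by a positive increment, so attention can only advance along the transcript. The function name and argument layout are hypothetical, the raw parameters would normally come from a network rather than be passed in by hand, and the exact parameterization used in this Tacotron configuration is not stated on the slide.

```python
import math


def gmm_attention_step(prev_kappa, alpha_hat, beta_hat, kappa_hat, num_positions):
    """One step of Graves-style GMM attention over transcript positions.

    prev_kappa: component locations from the previous decoder step.
    alpha_hat, beta_hat, kappa_hat: raw (unconstrained) per-component outputs.
    Returns (unnormalized attention weights, updated component locations).
    """
    alpha = [math.exp(a) for a in alpha_hat]   # component importance
    beta = [math.exp(b) for b in beta_hat]     # component sharpness
    # Locations only move forward because exp(.) is always positive.
    kappa = [p + math.exp(k) for p, k in zip(prev_kappa, kappa_hat)]
    weights = [
        sum(a * math.exp(-b * (c - u) ** 2)
            for a, b, c in zip(alpha, beta, kappa))
        for u in range(num_positions)
    ]
    return weights, kappa


# Single component starting at position 0; it advances to position 1.
weights, kappa = gmm_attention_step([0.0], [0.0], [0.0], [0.0], num_positions=3)
print(kappa)
print(weights)
```

The weights are not normalized to sum to one in this formulation; the attention context is the weighted sum of transcript embeddings under these weights.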

Visual Comparisons
• Text: Snuffles is a lot happier. And smells a lot better.
[Figure panels: Reference, Prosody Embedding, Baseline.]
