Tacotron 2 and WaveGlow with Tensor Cores



  1. TEXT-TO-SPEECH SYNTHESIS USING TACOTRON 2 AND WAVEGLOW WITH TENSOR CORES. Rafael Valle, Ryan Prenger and Yang Zhang

  2. OUTLINE: 1. Text to Speech Synthesis 2. Tacotron 2 3. WaveGlow 4. TTS and Tensor Cores

  3. TEXT TO SPEECH SYNTHESIS (TTS). Chart: global TTS market value in USD billions, 2016 vs. 2022 (projected), with examples of voice interaction products: Apple Siri, Microsoft Cortana, Nuance Vocalizer, Amazon Alexa / Polly, Google TTS. Source: https://www.marketsandmarkets.com/PressReleases/text-to-speech.asp

  4. APPLICATIONS OF TTS: Smart Home Devices, Health Care, Audio Books, Vocaloids, Self-Driving Cars, Video Games

  5. TEXT TO SPEECH SYNTHESIS. Diagram: text input ("Forty percent of a ...") is converted to speech output composed of sub-word units ("for - ty per - c - ent of a").

  6. SPEECH SYNTHESIS: THE VODER (1939)

  7. PARAMETRIC SPEECH SYNTHESIS. Pneumatic speech synthesizer developed by von Kempelen in 1791; Voder speech synthesizer developed by Homer Dudley in 1939.

  8. CONCATENATIVE TTS SYNTHESIS. Diagram: sub-word units ("per", "c", "ent", "of", "a", "for", "ty") are selected from a database and concatenated. First practical application in 1936: the British phone company's Talking Clock.

  9. CONCATENATIVE TTS SYNTHESIS • Requires collecting speech units • Requires designing cost heuristics • Requires acoustic processing. Source: https://wezs.com/~danguy/monguy/TTS.html

  10. PARAMETRIC (DEEP LEARNING) TTS SYNTHESIS. Diagram: text input ("Forty percent of a ...") is mapped by a deep learning model to audio output ("for - ty per - c - ent of a").

  11. DEEP LEARNING TTS SYNTHESIS. Diagram: a first model maps text input ("Forty percent of a ...") to linguistic or acoustic features, and a second model maps those features to audio output.

  12. OUTLINE: 1. Text to Speech Synthesis 2. Tacotron 2 3. WaveGlow 4. TTS and Tensor Cores

  13. TEXT TO (MEL) SPECTROGRAM WITH TACOTRON. Tacotron uses CBHG: convolution bank (k = [1, 2, 4, 8, ...]), convolution stack (n-gram like), highway layers, bi-directional GRU. Tacotron 2 uses location-sensitive attention, i.e. it attends to: memory (encoder output), query (decoder output), location (attention weights), and cumulative attention weights (+=).
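
A minimal sketch of the location-sensitive attention described above (illustrative only, not the reference implementation from the NVIDIA repos; layer sizes, padding, and the absence of masking are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Energies combine the decoder query, the encoder memory, and features
    convolved from the previous / cumulative attention weights ("location")."""
    def __init__(self, query_dim=1024, memory_dim=512, attn_dim=128,
                 location_filters=32, location_kernel=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(2, location_filters, location_kernel,
                                       padding=(location_kernel - 1) // 2, bias=False)
        self.location_layer = nn.Linear(location_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, prev_weights, cum_weights):
        # query: (B, query_dim); memory: (B, T, memory_dim)
        # prev_weights, cum_weights: (B, T) attention weights from earlier decoder steps
        loc = torch.stack([prev_weights, cum_weights], dim=1)        # (B, 2, T)
        loc = self.location_conv(loc).transpose(1, 2)                # (B, T, filters)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)
            + self.memory_layer(memory)
            + self.location_layer(loc))).squeeze(-1)                 # (B, T)
        weights = F.softmax(energies, dim=1)                         # attention weights
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
        return context, weights                                      # weights feed the cumulative sum
```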

  14. Implementations: https://github.com/NVIDIA/tacotron2/ and https://github.com/NVIDIA/OpenSeq2Seq/. Deep learning frameworks and libraries: PyTorch, TensorFlow, NVIDIA's Automatic Mixed Precision. Training setup: NVIDIA Tesla V100; good results in less than a day starting fresh, good results in a few hours warm-starting.

  15. TTS DATASET. LJS (Linda Johnson: single native speaker, ~24 hours) • 7 non-fiction books • "All of my recordings were done from the sofa in my family room!" • "All of my recordings were done on a MacBook Pro." • Sometimes raw text, other times ARPAbet • https://keithito.com/LJ-Speech-Dataset/ • https://librivox.org/reader/11049

  16. MEL TO AUDIO WITH WAVENET. Sampling rates: 44100 Hz, 22050 Hz, 16000 Hz. https://deepmind.com/blog/wavenet-generative-model-raw-audio/

  17. WAVENET IMPLEMENTATION DETAILS. Inference speed: naïve PyTorch, 20 samples per second; PyTorch on Volta, 200 samples per second; nv-wavenet, 20,000 samples per second.

  18. MEAN OPINION SCORES: TACOTRON AND WAVENET. https://arxiv.org/abs/1712.05884

  19. OUTLINE: 1. Text to Speech Synthesis 2. Tacotron 2 3. WaveGlow 4. TTS and Tensor Cores

  20. WAVENET IS THE BOTTLENECK. Architecture diagrams: Tacotron 2 and Deep Voice 3. Ping, W. et al. Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. https://arxiv.org/abs/1710.07654. Shen, J. et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. https://arxiv.org/abs/1712.05884

  21. WAVENET IS THE BOTTLENECK. (Same diagrams and references as slide 20.)

  22. AUTO-REGRESSION IS INHERENTLY SERIAL. Q(y_0, y_1, y_2, ...) = Q(y_0) Q(y_1 | y_0) Q(y_2 | y_1, y_0) ... van den Oord, A. et al. WaveNet: A Generative Model for Raw Audio. https://arxiv.org/pdf/1609.03499.pdf
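
To make the serial dependency concrete, here is a minimal sketch of autoregressive sampling (illustrative, not nv-wavenet or the reference WaveNet): each sample requires a full forward pass conditioned on everything generated so far, so the loop cannot be parallelized across time. `model` and `mel` are hypothetical: a network returning a categorical distribution over the next sample, and its conditioning mel-spectrogram.

```python
import torch

def autoregressive_generate(model, mel, num_samples):
    """Draw num_samples audio samples one at a time (inherently serial)."""
    generated = torch.zeros(1, 0, dtype=torch.long)        # y_<t generated so far
    for _ in range(num_samples):                           # one network call per output sample
        probs = model(generated, mel)                      # Q(y_t | y_<t, mel), shape (1, n_classes)
        y_t = torch.multinomial(probs, num_samples=1)      # sample the next value
        generated = torch.cat([generated, y_t], dim=1)     # extend the context
    return generated
```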

  23. AUTO-REGRESSION IS INHERENTLY SERIAL. Q(y_0, y_1, y_2, ...) = Q(y_0) Q(y_1 | y_0) Q(y_2 | y_1, y_0) ... NV-WaveNet: https://github.com/NVIDIA/nv-wavenet. van den Oord, A. et al. WaveNet: A Generative Model for Raw Audio. https://arxiv.org/pdf/1609.03499.pdf

  24. TRANSFORMING WHITE NOISE TO AUDIO IS PARALLEL. Diagram: Gaussian noise, conditioned on a mel-spectrogram, is transformed into audio in a single parallel pass.

  25. AUTO-ENCODER (APPROXIMATING LIKELIHOOD). Diagram: Gaussian noise and mel-spectrogram with two loss terms (Loss 1, Loss 2).

  26. INVERTIBLE NETWORK (EXACT LIKELIHOOD). Diagram: Gaussian noise and mel-spectrogram with a single loss term (Loss 1).
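
For reference, the exact likelihood of an invertible network follows from the standard change-of-variables formula (generic normalizing-flow notation, not copied from the slide; x is the audio, z is the Gaussian noise, and f is the invertible network with z = f^{-1}(x)):

```latex
\log p_\theta(x) = \log p_Z\bigl(f^{-1}(x)\bigr)
                 + \log \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|
```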

  27. HOW TO MAKE A NETWORK INVERTIBLE. Diagram: audio samples as the network input.

  28. HOW TO MAKE A NETWORK INVERTIBLE. Diagram: audio samples (build of slide 27).

  29. HOW TO MAKE A NETWORK INVERTIBLE.

  30. HOW TO MAKE A NETWORK INVERTIBLE. Diagram: a coupling network predicts scale and bias pairs (s, b).

  31. HOW TO MAKE A NETWORK INVERTIBLE. Diagram: the predicted (s, b) pairs are applied to the remaining channels as s · x + b.

  32. HOW TO MAKE A NETWORK INVERTIBLE. Diagram: inversion reuses the same (s, b) from the coupling network and applies (y - b) / s to recover the inputs.
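
A minimal sketch of the affine coupling step built up on slides 30 through 32 (illustrative, not the reference WaveGlow layer; the small convolutional `coupling_net` stands in for the real conditioning network, channel counts are assumptions, and mel conditioning is omitted for brevity):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, channels=8, hidden=256):
        super().__init__()
        assert channels % 2 == 0
        self.half = channels // 2
        # Stand-in coupling network: predicts log s and b from the first half of the channels.
        self.coupling_net = nn.Sequential(
            nn.Conv1d(self.half, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1))

    def forward(self, x):
        x_a, x_b = x[:, :self.half], x[:, self.half:]
        log_s, b = self.coupling_net(x_a).chunk(2, dim=1)
        y_b = torch.exp(log_s) * x_b + b             # forward: s * x + b (slide 31)
        return torch.cat([x_a, y_b], dim=1), log_s   # log_s contributes to the exact likelihood

    def inverse(self, y):
        y_a, y_b = y[:, :self.half], y[:, self.half:]
        log_s, b = self.coupling_net(y_a).chunk(2, dim=1)  # same (s, b): y_a passed through unchanged
        x_b = (y_b - b) * torch.exp(-log_s)          # inverse: (y - b) / s (slide 32)
        return torch.cat([y_a, x_b], dim=1)
```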

  33. https://github.com/NVIDIA/waveglow
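
A minimal usage sketch of the repository above, assuming a downloaded checkpoint stored under a "model" key and exposing the infer(mel, sigma=...) method used by the repo's inference script; the file name and tensor shapes are illustrative:

```python
import torch

# Load a pretrained WaveGlow checkpoint (file name is illustrative).
waveglow = torch.load("waveglow_checkpoint.pt", map_location="cuda")["model"]
waveglow.cuda().eval()

# Conditioning mel-spectrogram, e.g. produced by Tacotron 2 (assumed shape: batch, mel channels, frames).
mel = torch.randn(1, 80, 620, device="cuda")

with torch.no_grad():
    # All audio samples are produced in one parallel pass from noise plus the mel conditioning.
    audio = waveglow.infer(mel, sigma=0.8)
```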

  34. DECREASING TEMPERATURE CAN HELP: τ ≈ 0.8. Diagram: Gaussian noise and mel-spectrogram.
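
Here, lowering the temperature means drawing the latent noise from a Gaussian with a reduced standard deviation before running the inverse transform (a small sketch; shapes and the model call are assumptions):

```python
import torch

temperature = 0.8                                      # value quoted on the slide
n_groups, n_frames = 8, 620                            # illustrative latent shape
z = temperature * torch.randn(1, n_groups, n_frames)   # z ~ N(0, temperature^2 * I)
# audio = flow_model.inverse(z, mel)                   # hypothetical inverse pass, conditioned on the mel
```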

  35. PARALLEL SOLUTION WORKS. (Training loss curve.) NV-WaveNet: 24-48 kHz (1.2x – 2.4x realtime); WaveGlow (published): 520 kHz (24.5x realtime).

  36. PARALLEL SOLUTION WORKS. (Training loss curve.) NV-WaveNet: 24-48 kHz (1.2x – 2.4x realtime); WaveGlow (published): 520 kHz (24.5x realtime); WaveGlow (internal, smaller): 1,500 kHz (70x realtime).

  37. RELATED WORK. Parallel WaveNet / ClariNet: very similar network and inference, very different training procedure. WaveRNN: more like an optimized auto-regressive model; can get some parallelism with the subscale trick.

  38. OUTLINE: 1. Text to Speech Synthesis 2. Tacotron 2 3. WaveGlow 4. TTS and Tensor Cores

  39. INFERENCE SPEED UP with Tensor Cores (Automatic Mixed Precision). Chart: samples/s in MHz; 1.8x faster with Tensor Cores than without. On DGX-1, 1 Tesla V100 GPU, batch size 1.

  40. INFERENCE SPEED UP with Tensor Cores (Automatic Mixed Precision). Chart: relative to real-time (1x), inference runs at 70x without Tensor Cores and 125x with Tensor Cores. On DGX-1, 1 Tesla V100 GPU, batch size 1.

  41. TENSOR CORES SPEED UP MATRIX MULTIPLICATIONS. Diagram: FP16 inputs are multiplied (FP16 x FP16) and accumulated in FP32.
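
As a concrete illustration (assumed setup, not taken from the slides), a half-precision matrix multiply in PyTorch is dispatched to Tensor Cores on Volta-class GPUs:

```python
import torch

# FP16 inputs: on a Volta or newer GPU this matmul runs on Tensor Cores,
# which multiply in FP16 and accumulate in FP32 (the pattern on this slide).
# Dimensions are kept multiples of 8 so Tensor Cores stay engaged.
a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = torch.matmul(a, b)   # result tensor is FP16
```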

  42. 2X FASTER INFERENCE WITH TENSOR CORES. Inference time: 29 ms without Tensor Cores, 15 ms with Tensor Cores.

  43. TRAINING SPEED UP with Tensor Cores (Automatic Mixed Precision). Chart: training time in hours over 1000 epochs on DGX-1 with 1 Tesla V100 GPU; Tensor Cores are 1.9x faster than FP32.

  44. TRAINING WITH TENSOR CORES. Chart: training loss vs. iterations (0k to 100k) for FP32 and Tensor Cores; Tensor Cores achieve similar training loss.

  45. USING TENSOR CORES WITH AMP. Automatic Mixed Precision: a library that enables Tensor Cores; transparently manages type conversions and master weights; automatic loss scaling prevents gradient underflow; different levels of optimization; white/black lists allow the user to enforce precision; easy code adjustment.

  46. INFERENCE WITH AMP IS EASY. Code example: FP32 baseline (code shown as an image on the slide).

  47. INFERENCE WITH AMP IS EASY. Code example: FP32 (1x) vs. Tensor Cores with AMP (1.8x).
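
The slide's code is an image and is not reproduced here; the sketch below shows what an AMP-enabled inference path can look like with the apex AMP library linked on slide 50 (the `waveglow`, `mel`, and `infer` names are assumptions carried over from the earlier sketches):

```python
import torch
from apex import amp

waveglow = waveglow.cuda().eval()
# opt_level "O1" casts whitelisted ops to FP16 so matmuls and convolutions hit Tensor Cores.
waveglow = amp.initialize(waveglow, opt_level="O1")

with torch.no_grad():
    audio = waveglow.infer(mel, sigma=0.8)   # same call as the FP32 path; ~1.8x faster per the slide
```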

  48. TRAINING WITH AMP IS EASY. Code example: FP32 baseline (code shown as an image on the slide).

  49. TRAINING WITH AMP IS EASY. Code example: Tensor Cores with AMP, 1.9x speed up.
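
Again the slide's code is an image; this is a hedged sketch of the few lines AMP adds to a standard training loop, using the apex AMP API linked on slide 50 (`model`, `optimizer`, `criterion`, and `loader` are assumed to already exist):

```python
from apex import amp

# Wrap the model and optimizer once; AMP handles casts and master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    # Replaces loss.backward(): applies automatic loss scaling so FP16
    # gradients do not underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```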

  50. CONCLUSION. Tensor Cores achieve close to 2x faster inference and training on WaveGlow. AMP enables Tensor Cores transparently for training and inference. Code available on NGC and GitHub: https://ngc.nvidia.com/catalog/model-scripts/ https://github.com/NVIDIA/tacotron2 https://github.com/NVIDIA/waveglow https://github.com/NVIDIA/apex/tree/master/apex/amp
