

SLIDE 1

TEXT-TO-SPEECH SYNTHESIS USING TACOTRON 2 AND WAVEGLOW WITH TENSOR CORES

Rafael Valle, Ryan Prenger and Yang Zhang

SLIDE 2

OUTLINE

1. Text-to-Speech Synthesis
2. Tacotron 2
3. WaveGlow
4. TTS and Tensor Cores

SLIDE 3

TEXT TO SPEECH SYNTHESIS (TTS)

[Chart: global TTS market value¹ (USD billions), 2016 vs. 2022]

Commercial TTS systems: Apple Siri, Microsoft Cortana, Amazon Alexa / Polly, Nuance Vocalizer, Google TTS

¹ https://www.marketsandmarkets.com/PressReleases/text-to-speech.asp

Human-to-? interaction

SLIDE 4

APPLICATIONS OF TTS

  • Smart home devices
  • Audiobooks
  • Video games
  • Self-driving cars
  • Vocaloids
  • Health care

SLIDE 5

TEXT TO SPEECH SYNTHESIS

[Diagram: the text input "Forty percent of a…" is converted to speech output, assembled from units "for-ty", "per-cent", "of", "a"]

SLIDE 6

SPEECH SYNTHESIS: THE VODER 1939

SLIDE 7

PARAMETRIC SPEECH SYNTHESIS

  • Pneumatic speech synthesizer developed by von Kempelen in 1791
  • Voder speech synthesizer developed by Homer Dudley in 1939

SLIDE 8

CONCATENATIVE TTS SYNTHESIS

First practical application in 1936: the British phone company's Talking Clock.

[Diagram: speech units ("for-ty", "per-cent", "of", "a") stored in a database and concatenated]

SLIDE 9

CONCATENATIVE TTS SYNTHESIS

https://wezs.com/~danguy/monguy/TTS.html

  • Requires collecting speech units
  • Requires designing cost heuristics
  • Requires acoustic processing
SLIDE 10

PARAMETRIC (DEEP LEARNING) TTS SYNTHESIS

[Diagram: text input "Forty percent of a…" → deep learning model → audio output]

SLIDE 11

DEEP LEARNING TTS SYNTHESIS

[Diagram: a two-stage pipeline: (1st) text input "Forty percent of a…" → linguistic or acoustic features; (2nd) features → audio output]

SLIDE 12

OUTLINE

1. Text-to-Speech Synthesis
2. Tacotron 2
3. WaveGlow
4. TTS and Tensor Cores

SLIDE 13

TEXT TO (MEL) SPECTROGRAM WITH TACOTRON

Tacotron
  CBHG module:
  • Convolution bank (k = [1, 2, 4, 8, …])
  • Convolution stack (n-gram-like)
  • Highway layers
  • Bidirectional GRU

Tacotron 2
  Location-sensitive attention, i.e. attend to:
  • Memory (encoder output)
  • Query (decoder output)
  • Location (attention weights)
  • Cumulative attention weights (accumulated with +=)
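To make this concrete, below is a minimal PyTorch sketch of location-sensitive attention. It illustrates the mechanism listed above; it is not the code from the NVIDIA repo, and layer sizes and names are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Sketch of Tacotron 2-style location-sensitive attention."""
    def __init__(self, query_dim, memory_dim, attn_dim=128,
                 n_filters=32, kernel_size=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        # Location features: convolve the cumulative attention weights
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,
                                       padding=(kernel_size - 1) // 2, bias=False)
        self.location_layer = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, cum_weights):
        # query: (B, query_dim), memory: (B, T, memory_dim), cum_weights: (B, T)
        loc = self.location_conv(cum_weights.unsqueeze(1))      # (B, F, T)
        loc = self.location_layer(loc.transpose(1, 2))          # (B, T, attn_dim)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)                # decoder state
            + self.memory_layer(memory)                         # encoder outputs
            + loc)).squeeze(-1)                                 # (B, T)
        weights = F.softmax(energies, dim=1)
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
        return context, weights  # caller accumulates: cum_weights += weights
```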

SLIDE 14

Implementations
  • https://github.com/NVIDIA/tacotron2/
  • https://github.com/NVIDIA/OpenSeq2Seq/

Deep learning frameworks and libraries
  • PyTorch
  • TensorFlow
  • NVIDIA's Automatic Mixed Precision

Training setup
  • NVIDIA's Tesla V100
  • Good results in less than a day starting fresh
  • Good results in a few hours warm-starting

SLIDE 15

TTS DATASET

LJS (Linda Johnson: single native speaker, ~24 hours)

  • 7 non-fiction books
  • “All of my recordings were done from the sofa in my family room!”
  • “All of my recordings were done on a MacBook Pro.”
  • https://keithito.com/LJ-Speech-Dataset/
  • https://librivox.org/reader/11049

Transcriptions are sometimes raw text, other times ARPAbet, as in the example below.
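For illustration, the same phrase in both forms; this is a hand-made example using the curly-brace ARPAbet convention from the NVIDIA/tacotron2 text frontend, and the transcription is approximate.

```python
# Illustrative only: training transcripts may mix raw text and ARPAbet.
raw_text = "Forty percent of a"
arpabet  = "{F AO1 R T IY0} {P ER0 S EH1 N T} {AH1 V} {AH0}"
```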

SLIDE 16

MEL TO AUDIO WITH WAVENET

Sampling rates: 44100 Hz, 22050 Hz, 16000 Hz

https://deepmind.com/blog/wavenet-generative-model-raw-audio/
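To make the conditioning input concrete, here is a minimal sketch of extracting a mel spectrogram with librosa. The parameters (1024-point FFT, 256-sample hop, 80 mel bands at 22050 Hz) follow the values used in NVIDIA's Tacotron 2 repo; the file name is a placeholder.

```python
import librosa

# Load audio at 22050 Hz and compute the 80-band mel spectrogram that the
# vocoder (WaveNet here, WaveGlow later) is conditioned on.
y, sr = librosa.load("sample.wav", sr=22050)   # placeholder path
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
```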

SLIDE 17

WAVENET IMPLEMENTATION DETAILS

Inference speed:
  • Naïve PyTorch: 20 samples per second
  • PyTorch on Volta: 200 samples per second
  • nv-wavenet: 20,000 samples per second

SLIDE 18

MEAN OPINION SCORES: TACOTRON AND WAVENET

https://arxiv.org/abs/1712.05884

SLIDE 19

OUTLINE

1. Text-to-Speech Synthesis
2. Tacotron 2
3. WaveGlow
4. TTS and Tensor Cores

SLIDE 20

WAVENET IS THE BOTTLENECK

Ping, W. et al. Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. https://arxiv.org/abs/1710.07654
Shen, J. et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. https://arxiv.org/abs/1712.05884

[Figure: Tacotron 2 and Deep Voice 3 pipelines]


SLIDE 22

AUTO-REGRESSION IS INHERENTLY SERIAL

van den Oord, A. WaveNet: A Generative Model for Raw Audio. https://arxiv.org/pdf/1609.03499.pdf

$Q(y_0, y_1, y_2, \dots) = Q(y_0)\,Q(y_1 \mid y_0)\,Q(y_2 \mid y_1, y_0)\cdots$

SLIDE 23

AUTO-REGRESSION IS INHERENTLY SERIAL

NV-WaveNet

https://github.com/NVIDIA/nv-wavenet

van den Oord, A. WaveNet: A Generative Model for Raw Audio. https://arxiv.org/pdf/1609.03499.pdf

$Q(y_0, y_1, y_2, \dots) = Q(y_0)\,Q(y_1 \mid y_0)\,Q(y_2 \mid y_1, y_0)\cdots$
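A minimal sketch of why this factorization forces serial inference (illustrative only, not the nv-wavenet kernel; `model` stands in for any autoregressive net over quantized samples):

```python
import torch

@torch.no_grad()
def sample_autoregressive(model, n_samples):
    # Each y_t is drawn from Q(y_t | y_0, ..., y_{t-1}), so no step can start
    # before the previous one finishes: one network evaluation per sample.
    y = torch.zeros(1, 1, dtype=torch.long)            # initial (silence) sample
    for _ in range(n_samples):
        logits = model(y)[:, -1]                       # distribution over y_t
        nxt = torch.distributions.Categorical(logits=logits).sample()
        y = torch.cat([y, nxt.view(1, 1)], dim=1)
    return y[0, 1:]
```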

SLIDE 24

TRANSFORMING WHITE NOISE TO AUDIO IS PARALLEL

[Diagram: Gaussian noise is transformed into audio in parallel, conditioned on the mel-spectrogram]

SLIDE 25

AUTO-ENCODER (APPROXIMATING LIKELIHOOD)

[Diagram: an auto-encoder trained with two losses, mapping between audio and Gaussian noise, conditioned on the mel-spectrogram]

SLIDE 26

INVERTIBLE NETWORK (EXACT LIKELIHOOD)

[Diagram: an invertible network trained with a single likelihood loss, mapping audio to Gaussian noise, conditioned on the mel-spectrogram]
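The single loss is the exact negative log-likelihood, which follows from the standard change-of-variables formula for invertible maps (spelled out here for completeness): with $z = f^{-1}(x)$ constrained to be Gaussian,

$$\log p(x) = \log \mathcal{N}\big(f^{-1}(x);\, 0, I\big) + \log\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|,$$

so training maximizes the likelihood directly, with no approximation and no second loss term.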

SLIDE 27

HOW TO MAKE A NETWORK INVERTIBLE

[Diagram: audio samples]



SLIDE 30

HOW TO MAKE A NETWORK INVERTIBLE

[Diagram: a coupling network predicts a scale and bias (s, b) for each channel]

SLIDE 31

HOW TO MAKE A NETWORK INVERTIBLE

[Diagram: the forward pass transforms each channel with its predicted (s, b): x' = s · x + b]

SLIDE 32

HOW TO MAKE A NETWORK INVERTIBLE

[Diagram: the inverse pass undoes the transform with x = (x' − b) / s per channel, so the network is invertible]
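A minimal PyTorch sketch of such an affine coupling layer. Following the WaveGlow paper, the network predicts log s (so the division in the inverse becomes a stable multiply by exp(−log s)); the small convolutional coupling net below is a placeholder for WaveGlow's WaveNet-like stack.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Half the channels pass through untouched and parameterize an affine
    transform of the other half, so the layer is exactly invertible."""
    def __init__(self, channels, hidden=256):    # channels must be even
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(                # placeholder coupling network
            nn.Conv1d(half, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 2 * half, 3, padding=1))

    def forward(self, x):                        # training direction
        x_a, x_b = x.chunk(2, dim=1)
        log_s, b = self.net(x_a).chunk(2, dim=1)
        x_b = torch.exp(log_s) * x_b + b         # x_b' = s * x_b + b
        return torch.cat([x_a, x_b], dim=1), log_s.sum()   # log|det J| term

    def inverse(self, y):                        # inference direction
        x_a, y_b = y.chunk(2, dim=1)
        log_s, b = self.net(x_a).chunk(2, dim=1)
        x_b = (y_b - b) * torch.exp(-log_s)      # x_b = (x_b' - b) / s
        return torch.cat([x_a, x_b], dim=1)
```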

SLIDE 33

https://github.com/NVIDIA/waveglow

SLIDE 34

DECREASING TEMPERATURE CAN HELP

[Diagram: inference transforms Gaussian noise into audio, conditioned on the mel-spectrogram, with temperature τ ≈ 0.8]
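In the released WaveGlow code the temperature is exposed as the `sigma` argument of `infer()`; a minimal usage sketch, where the checkpoint path and the `mel` tensor are placeholders and sigma matches the slide's τ ≈ 0.8:

```python
import torch

# Load a pretrained WaveGlow checkpoint and synthesize from noise drawn with
# a reduced standard deviation.
waveglow = torch.load("waveglow_256channels.pt")["model"].cuda().eval()
with torch.no_grad():
    audio = waveglow.infer(mel, sigma=0.8)   # mel: (1, 80, T) CUDA tensor
```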

SLIDE 35

PARALLEL SOLUTION WORKS

NV-WaveNet: 24–48 kHz (1.2x–2.4x real time)

WaveGlow (published): 520 kHz (24.5x real time)

SLIDE 36

PARALLEL SOLUTION WORKS

NV-WaveNet: 24–48 kHz (1.2x–2.4x real time)

WaveGlow (published): 520 kHz (24.5x real time)
WaveGlow (internal, smaller): 1,500 kHz (70x real time)

SLIDE 37

RELATED WORK

Parallel WaveNet / ClariNet
  • Very similar network and inference
  • Very different training procedure

WaveRNN
  • More like an optimized auto-regressive model
  • Can get some parallelism with the subscale trick

SLIDE 38

OUTLINE

1. Text-to-Speech Synthesis
2. Tacotron 2
3. WaveGlow
4. TTS and Tensor Cores

SLIDE 39

INFERENCE SPEED UP

[Chart: inference throughput (samples/s in MHz), w/o vs. w/ Tensor Cores; DGX-1, one Tesla V100 GPU, batch size 1]

With Tensor Cores (Automatic Mixed Precision): 1.8x faster.

SLIDE 40

INFERENCE SPEED UP

[Chart: inference throughput (samples/s in MHz) relative to real time; DGX-1, one Tesla V100 GPU, batch size 1]

Real time is 1x; w/o Tensor Cores: 70x; with Tensor Cores (Automatic Mixed Precision): 125x.

SLIDE 41

TENSOR CORES SPEED UP MATRIX MULTIPLICATIONS

Tensor Cores multiply FP16 matrices and accumulate the result in FP32: FP16 × FP16 + FP32.
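A minimal PyTorch illustration (assumes a Volta or newer GPU): feeding FP16 operands to a matrix multiply lets cuBLAS dispatch it to Tensor Cores, which accumulate the FP16 products in FP32.

```python
import torch

# FP16 operands; on a V100, matmul dimensions that are multiples of 8 help
# ensure Tensor Core dispatch.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b   # products in FP16, accumulation in FP32 inside the Tensor Core
```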

SLIDE 42

2X FASTER INFERENCE WITH TENSOR CORES

Inference time: 29 ms without Tensor Cores, 15 ms with Tensor Cores.

SLIDE 43

TRAINING SPEED UP

[Chart: training time in hours over 1000 epochs, FP32 vs. Tensor Cores; DGX-1, one Tesla V100 GPU]

With Tensor Cores (Automatic Mixed Precision): 1.9x faster.

SLIDE 44

TRAINING WITH TENSOR CORES

[Chart: training loss over iterations 0k–100k, FP32 vs. Tensor Cores]

Tensor Cores achieve similar training loss.

SLIDE 45

USING TENSOR CORES WITH AMP

Automatic Mixed Precision (AMP) is a library that enables Tensor Cores transparently:

  • Manages type conversions and master weights
  • Automatic loss scaling prevents gradient underflow
  • Offers different levels of optimization
  • White/black lists allow the user to enforce precision
  • Requires only easy code adjustments

SLIDE 46

INFERENCE WITH AMP IS EASY

Code example: FP32
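The slide's code screenshot is not preserved; below is a minimal FP32 sketch in the spirit of the NVIDIA repos' inference notebook (`sequence` is an encoded-text tensor, and model loading is omitted):

```python
import torch

with torch.no_grad():
    # Tacotron 2: text -> mel spectrogram (postnet output is the second value)
    _, mel, _, _ = tacotron2.inference(sequence)
    # WaveGlow: mel spectrogram -> waveform
    audio = waveglow.infer(mel, sigma=0.666)
```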

SLIDE 47

INFERENCE WITH AMP IS EASY

Code example: Tensor Cores with AMP

[Chart: relative inference speed; FP32 = 1x, Tensor Cores with AMP = 1.8x]
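With AMP, the only change is an `initialize()` call per model; a sketch using the APEX API linked on the final slide:

```python
import torch
from apex import amp

# Cast the models so matmuls and convolutions run on Tensor Cores.
tacotron2 = amp.initialize(tacotron2, opt_level="O1")
waveglow = amp.initialize(waveglow, opt_level="O1")

with torch.no_grad():
    _, mel, _, _ = tacotron2.inference(sequence)
    audio = waveglow.infer(mel, sigma=0.666)
```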

SLIDE 48

TRAINING WITH AMP IS EASY

Code example: FP32
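Again the screenshot is missing; a minimal FP32 training-step sketch, where `model`, `criterion`, and `train_loader` are placeholders loosely following NVIDIA's waveglow train.py:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for mel, audio in train_loader:
    optimizer.zero_grad()
    loss = criterion(model((mel.cuda(), audio.cuda())))  # flow NLL loss
    loss.backward()
    optimizer.step()
```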

SLIDE 49

TRAINING WITH AMP IS EASY

Code example: Tensor Cores with AMP (1.9x speedup)
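The AMP version of the same loop differs by two changes, matching the APEX documentation: `initialize()` wraps the model and optimizer, and the backward pass goes through a scaled loss so FP16 gradients do not underflow.

```python
from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
for mel, audio in train_loader:
    optimizer.zero_grad()
    loss = criterion(model((mel.cuda(), audio.cuda())))
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()        # gradients of the scaled loss
    optimizer.step()
```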

SLIDE 50

CONCLUSION

  • Tensor Cores achieve close to 2x faster inference and training on WaveGlow
  • AMP enables Tensor Cores transparently for training and inference
  • Code available on NGC and GitHub:

https://ngc.nvidia.com/catalog/model-scripts/
https://github.com/NVIDIA/tacotron2
https://github.com/NVIDIA/waveglow
https://github.com/NVIDIA/apex/tree/master/apex/amp
