SLIDE 1

Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks

Interspeech 2019, Graz, Austria

Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, Herman Kamper
Stellenbosch University, South Africa & University of Edinburgh, UK
https://github.com/kamperh/suzerospeech2019

SLIDE 5

Advances in speech recognition

  • Addiction to text: 2000 hours of transcribed speech audio; ∼350M/560M words of text [Xiong et al., TASLP’17]
  • Sometimes not possible, e.g., for unwritten languages
  • Very different from the way human infants learn language

SLIDE 6

Zero-Resource Speech Challenges (ZRSC)

SLIDE 9

ZRSC 2019: Text-to-speech without text

[Figure: instead of text input like “the dog ate the ball”, an acoustic model maps speech to a sequence of discovered unit IDs (e.g., 7 11 26 31), which a waveform generator then resynthesises in a target voice.]

SLIDE 13

What do we get for training?

No labels :)

Figure adapted from: http://zerospeech.com/2019

SLIDE 14

Approach: Compress, decode and synthesise

[Figure: an encoder maps MFCCs x_1:T to h_1:N, which is discretised to symbols z_1:N (the compression model); a decoder conditioned on a speaker embedding predicts filterbanks ŷ_1:T, and an FFTNet vocoder generates the waveform (the symbol-to-speech module). During training the speaker embedding is the training speaker’s; at synthesis time it is the target speaker’s.]

SLIDE 19

Discretisation methods

  • Straight-through estimation (STE) binarisation: threshold the encoder output h, e.g., h = [0.9, −0.1, 0.3, 0.7, −0.8] → z = [1, −1, 1, 1, −1]
  • Categorical variational autoencoder (CatVAE): Gumbel-softmax relaxation
    z_k = exp((h_k + g_k)/τ) / Σ_{k′=1..K} exp((h_{k′} + g_{k′})/τ)
    (g_k Gumbel noise, τ a temperature), giving a near-one-hot z, e.g., z = [0.86, 0.01, 0.02, 0.11, 0.00]
  • Vector-quantised variational autoencoder (VQ-VAE): choose the closest codebook embedding, e.g., h = [0.9, −0.1, 0.3, 0.7, −0.8] → e = [0.8, −0.2, 0.3, 0.5, −0.6]
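A minimal PyTorch sketch of the three discretisation bottlenecks (an illustrative reimplementation, not the authors’ code; the tensor shapes and codebook size are assumptions):

```python
import torch
import torch.nn.functional as F

def ste_binarise(h):
    # Forward: threshold to {-1, +1}; backward: identity (straight-through),
    # since the threshold itself has zero gradient almost everywhere.
    z = torch.where(h >= 0, torch.ones_like(h), -torch.ones_like(h))
    return h + (z - h).detach()

def catvae_discretise(h, tau=1.0):
    # Gumbel-softmax: z_k = exp((h_k + g_k)/tau) / sum_k' exp((h_k' + g_k')/tau),
    # a differentiable, near-one-hot relaxation of a K-way categorical sample.
    return F.gumbel_softmax(h, tau=tau, hard=False)

def vqvae_discretise(h, codebook):
    # Choose the closest codebook embedding to each h (Euclidean distance)
    # and pass gradients straight through to the encoder output.
    # (The commitment and codebook losses of the full VQ-VAE are omitted.)
    e = codebook[torch.cdist(h, codebook).argmin(dim=-1)]
    return h + (e - h).detach()

h = torch.randn(10, 5, requires_grad=True)   # 10 frames, 5-dim encoder output
codebook = torch.randn(512, 5)               # assumed inventory of K = 512
for f in (ste_binarise, catvae_discretise, lambda x: vqvae_discretise(x, codebook)):
    print(f(h)[0])
```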

SLIDE 20

Neural network architectures

  • Encoder: Convolutional layers, each layer with a stride of 2
  • Decoder: Transposed convolutions mirroring encoder
  • Waveform generation: FFTNet autoregressive vocoder
  • Also experimented with WaveNet: Sometimes gave noisy output
  • Bitrate: Set by the number of symbols K and the number of striding layers (see the sketch below)
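To make the shapes concrete, here is a sketch of such an encoder/decoder pair and the bitrate arithmetic (a hypothetical configuration: the layer counts, channel sizes, feature dimensions and 10 ms frame rate are illustrative assumptions, not the paper’s exact setup):

```python
import math
import torch.nn as nn

def make_encoder(in_dim=39, hidden=256, num_stride_layers=3):
    """Stacked 1-D convolutions, each with stride 2, so num_stride_layers
    striding layers downsample the frame sequence by 2**num_stride_layers."""
    layers, d = [], in_dim
    for _ in range(num_stride_layers):
        layers += [nn.Conv1d(d, hidden, kernel_size=4, stride=2, padding=1),
                   nn.ReLU()]
        d = hidden
    return nn.Sequential(*layers)

def make_decoder(out_dim=80, hidden=256, num_stride_layers=3):
    """Transposed convolutions mirroring the encoder strides, ending in a
    1x1 convolution that predicts the filterbank features."""
    layers = []
    for _ in range(num_stride_layers):
        layers += [nn.ConvTranspose1d(hidden, hidden, kernel_size=4,
                                      stride=2, padding=1),
                   nn.ReLU()]
    return nn.Sequential(*layers, nn.Conv1d(hidden, out_dim, kernel_size=1))

# Bitrate: one symbol from a K-way inventory is emitted every
# (downsample / frame_rate) seconds, i.e. at most
# (frame_rate / downsample) * log2(K) bits per second.
frame_rate, K = 100, 512                      # 10 ms frames (assumed)
for downsample in (1, 4, 8):
    print(downsample, (frame_rate / downsample) * math.log2(K), "bits/sec")
```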



SLIDE 23

Evaluation

Human evaluation metrics:

  • Mean opinion score (MOS)
  • Character error rate (CER)
  • Similarity to the target speaker’s voice

Objective evaluation metrics:

  • ABX discrimination (a sketch follows below)
  • Bitrate

Two evaluation languages:

  • English: Used for development
  • Indonesian: Held out “surprise language”
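As a rough illustration of the ABX test (a simplified sketch: the official evaluation compares frame sequences with dynamic-time-warping-aligned distances, whereas this version just mean-pools each segment and counts ties as errors; all data here is synthetic):

```python
import numpy as np

def abx_error(triples):
    """triples: list of (A, B, X) feature arrays of shape (frames, dim),
    where A and X are from the same phone category and B from a different
    one. Returns the fraction of triples where X is not closer to A."""
    def cos_dist(u, v):
        u, v = u.mean(axis=0), v.mean(axis=0)   # mean-pool over frames
        return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    errors = sum(cos_dist(a, x) >= cos_dist(b, x) for a, b, x in triples)
    return errors / len(triples)

# Toy example with random "features"; a good representation gives low error.
rng = np.random.default_rng(0)
cat1, cat2 = rng.normal(0, 1, 16), rng.normal(3, 1, 16)
triples = [(cat1 + rng.normal(0, .1, (5, 16)),
            cat2 + rng.normal(0, .1, (5, 16)),
            cat1 + rng.normal(0, .1, (5, 16))) for _ in range(100)]
print(abx_error(triples))
```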


SLIDE 24

ABX on English with speaker conditioning

[Bar chart: ABX error (%) for STE, VQ-VAE and CatVAE, without vs. with speaker conditioning; axis spans 10–30%.]

SLIDE 28

ABX on English for different compression rates

[Bar chart: ABX error (%) for STE, VQ-VAE and CatVAE with symbol inventories K ∈ {64, 256, 512}, comparing no downsampling, ×4 downsampling and ×8 downsampling; each bar is annotated with its bitrate, ranging from 64 to 770.]

SLIDE 29

Official evaluation results

Model           CER (%)   MOS [1,5]   Similarity [1,5]   ABX (%)   Bitrate
English:
  DPGMM-Merlin     75        2.50          2.97            35.6        72
  VQ-VAE-x8        75        2.31          2.49            25.1        88
  VQ-VAE-x4        67        2.18          2.51            23.0       173
  Supervised       44        2.77          2.99            29.9        38
Indonesian:
  DPGMM-Merlin     62        2.07          3.41            27.5        75
  VQ-VAE-x8        58        1.94          1.95            17.6        69
  VQ-VAE-x4        60        1.96          1.76            14.5       140
  Supervised       28        3.92          3.95            16.1        35


SLIDE 30

Synthesised examples

[Table of audio demos: for English and Indonesian, input, synthesised output and target-speaker clips for VQ-VAE-x4, plus synthesised output from a newer VQ-VAE-x4-new model.]

SLIDE 31

Conclusions

  • Speaker conditioning consistently improves performance
  • Different discretisation methods perform similarly (VQ-VAE slightly better)
  • Models are difficult to compare directly because their bitrates differ
  • Future: Does discretisation actually benefit feature learning?

SLIDE 32

Why do we have ten authors on this paper?

Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, Herman Kamper

SLIDE 34

https://github.com/kamperh/suzerospeech2019 (Update coming soon)

SLIDE 35

Straight-through estimation (STE) binarisation

  • STE binarisation: z_k = 1 if h_k ≥ 0, and z_k = −1 otherwise
  • For backpropagation we need ∂J/∂h
  • For a single element: ∂J/∂h_k = (∂z_k/∂h_k)(∂J/∂z_k)
  • What is ∂z_k/∂h_k with z_k = threshold(h_k)? Cannot solve directly: the threshold is piecewise constant, so its derivative is zero almost everywhere
  • Idea: If z_k ≈ h_k, then we could use ∂J/∂h_k ≈ ∂J/∂z_k

[Figure: h = [0.9, −0.1, 0.3, 0.7, −0.8] thresholds to z = [1, −1, 1, 1, −1], with the element pair h_4, z_4 highlighted.]

SLIDE 36

Straight-through estimation (STE) binarisation

As an example, let us say h_k = 0.7:

[Figure: a number line from −1 to 1 with h_k = 0.7 marked.]

SLIDE 37

Straight-through estimation (STE) binarisation

Instead of direct thresholding, let us set z_k = 1 with probability 0.85 and z_k = −1 with probability 0.15:

[Figure: the number line from −1 to 1, with sampled values of z_k at the endpoints.]

Estimated mean of z_k over 500 samples: 0.668
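A quick numerical check of this sampling scheme (a minimal sketch; the probabilities 0.85 and 0.15 are (1 + h_k)/2 and (1 − h_k)/2 for h_k = 0.7, as formalised on the next slide):

```python
import numpy as np

h_k = 0.7
rng = np.random.default_rng(0)
# z_k = +1 with probability (1 + h_k)/2 = 0.85, else z_k = -1.
z = rng.choice([1.0, -1.0], size=500, p=[(1 + h_k) / 2, (1 - h_k) / 2])
print(z.mean())  # close to h_k = 0.7 (the slide's run gave 0.668)
```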

SLIDE 38

Straight-through estimation (STE) binarisation

  • So, instead of direct thresholding, we set z_k = h_k + ε, where ε is sampled noise:
    ε = 1 − h_k with probability (1 + h_k)/2, or ε = −h_k − 1 with probability (1 − h_k)/2
  • Since ε is zero-mean, the derivative of the expected value of z_k is ∂E[z_k]/∂h_k = 1
  • Therefore, gradients are passed unchanged through the thresholding operation: ∂J/∂h ≈ ∂J/∂z
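A minimal PyTorch sketch of this stochastic binarisation with the straight-through gradient (an illustrative reimplementation, assuming the encoder output h already lies in [−1, 1], e.g. after a tanh):

```python
import torch

def stochastic_binarise(h):
    # Sample z_k = +1 with probability (1 + h_k)/2, else z_k = -1,
    # i.e. z = h + eps with the zero-mean noise defined above.
    p = (1.0 + h) / 2.0
    z = torch.where(torch.rand_like(h) < p,
                    torch.ones_like(h), -torch.ones_like(h))
    # Forward value is z; backward treats z as if it were h, so
    # gradients pass unchanged through the thresholding (dE[z]/dh = 1).
    return h + (z - h).detach()

h = torch.full((500,), 0.7, requires_grad=True)
z = stochastic_binarise(h)
print(z.mean().item())   # approximately 0.7 in expectation
z.sum().backward()
print(h.grad[0].item())  # 1.0: the gradient is passed straight through
```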