SLIDE 1 Outline: SUZero MML Talk
- Interspeech talk (for Ewald)
- Explain one technique in a bit more detail
- The experience of a coding sprint
SLIDE 2
Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks
Interspeech 2019, Graz, Austria
Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, Herman Kamper
Stellenbosch University, South Africa & University of Edinburgh, UK
https://github.com/kamperh/suzerospeech2019
SLIDE 6 Advances in speech recognition
- Addiction to text: 2000 hours transcribed speech audio;
∼350M/560M words text [Xiong et al., TASLP’17]
- Sometimes not possible, e.g., for unwritten languages
- Very different from the way human infants learn language
SLIDE 8 Zero-Resource Speech Challenges (ZRSC)
SLIDE 9 ZRSC 2019: Text-to-speech without text
[Figure: text ‘the dog ate the ball’ → Waveform generator → Target voice]
SLIDE 10 ZRSC 2019: Text-to-speech without text
[Figure: Acoustic model → unit labels (7, 11, 26, 31) → Waveform generator → Target voice]
SLIDE 14 What do we get for training?
No labels :)
Figure adapted from: http://zerospeech.com/2019
SLIDE 15 Approach: Compress, decode and synthesise
[Figure: Compression model: MFCCs $x_{1:T}$ → Encoder → $h_{1:N}$ → Discretise → $z_{1:N}$. Symbol-to-speech module: $z_{1:N}$ and speaker embedding (Speaker ID → Embed) → Decoder → filterbanks $\hat{y}_{1:T}$ → FFTNet vocoder → waveform]
SLIDE 16 Approach: Compress, decode and synthesise
[Figure: Compression model: MFCCs $x_{1:T}$ → Encoder → $h_{1:N}$ → Discretise → $z_{1:N}$. Symbol-to-speech module: $z_{1:N}$ and speaker embedding (Training speaker → Embed) → Decoder → filterbanks $\hat{y}_{1:T}$ → FFTNet vocoder → waveform]
SLIDE 17 Approach: Compress, decode and synthesise
[Figure: Compression model: MFCCs $x_{1:T}$ → Encoder → $h_{1:N}$ → Discretise → $z_{1:N}$. Symbol-to-speech module: $z_{1:N}$ and speaker embedding (Target speaker → Embed) → Decoder → filterbanks $\hat{y}_{1:T}$ → FFTNet vocoder → waveform]
SLIDE 20 Discretisation methods
- Straight-through estimation (STE) binarisation:
  h = [0.9, −0.1, 0.3, 0.7, −0.8] → threshold → z = [1, −1, 1, 1, −1]
- Categorical variational autoencoder (CatVAE):
  h = [0.9, −0.1, 0.3, 0.7, −0.8] → z = [0.86, 0.01, 0.02, 0.11, 0.00], with
  $z_k = \frac{e^{(h_k + g_k)/\tau}}{\sum_{j=1}^{K} e^{(h_j + g_j)/\tau}}$
- Vector-quantised variational autoencoder (VQ-VAE):
  h = [0.9, −0.1, 0.3, 0.7, −0.8] → choose closest embedding e → z = [0.8, −0.2, 0.3, 0.5, −0.6]
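The three discretisation methods above can be sketched in NumPy as follows. This is a minimal forward-pass illustration, not the team's implementation: the function names, the temperature `tau = 1.0` and the two-entry codebook are hypothetical, and the noise term g is assumed to be standard Gumbel noise, as in the usual Gumbel-softmax relaxation.

```python
import numpy as np


def ste_binarise(h):
    """Threshold each element: +1 where h >= 0, otherwise -1."""
    return np.where(h >= 0, 1.0, -1.0)


def gumbel_softmax(h, tau, rng):
    """Relaxed categorical sample: softmax((h + g) / tau), g ~ Gumbel(0, 1)."""
    u = rng.uniform(low=1e-10, high=1.0, size=h.shape)
    g = -np.log(-np.log(u))          # standard Gumbel noise (assumed)
    logits = (h + g) / tau
    logits = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(logits)
    return e / e.sum()


def vq_nearest(h, codebook):
    """Return (index, vector) of the codebook row closest to h."""
    dists = np.sum((codebook - h) ** 2, axis=1)
    k = int(np.argmin(dists))
    return k, codebook[k]


h = np.array([0.9, -0.1, 0.3, 0.7, -0.8])

z_ste = ste_binarise(h)

rng = np.random.default_rng(0)
z_cat = gumbel_softmax(h, tau=1.0, rng=rng)

# Hypothetical two-entry codebook; the first row matches the slide's example.
codebook = np.array([
    [0.8, -0.2, 0.3, 0.5, -0.6],
    [-0.5, 0.4, -0.9, 0.1, 0.6],
])
k, z_vq = vq_nearest(h, codebook)
```

Here `z_ste` reproduces the thresholded vector from the slide, `z_cat` is a probability vector over the K = 5 entries, and `z_vq` is the codebook row nearest to h.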
SLIDE 21 Neural network architectures
- Encoder: Convolutional layers, each layer with a stride of 2
- Decoder: Transposed convolutions mirroring encoder
- Waveform generation: FFTNet autoregressive vocoder
- Also experimented with WaveNet: Sometimes gave noisy output
- Bitrate: Set by number of symbols K and number of striding layers
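As a rough illustration of how K and the number of striding layers set the bitrate, the sketch below computes an upper bound. It assumes 100 input frames per second (a 10 ms frame shift, which the slide does not state), that each stride-2 layer halves the frame rate, and that all K symbols are equally likely; the function name is hypothetical. A bitrate measured from the empirical symbol distribution would typically be lower than this bound.

```python
import math


def bitrate_upper_bound(num_symbols, num_stride_layers, input_frame_rate=100.0):
    """Upper bound on bits per second for a discrete symbol stream.

    Assumes each striding layer has stride 2 and that symbols are
    uniformly distributed, so each symbol carries log2(K) bits.
    """
    symbols_per_second = input_frame_rate / (2 ** num_stride_layers)
    bits_per_symbol = math.log2(num_symbols)
    return symbols_per_second * bits_per_symbol


# Three stride-2 layers give x8 downsampling; two give x4.
bound_x8 = bitrate_upper_bound(num_symbols=512, num_stride_layers=3)
bound_x4 = bitrate_upper_bound(num_symbols=256, num_stride_layers=2)
```

Under these assumptions, adding a striding layer halves the bound, while doubling K adds one bit per symbol.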
SLIDE 24 Evaluation
Human evaluation metrics:
- Mean opinion score (MOS)
- Character error rate (CER)
- Similarity to the target speaker’s voice
Objective evaluation metrics:
- ABX discrimination
- Bitrate
Two evaluation languages:
- English: Used for development
- Indonesian: Held out “surprise language”
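For reference, character error rate is commonly defined as the character-level edit distance between a reference text and a transcription, divided by the reference length. The sketch below assumes that standard definition; the challenge's exact scoring script may differ, and the function names and example strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


cer = character_error_rate("the dog ate the ball", "the dog ate the bell")
```

In this example one character out of twenty is substituted, so `cer` is 0.05 (5%).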
SLIDE 25 ABX on English with speaker conditioning
[Figure: bar chart of ABX (%) on English for STE, VQ-VAE and CatVAE, with no speaker conditioning versus speaker conditioning]
SLIDE 29 ABX on English for different compression rates
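[Figure: bar chart of ABX (%) on English for STE, VQ-VAE and CatVAE, with x-axis groups 64, 256 and 512, under no downsampling, ×4 downsampling and ×8 downsampling. Values annotated on the bars, in extraction order: 64 116 473 | 79 154 644 | 85 164 682 | 75 139 576 | 93 188 770 | 100 190 750 | 70 124 478 | 90 194 646 | 103 215 686]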
SLIDE 30 Official evaluation results
Model          CER (%)   MOS [1, 5]   Similarity [1, 5]   ABX (%)   Bitrate
English:
DPGMM-Merlin   75        2.50         2.97                35.6      72
VQ-VAE-x8      75        2.31         2.49                25.1      88
VQ-VAE-x4      67        2.18         2.51                23.0      173
Supervised     44        2.77         2.99                29.9      38
Indonesian:
DPGMM-Merlin   62        2.07         3.41                27.5      75
VQ-VAE-x8      58        1.94         1.95                17.6      69
VQ-VAE-x4      60        1.96         1.76                14.5      140
Supervised     28        3.92         3.95                16.1      35
SLIDE 31 Synthesised examples
[Audio examples: input, synthesised output and target speaker for English and Indonesian utterances, comparing VQ-VAE-x4 and VQ-VAE-x4-new]
SLIDE 32 Conclusions
- Speaker conditioning consistently improves performance
- Different discretisation methods are similar (VQ-VAE slightly better)
- Different models difficult to compare because of bitrate
- Future: Does discretisation actually benefit feature learning?
SLIDE 34
https://github.com/kamperh/suzerospeech2019 (Update coming soon)
SLIDE 35 Straight-through estimation (STE) binarisation
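$z_k = 1$ if $h_k \geq 0$, or $z_k = -1$ otherwise
- For backpropagation we need $\frac{\partial J}{\partial h}$:
  $\frac{\partial J}{\partial h_k} = \frac{\partial z_k}{\partial h_k} \frac{\partial J}{\partial z_k}$
- What is $\frac{\partial z_k}{\partial h_k}$ with $z_k = \text{threshold}(h_k)$? Cannot solve directly
- Idea: If $z_k \approx h_k$ then we could use $\frac{\partial J}{\partial h_k} \approx \frac{\partial J}{\partial z_k}$
[Figure: h = [0.9, −0.1, 0.3, 0.7, −0.8] → threshold → z = [1, −1, 1, 1, −1], with $h_4$ and $z_4$ labelled]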
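The straight-through idea can be sketched as a hand-written forward and backward pass. This is an illustration only: the squared-error loss, the `target` vector and the function names are hypothetical and are not taken from the talk.

```python
import numpy as np


def ste_forward(h):
    """Forward pass: hard threshold to +1 / -1."""
    return np.where(h >= 0, 1.0, -1.0)


def ste_backward(grad_z):
    """Backward pass: treat dz/dh as the identity, so dJ/dh ~= dJ/dz."""
    return grad_z.copy()


h = np.array([0.9, -0.1, 0.3, 0.7, -0.8])
target = np.array([1.0, 1.0, 1.0, -1.0, -1.0])  # hypothetical target codes

z = ste_forward(h)

# Hypothetical loss J = 0.5 * sum((z - target) ** 2), so dJ/dz = z - target.
grad_z = z - target

# Straight-through estimate of dJ/dh: the gradient passes through unchanged.
grad_h = ste_backward(grad_z)
```

The thresholding step itself contributes nothing to the backward pass; only positions where `z` disagrees with `target` receive a non-zero gradient.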
SLIDE 36 Straight-through estimation (STE) binarisation
As an example, let us say $h_k = 0.7$:
[Figure: number line from −1 to 1 with 0.7 marked]
SLIDE 37 Straight-through estimation (STE) binarisation
Instead of direct thresholding, let us set $z_k = 1$ with probability 0.85 and $z_k = -1$ with probability 0.15:
[Figure: number line from −1 to 1 with 0.7 marked]
Estimated mean of $z_k$ over 500 samples: 0.668
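The sampling experiment above can be reproduced with a short simulation. The sketch assumes the probability of drawing +1 is (1 + h) / 2, which gives 0.85 for h = 0.7; the seed and function name are arbitrary, so the estimate will not match the slide's 0.668 exactly.

```python
import numpy as np


def sample_stochastic_binarisation(h, num_samples, rng):
    """Draw z in {-1, +1} with P(z = +1) = (1 + h) / 2."""
    p_plus = (1.0 + h) / 2.0
    return np.where(rng.random(num_samples) < p_plus, 1.0, -1.0)


rng = np.random.default_rng(0)  # arbitrary seed
samples = sample_stochastic_binarisation(0.7, 500, rng)
estimated_mean = samples.mean()
```

With 500 samples the estimate fluctuates by a few hundredths around 0.7, consistent with the value reported on the slide.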
SLIDE 38 Straight-through estimation (STE) binarisation
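- So, instead of direct thresholding, we set $z_k = h_k + \epsilon$, where $\epsilon$ is sampled noise:
  $\epsilon = 1 - h_k$ with probability $\frac{1 + h_k}{2}$
  $\epsilon = -h_k - 1$ with probability $\frac{1 - h_k}{2}$
- Since $\epsilon$ is zero-mean, the derivative of the expected value of $z_k$ is: $\frac{\partial \mathbb{E}[z_k]}{\partial h_k} = 1$
- Therefore, gradients are passed unchanged through the thresholding operation: $\frac{\partial J}{\partial h} \approx \frac{\partial J}{\partial z}$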
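The zero-mean claim can be checked directly. The derivation below assumes the noise takes the value $1 - h_k$ with probability $\frac{1 + h_k}{2}$ (the value that makes $z_k = 1$) and $-h_k - 1$ with probability $\frac{1 - h_k}{2}$ (the value that makes $z_k = -1$).

```latex
\begin{align*}
\mathbb{E}[\epsilon]
  &= (1 - h_k)\,\frac{1 + h_k}{2} + (-h_k - 1)\,\frac{1 - h_k}{2} \\
  &= \frac{1 - h_k^2}{2} - \frac{1 - h_k^2}{2} = 0, \\
\mathbb{E}[z_k]
  &= h_k + \mathbb{E}[\epsilon] = h_k
  \quad\Rightarrow\quad
  \frac{\partial\, \mathbb{E}[z_k]}{\partial h_k} = 1 .
\end{align*}
```

For $h_k = 0.7$ the two probabilities are 0.85 and 0.15, matching the sampling example.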
SLIDE 39 Outcome of ZRSC 2019
SLIDE 40
Coding sprint:
Stellenbosch University ZeroSpeech (SUZero) Team
SLIDE 41 Why do we have ten authors on this paper?
Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, Herman Kamper
SLIDE 47 Planned structure
- Original idea: Arnu had a sprint for some other work
- Duration: Two weeks (probably longer, but then you can leave)
- Two teams, with tech support (Elan) and someone crying (Herman)
- Compression team:
Arnu, Ryan, André, Leanne
Ewald, Benji, Lisa, Avashna
- Teams would work in parallel
SLIDE 48 Planned structure
- Herman talks to team leaders every day
- Daily stand-ups within each of the teams
- Slack: All communication
- Trello: Track tasks with boards (backlog, in-progress, done)
- Bitbucket: Version control using git, pull requests need to be reviewed
SLIDE 60 Promises beforehand: What will you get from this?
- You can leave after two weeks (check with your supervisors)
- Have fun
- Learn something about speech!
- Learn some software engineering skills . . . maybe
- Do something in a group
- Learn where the DSP and MediaLabs are
- Maybe a paper . . . probably . . . almost certainly
- Worst case: Pizza and beer
SLIDE 70 What actually happened
- Two teams ∼
- Herman talks to team leaders every day
- Daily stand-ups within each of the teams
- Slack: All communication :(
- Some people didn’t respond; different time zones complicated things
SLIDE 85 What actually happened
- You can leave after two weeks
- But for Ryan, André, Benji . . . almost two months, up to day of submission deadline :(
- Have fun ∼
- Learn something about speech!
- Learn some software engineering skills . . . maybe
- Do something in a group
- Learn where the DSP and MediaLabs are :(
- Maybe a paper . . . probably . . . almost certainly
- Pizza and beer
SLIDE 93 What we learned: Things we did that worked
- Planning beforehand (Herman did some prototyping)
- Role assignment beforehand
- Make expectations clear upfront (e.g. authors on paper and order)
- Using team leaders to deal with big team
- Flexible in restructuring things on the fly (based on listening to
recommendations from team)
- Pizza and beer (Gino’s delivers)
SLIDE 98 What we learned: Things we did that didn’t work
- Some roles weren’t clear enough (especially for first-year master’s students)
- Some people had other stuff going on in the first two weeks
- We focussed on intermediate evaluations which turned out not to be that important in the end
- Don’t do this in the first two weeks of Systems and Signals 414
lectures
SLIDE 103 What we learned: What we would do differently
- Smaller team: Can’t have stand-ups with 10 people; maybe apply the one-pizza rule
- Every team member should have a specific purpose
- Locations: If possible, have everyone in one central place (a lab, not a
small room)
- Get through the pipeline faster: Idea, model, implement, evaluate
SLIDE 110 Conclusions about sprint
- Frustrating but fun at the same time
- (New) students got to know each other
Low-resource speech and language (LSL) and Lego group
- Everyone learned something
- Pizza and beer