(Outrageously∗) Low-Resource Speech Processing — PowerPoint PPT Presentation



slide-2
SLIDE 2

(Outrageously∗) Low-Resource Speech Processing

NLP @ Deep Learning Indaba, Kenya, 2019 Herman Kamper

E&E Engineering, Stellenbosch University, South Africa http://www.kamperh.com/

∗Title plagiarised from Jade Abbott’s DLI talk

slide-4
SLIDE 4

Supervised speech recognition

[Figure: example speech with its transcription, “i had to think”]

Some example speech, since speech recognition is really cool.

slide-5
SLIDE 5

Unsupervised (“zero-resource”) speech processing

My problem: What can we learn if we do not have any labels?



slide-11
SLIDE 11

Example: Query-by-example speech search

Spoken query:

A useful speech system that does not require any transcribed speech

[Jansen and Van Durme, Interspeech’12]
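Query-by-example search of this kind is typically built on dynamic time warping (DTW), which aligns the spoken query against the search audio and is robust to differences in speaking rate. A minimal pure-Python sketch of the alignment idea (the one-dimensional "frames" and the absolute-difference distance are toy stand-ins for real speech features, not the cited system):

```python
# Minimal dynamic time warping (DTW): aligns two sequences of
# feature "frames" and returns the accumulated alignment cost.

def dtw(query, utterance, dist=lambda a, b: abs(a - b)):
    n, m = len(query), len(utterance)
    INF = float("inf")
    # cost[i][j]: best cost of aligning query[:i] with utterance[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(query[i - 1], utterance[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # query frame stretched
                                 cost[i][j - 1],      # utterance frame stretched
                                 cost[i - 1][j - 1])  # frames matched
    return cost[n][m]

# The same "word" spoken more slowly aligns cheaply; a different word does not.
query = [1.0, 2.0, 3.0, 2.0]
same_word_slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0]
other_word = [5.0, 5.0, 6.0, 5.0]
assert dtw(query, same_word_slow) < dtw(query, other_word)
```

In a real system the query would additionally be slid over each utterance (subsequence DTW), and low-cost matches returned as search hits.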

slide-15
SLIDE 15

Outrageously low-resource = unsupervised speech processing (outline)

  • Why is this problem so important?

Will try to convince you that this is (one of) the most fundamental machine learning problems, with real impactful applications

  • What are the key ideas needed to tackle this problem?

Hopefully you will get some useful tools

  • What is still missing?

What are the open problems and research questions which still need to be solved (according to me)


slide-16
SLIDE 16

Why is this problem so important?


slide-20
SLIDE 20
  • 1. A fundamental machine learning problem

Problems in unsupervised speech processing:

  • Learning useful representations from unlabelled speech
  • Segmenting, clustering and discovering longer-spanning (word- or phrase-like) patterns
  • Combined problem of perception, structure, continuous and discrete variables

“The goal of machine learning is to develop methods that can automatically detect patterns in data . . . ” — Murphy

“Extract important patterns and trends, and understand ‘what the data says’ . . . ” — Hastie, Tibshirani, Friedman

“The problem of searching for patterns in data is . . . fundamental . . . ” — Bishop


slide-24
SLIDE 24
  • 2. Universal speech technology

“Imagine a world in which every single human being can freely share in the sum of all knowledge.”

— Mission statement stolen from Laura Martinus, who stole it from the Wikimedia Foundation

https://15.wikipedia.org/endowment.html

slide-25
SLIDE 25
  • 2. Universal speech technology

UN Pulse Lab, Kampala
https://www.kpvu.org/post/turn-tune-transcribe-un-develops-radio-listening-tool

slide-26
SLIDE 26
  • 2. Universal speech technology

[Diagram: existing system (live radio stream → PREPROCESS → ASR → KEYWORD SEARCH → HUMAN ANALYSTS → DATABASE, passing speech, lattices, and keywords with timings and probabilities); proposed system (the ASR and keyword search replaced by a CNN-DTW KEYWORD SPOTTER)]

[Saeb et al., 2017; Menon et al., 2018]

slide-27
SLIDE 27
  • 2. Universal speech technology

[Renkens, PhD’18]

slide-28
SLIDE 28
  • 2. Universal speech technology

Linguistic and cultural documentation and preservation:

http://www.stevenbird.net/

slide-30
SLIDE 30
  • 3. Understanding human language acquisition
  • Cognitive modelling: try to uncover learning mechanisms in humans
  • A model of human language acquisition: can probe easily
  • Example applications:
    — Identify hearing loss early
    — Predict learning difficulties
    — How much do we need to talk to infants?

https://bergelsonlab.com/seedlings/

slide-31
SLIDE 31

Three ideas to tackle these problems

slide-36
SLIDE 36
  • 1. Build in the (domain) knowledge we have
  • Pushing the model in a direction: inductive bias, Bayesian priors, regularisation, data augmentation
  • In unsupervised learning this is all we have
  • We know a lot about languages in general
  • Example: Although speech sounds are produced differently in different languages, there are aspects which are shared

slide-37
SLIDE 37
  • 1. Build in the (domain) knowledge we have

Share representations across languages:

[Diagram: multilingual bottleneck network with MFCC + i-vector input, shared hidden layers, a BNF (bottleneck feature) layer, and separate output layers for Korean, French and German]

[Hermann and Goldwater, 2018; Hermann et al., 2018; https://arxiv.org/abs/1811.04791]

slide-38
SLIDE 38
  • 1. Build in the (domain) knowledge we have

[Plot: average precision (%) against hours of training data for six languages (ES, HA, HR, SV, TR, ZH), comparing BNF 1, BNF 2, EN cAE, UTD cAE and gold cAE features]

[Hermann and Goldwater, 2018; Hermann et al., 2018; https://arxiv.org/abs/1811.04791]

slide-39
SLIDE 39
  • 2. Compression


slide-40
SLIDE 40
  • 2. Compression

Autoencoder:

[Diagram: input x → latent h → reconstruction x̂]

Loss for a single training example: J = ||x − x̂||²
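The autoencoder loss above is just the squared reconstruction error. A quick sketch in pure Python (the input and reconstruction vectors are illustrative values, not model outputs):

```python
# Autoencoder training loss for a single example:
# J = ||x - x_hat||^2, the squared reconstruction error.

def reconstruction_loss(x, x_hat):
    return sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))

x = [0.5, -1.0, 2.0]       # input frame (illustrative values)
x_hat = [0.4, -0.8, 2.1]   # decoder output
J = reconstruction_loss(x, x_hat)
assert abs(J - 0.06) < 1e-9   # 0.1^2 + 0.2^2 + 0.1^2
```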

slide-41
SLIDE 41
  • 2. Compression

Vector-quantised variational autoencoder (VQ-VAE):

[Diagram: input x → encoder output h → select closest codebook embedding from e_1, …, e_K → z → reconstruction x̂]

z = e_k where k = argmin_j ||h − e_j||², j = 1, …, K

J = α||x − x̂||² + ||sg(h) − e_k||² + β||h − sg(e_k)||²
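The quantisation step and the three-term loss can be sketched directly. Pure Python with toy vectors; note that sg(·) only matters for gradients, so it has no effect on the loss values computed here:

```python
# VQ-VAE forward-pass sketch: quantise the encoder output h to its
# nearest codebook embedding, then compute the three loss terms.

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantise(h, codebook):
    """z = e_k where k = argmin_j ||h - e_j||^2"""
    k = min(range(len(codebook)), key=lambda j: sq_dist(h, codebook[j]))
    return k, codebook[k]

def vqvae_loss(x, x_hat, h, e_k, alpha=1.0, beta=0.25):
    recon = alpha * sq_dist(x, x_hat)    # reconstruction loss
    codebook_loss = sq_dist(h, e_k)      # ||sg(h) - e_k||^2: moves e_k toward h
    commit = beta * sq_dist(h, e_k)      # beta ||h - sg(e_k)||^2: moves h toward e_k
    return recon + codebook_loss + commit

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]   # e_1 ... e_K (toy values)
h = [0.9, 1.2]                                    # encoder output
k, z = quantise(h, codebook)
assert k == 1 and z == [1.0, 1.0]

J = vqvae_loss(x=[1.0, 1.0], x_hat=[0.9, 1.1], h=h, e_k=z)
```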

slide-42
SLIDE 42
  • 2. Compression: An example from our group

Benjamin van Niekerk, André Nortje

[Audio examples: input and synthesised output for English and Indonesian]

https://arxiv.org/abs/1904.07556

[Diagram: compression model (waveform x_1:T → MFCCs → encoder → discretise → z_1:N) followed by a symbol-to-speech module (speaker ID embedding; decoder → filterbanks h_1:N → FFTNet vocoder → waveform ŷ_1:T)]

slide-43
SLIDE 43
  • 3. Learning from multiple modalities


slide-44
SLIDE 44
  • 3. Learning from multiple modalities

[Diagram: an image branch (VGG → max pooling → y_vis) and a speech branch (conv → max pooling → feedforward → y_spch), trained with a distance d(y_vis, y_spch)]

[Harwath et al., NeurIPS’16]
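The distance d(y_vis, y_spch) can be any embedding distance; a cosine-distance sketch on hypothetical pre-computed embeddings (all vectors below are made up for illustration, not outputs of the cited model):

```python
# Multimodal matching idea: embed an image as y_vis and a spoken
# caption as y_spch; matched pairs should be close under d(., .).
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

y_vis = [0.8, 0.1, 0.6]            # image embedding (illustrative)
y_spch_match = [0.7, 0.2, 0.5]     # embedding of the matching spoken caption
y_spch_other = [-0.5, 0.9, -0.2]   # embedding of an unrelated caption

assert cosine_distance(y_vis, y_spch_match) < cosine_distance(y_vis, y_spch_other)
```

Training then pushes matched pairs together and mismatched pairs apart (e.g. with a margin loss) so that retrieval reduces to a nearest-neighbour search under this distance.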

slide-48
SLIDE 48
  • 3. Learning from multiple modalities

One-shot multimodal learning and matching:

Query: spoken “two”

[Diagram: multimodal one-shot learning from a support set of paired spoken words and images, followed by multimodal one-shot matching of the query against a matching set]

[Eloff et al., ICASSP’19; https://arxiv.org/abs/1811.03875]

Ryan Eloff, Leanne Nortje
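One way to sketch the matching step is plain nearest-neighbour search over embeddings: find the support-set spoken word closest to the query, take its paired image, and match that against the matching set. All embeddings below are toy values, not outputs of the actual model:

```python
# One-shot multimodal matching, nearest-neighbour sketch:
# 1) find the support-set spoken word closest to the spoken query,
# 2) take that support item's paired image,
# 3) pick the matching-set image closest to it.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_shot_match(query_speech, support, matching_images):
    # support: list of (speech_embedding, image_embedding) pairs
    _, paired_image = min(support, key=lambda p: sq_dist(query_speech, p[0]))
    return min(range(len(matching_images)),
               key=lambda i: sq_dist(paired_image, matching_images[i]))

support = [([0.0, 1.0], [5.0, 5.0]),   # e.g. spoken "one" paired with an image of "1"
           ([1.0, 0.0], [9.0, 9.0])]   # e.g. spoken "two" paired with an image of "2"
matching_images = [[5.1, 4.9], [8.8, 9.2]]
query = [0.9, 0.1]                     # a new utterance of "two"
assert one_shot_match(query, support, matching_images) == 1
```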

slide-49
SLIDE 49

The most important missing parts

slide-57
SLIDE 57

What I think is still missing

  • Engineering/technical: Generic ways to incorporate domain knowledge
  • Scientific: What are the mechanisms used for learning language?
  • What are useful, practical applications that we should be working on?

(Instead of just spending time in the shower)

  • Real test cases on real low-resource languages

“. . . while the authors did make an effort to artificially limit the data availability, I don’t think the main claims of the paper . . . is generalizable to actual low-resource languages . . . ”

— Reviewer

  • Getting data for these test cases


slide-58
SLIDE 58

http://www.kamperh.com/ https://github.com/kamperh/

slide-59
SLIDE 59

Compression: Autoencoder

[Diagram: input x → latent h → reconstruction x̂]

Loss for a single training example: J = ||x − x̂||²

slide-60
SLIDE 60

Vector-quantised variational autoencoder (VQ-VAE)

[Diagram: input x → encoder output h → select closest codebook embedding from e_1, …, e_K → z → reconstruction x̂]

z = e_k where k = argmin_j ||h − e_j||², j = 1, …, K

slide-61
SLIDE 61

Vector-quantised variational autoencoder (VQ-VAE)

  • Loss for a single training example:
    J = −log p(x|z) + ||sg(h) − z||² + β||h − sg(z)||²
  • Assuming a spherical Gaussian output:
    J = α||x − x̂||² + ||sg(h) − z||² + β||h − sg(z)||²
  • Explicitly denoting the selected embedding:
    J = α||x − x̂||² + ||sg(h) − e_k||² + β||h − sg(e_k)||²
  • ||x − x̂||² is the reconstruction loss
  • ||sg(h) − e_k||² updates the embedding codebook, with sg denoting the stop-gradient
  • ||h − sg(e_k)||² is the commitment loss, which encourages the encoder output h to lie close to the selected codebook embedding e_k

slide-62
SLIDE 62

Vector-quantised variational autoencoder (VQ-VAE)

[Diagram: input x → encoder output h → select closest codebook embedding from e_1, …, e_K → z → reconstruction x̂]

z = e_k where k = argmin_j ||h − e_j||², j = 1, …, K

J = α||x − x̂||² + ||sg(h) − e_k||² + β||h − sg(e_k)||²

slide-63
SLIDE 63

Vector-quantised variational autoencoder (VQ-VAE)

  • Quantisation in VQ-VAE: z = e_k where k = argmin_j ||h − e_j||²
  • For backpropagation we need ∂J/∂h
  • Chain rule: ∂J/∂h = (∂z/∂h)(∂J/∂z)
  • What is ∂z/∂h with z = closest(e_1, …, e_K)? Cannot solve directly
  • Idea: If z ≈ h, then we could use ∂J/∂h ≈ ∂J/∂z
  • ||sg(h) − e_k||² + β||h − sg(e_k)||² adds an incentive for z ≈ h

[Diagram: encoder output h → select closest codebook embedding from e_1, …, e_K → z]
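The "use ∂J/∂h ≈ ∂J/∂z" idea on this slide is the straight-through estimator: in the backward pass the gradient at z is copied unchanged to h. A toy numeric sketch with an assumed loss J(z) = ||z − target||² (the codebook and target values are illustrative):

```python
# Straight-through estimator sketch: the quantisation z = e_k has no
# usable gradient, so during backprop the gradient at z is copied
# straight through to h (reasonable to the extent that z ≈ h).

def quantise(h, codebook):
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(codebook, key=lambda e: sq(h, e))

codebook = [[0.0], [1.0], [2.0]]
target = [1.5]
h = [0.9]

z = quantise(h, codebook)                            # forward: z = [1.0]
grad_z = [2 * (zi - t) for zi, t in zip(z, target)]  # dJ/dz for J = ||z - target||^2
grad_h = grad_z                                      # straight-through: dJ/dh := dJ/dz
assert grad_h == [-1.0]
```

The copied gradient is only a sensible update direction for h when h sits near its selected embedding, which is exactly what the codebook and commitment terms in the VQ-VAE loss encourage.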

slide-64
SLIDE 64

Vector-quantised variational autoencoder (VQ-VAE)

  • So, why not just use J = ||x − x̂||²?
  • Then there is no incentive for z ≈ h
  • Why not just add ||h − z||²?
  • We might want to update h and the selected embedding z = e_k at different rates
  • I.e., we might still want h to sometimes pick different embeddings in the codebook so that these get updated (think about how we add noise in the standard STE)
  • Answer to both questions above: it works better

[Diagram: encoder output h → select closest codebook embedding from e_1, …, e_K → z]