(Outrageously∗) Low-Resource Speech Processing — PowerPoint PPT Presentation



slide-2
SLIDE 2

(Outrageously∗) Low-Resource Speech Processing

NLP @ Deep Learning Indaba, Kenya, 2019 Herman Kamper

E&E Engineering, Stellenbosch University, South Africa http://www.kamperh.com/

∗Title plagiarised from Jade Abbott’s DLI talk

slide-4
SLIDE 4

Supervised speech recognition

[Figure: example speech with its transcription, “i had to think”]

Some example speech, since speech recognition is really cool.

slide-5
SLIDE 5

Unsupervised (“zero-resource”) speech processing

My problem: What can we learn if we do not have any labels?



slide-11
SLIDE 11

Example: Query-by-example speech search

Spoken query:

A useful speech system that does not require any transcribed speech

[Jansen and Van Durme, Interspeech’12]
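Query-by-example search of this kind is typically built on dynamic time warping (DTW), which aligns the spoken query against the search audio and is robust to differences in speaking rate. A minimal pure-Python sketch of the alignment idea (the one-dimensional "frames" and the absolute-difference distance are toy stand-ins for real speech features, not the cited system):

```python
# Minimal dynamic time warping (DTW): aligns two sequences of
# feature "frames" and returns the accumulated alignment cost.

def dtw(query, utterance, dist=lambda a, b: abs(a - b)):
    n, m = len(query), len(utterance)
    INF = float("inf")
    # cost[i][j]: best cost of aligning query[:i] with utterance[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(query[i - 1], utterance[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # query frame stretched
                                 cost[i][j - 1],      # utterance frame stretched
                                 cost[i - 1][j - 1])  # frames matched
    return cost[n][m]

# The same "word" spoken more slowly aligns cheaply; a different word does not.
query = [1.0, 2.0, 3.0, 2.0]
same_word_slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0]
other_word = [5.0, 5.0, 6.0, 5.0]
assert dtw(query, same_word_slow) < dtw(query, other_word)
```

In a real system the query would additionally be slid over each utterance (subsequence DTW), and low-cost matches returned as search hits.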

slide-15
SLIDE 15

Outrageously low-resource = unsupervised speech processing (outline)

  • Why is this problem so important?

Will try to convince you that this is (one of) the most fundamental machine learning problems, with real impactful applications

  • What are the key ideas needed to tackle this problem?

Hopefully you will get some useful tools

  • What is still missing?

What are the open problems and research questions which still need to be solved (according to me)


slide-16
SLIDE 16

Why is this problem so important?


slide-20
SLIDE 20
  • 1. A fundamental machine learning problem

Problems in unsupervised speech processing:

  • Learning useful representations from unlabelled speech
  • Segmenting, clustering and discovering longer-spanning (word- or phrase-like) patterns
  • Combined problem of perception, structure, continuous and discrete variables

“The goal of machine learning is to develop methods that can automatically detect patterns in data . . . ” — Murphy

“Extract important patterns and trends, and understand ‘what the data says’ . . . ” — Hastie, Tibshirani, Friedman

“The problem of searching for patterns in data is . . . fundamental . . . ” — Bishop


slide-24
SLIDE 24
  • 2. Universal speech technology

“Imagine a world in which every single human being can freely share in the sum of all knowledge.”

— Mission statement stolen from Laura Martinus, who stole it from the Wikimedia Foundation

https://15.wikipedia.org/endowment.html

slide-25
SLIDE 25
  • 2. Universal speech technology

UN Pulse Lab, Kampala
https://www.kpvu.org/post/turn-tune-transcribe-un-develops-radio-listening-tool

slide-26
SLIDE 26
  • 2. Universal speech technology

[Diagram: existing system (live radio stream → PREPROCESS → ASR → KEYWORD SEARCH → HUMAN ANALYSTS → DATABASE, passing speech, lattices, and keywords with timings and probabilities); proposed system (the ASR and keyword search replaced by a CNN-DTW KEYWORD SPOTTER)]

[Saeb et al., 2017; Menon et al., 2018]

slide-27
SLIDE 27
  • 2. Universal speech technology

[Renkens, PhD’18]

slide-28
SLIDE 28
  • 2. Universal speech technology

Linguistic and cultural documentation and preservation:

http://www.stevenbird.net/

slide-30
SLIDE 30
  • 3. Understanding human language acquisition
  • Cognitive modelling: try to uncover learning mechanisms in humans
  • A model of human language acquisition: can probe easily
  • Example applications:
    — Identify hearing loss early
    — Predict learning difficulties
    — How much do we need to talk to infants?

https://bergelsonlab.com/seedlings/

slide-31
SLIDE 31

Three ideas to tackle these problems

slide-36
SLIDE 36
  • 1. Build in the (domain) knowledge we have
  • Pushing the model in a direction: inductive bias, Bayesian priors, regularisation, data augmentation
  • In unsupervised learning this is all we have
  • We know a lot about languages in general
  • Example: Although speech sounds are produced differently in different languages, there are aspects which are shared

slide-37
SLIDE 37
  • 1. Build in the (domain) knowledge we have

Share representations across languages:

[Diagram: multilingual bottleneck network with MFCC + i-vector input, shared hidden layers, a BNF (bottleneck feature) layer, and separate output layers for Korean, French and German]

[Hermann and Goldwater, 2018; Hermann et al., 2018; https://arxiv.org/abs/1811.04791]

slide-38
SLIDE 38
  • 1. Build in the (domain) knowledge we have

[Plot: average precision (%) against hours of training data for six languages (ES, HA, HR, SV, TR, ZH), comparing BNF 1, BNF 2, EN cAE, UTD cAE and gold cAE features]

[Hermann and Goldwater, 2018; Hermann et al., 2018; https://arxiv.org/abs/1811.04791]

slide-39
SLIDE 39
  • 2. Compression


slide-40
SLIDE 40
  • 2. Compression

Autoencoder:

[Diagram: input x → latent h → reconstruction x̂]

Loss for a single training example: J = ||x − x̂||²
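The autoencoder loss above is just the squared reconstruction error. A quick sketch in pure Python (the input and reconstruction vectors are illustrative values, not model outputs):

```python
# Autoencoder training loss for a single example:
# J = ||x - x_hat||^2, the squared reconstruction error.

def reconstruction_loss(x, x_hat):
    return sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))

x = [0.5, -1.0, 2.0]       # input frame (illustrative values)
x_hat = [0.4, -0.8, 2.1]   # decoder output
J = reconstruction_loss(x, x_hat)
assert abs(J - 0.06) < 1e-9   # 0.1^2 + 0.2^2 + 0.1^2
```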

slide-41
SLIDE 41
  • 2. Compression

Vector-quantised variational autoencoder (VQ-VAE):

[Diagram: input x → encoder output h → select closest codebook embedding from e_1, …, e_K → z → reconstruction x̂]

z = e_k where k = argmin_j ||h − e_j||², j = 1, …, K

J = α||x − x̂||² + ||sg(h) − e_k||² + β||h − sg(e_k)||²
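The quantisation step and the three-term loss can be sketched directly. Pure Python with toy vectors; note that sg(·) only matters for gradients, so it has no effect on the loss values computed here:

```python
# VQ-VAE forward-pass sketch: quantise the encoder output h to its
# nearest codebook embedding, then compute the three loss terms.

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantise(h, codebook):
    """z = e_k where k = argmin_j ||h - e_j||^2"""
    k = min(range(len(codebook)), key=lambda j: sq_dist(h, codebook[j]))
    return k, codebook[k]

def vqvae_loss(x, x_hat, h, e_k, alpha=1.0, beta=0.25):
    recon = alpha * sq_dist(x, x_hat)    # reconstruction loss
    codebook_loss = sq_dist(h, e_k)      # ||sg(h) - e_k||^2: moves e_k toward h
    commit = beta * sq_dist(h, e_k)      # beta ||h - sg(e_k)||^2: moves h toward e_k
    return recon + codebook_loss + commit

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]   # e_1 ... e_K (toy values)
h = [0.9, 1.2]                                    # encoder output
k, z = quantise(h, codebook)
assert k == 1 and z == [1.0, 1.0]

J = vqvae_loss(x=[1.0, 1.0], x_hat=[0.9, 1.1], h=h, e_k=z)
```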

slide-42
SLIDE 42
  • 2. Compression: An example from our group

Benjamin van Niekerk, André Nortje

[Audio examples: input and synthesised output for English and Indonesian]

https://arxiv.org/abs/1904.07556

[Diagram: compression model (waveform x_1:T → MFCCs → encoder → discretise → z_1:N) followed by a symbol-to-speech module (speaker ID embedding; decoder → filterbanks h_1:N → FFTNet vocoder → waveform ŷ_1:T)]

slide-43
SLIDE 43
  • 3. Learning from multiple modalities


slide-44
SLIDE 44
  • 3. Learning from multiple modalities

[Diagram: an image branch (VGG → max pooling → y_vis) and a speech branch (conv → max pooling → feedforward → y_spch), trained with a distance d(y_vis, y_spch)]

[Harwath et al., NeurIPS’16]
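The distance d(y_vis, y_spch) can be any embedding distance; a cosine-distance sketch on hypothetical pre-computed embeddings (all vectors below are made up for illustration, not outputs of the cited model):

```python
# Multimodal matching idea: embed an image as y_vis and a spoken
# caption as y_spch; matched pairs should be close under d(., .).
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

y_vis = [0.8, 0.1, 0.6]            # image embedding (illustrative)
y_spch_match = [0.7, 0.2, 0.5]     # embedding of the matching spoken caption
y_spch_other = [-0.5, 0.9, -0.2]   # embedding of an unrelated caption

assert cosine_distance(y_vis, y_spch_match) < cosine_distance(y_vis, y_spch_other)
```

Training then pushes matched pairs together and mismatched pairs apart (e.g. with a margin loss) so that retrieval reduces to a nearest-neighbour search under this distance.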

slide-48
SLIDE 48
  • 3. Learning from multiple modalities

One-shot multimodal learning and matching:

Query: spoken “two”

[Diagram: multimodal one-shot learning from a support set of paired spoken words and images, followed by multimodal one-shot matching of the query against a matching set]

[Eloff et al., ICASSP’19; https://arxiv.org/abs/1811.03875]

Ryan Eloff, Leanne Nortje
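One way to sketch the matching step is plain nearest-neighbour search over embeddings: find the support-set spoken word closest to the query, take its paired image, and match that against the matching set. All embeddings below are toy values, not outputs of the actual model:

```python
# One-shot multimodal matching, nearest-neighbour sketch:
# 1) find the support-set spoken word closest to the spoken query,
# 2) take that support item's paired image,
# 3) pick the matching-set image closest to it.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_shot_match(query_speech, support, matching_images):
    # support: list of (speech_embedding, image_embedding) pairs
    _, paired_image = min(support, key=lambda p: sq_dist(query_speech, p[0]))
    return min(range(len(matching_images)),
               key=lambda i: sq_dist(paired_image, matching_images[i]))

support = [([0.0, 1.0], [5.0, 5.0]),   # e.g. spoken "one" paired with an image of "1"
           ([1.0, 0.0], [9.0, 9.0])]   # e.g. spoken "two" paired with an image of "2"
matching_images = [[5.1, 4.9], [8.8, 9.2]]
query = [0.9, 0.1]                     # a new utterance of "two"
assert one_shot_match(query, support, matching_images) == 1
```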

slide-49
SLIDE 49

The most important missing parts

slide-57
SLIDE 57

What I think is still missing

  • Engineering/technical: Generic ways to incorporate domain knowledge
  • Scientific: What are the mechanisms used for learning language?
  • What are useful, practical applications that we should be working on?

(Instead of just spending time in the shower)

  • Real test cases on real low-resource languages

“. . . while the authors did make an effort to artificially limit the data availability, I don’t think the main claims of the paper . . . is generalizable to actual low-resource languages . . . ”

— Reviewer

  • Getting data for these test cases


slide-58
SLIDE 58

http://www.kamperh.com/ https://github.com/kamperh/

slide-59
SLIDE 59

Compression: Autoencoder

[Diagram: input x → latent h → reconstruction x̂]

Loss for a single training example: J = ||x − x̂||²

slide-60
SLIDE 60

Vector-quantised variational autoencoder (VQ-VAE)

[Diagram: input x → encoder output h → select closest codebook embedding from e_1, …, e_K → z → reconstruction x̂]

z = e_k where k = argmin_j ||h − e_j||², j = 1, …, K

slide-61
SLIDE 61

Vector-quantised variational autoencoder (VQ-VAE)

  • Loss for a single training example:
    J = −log p(x|z) + ||sg(h) − z||² + β||h − sg(z)||²
  • Assuming a spherical Gaussian output:
    J = α||x − x̂||² + ||sg(h) − z||² + β||h − sg(z)||²
  • Explicitly denoting the selected embedding:
    J = α||x − x̂||² + ||sg(h) − e_k||² + β||h − sg(e_k)||²
  • ||x − x̂||² is the reconstruction loss
  • ||sg(h) − e_k||² updates the embedding codebook, with sg denoting the stop-gradient
  • ||h − sg(e_k)||² is the commitment loss, which encourages the encoder output h to lie close to the selected codebook embedding e_k

slide-62
SLIDE 62

Vector-quantised variational autoencoder (VQ-VAE)

[Diagram: input x → encoder output h → select closest codebook embedding from e_1, …, e_K → z → reconstruction x̂]

z = e_k where k = argmin_j ||h − e_j||², j = 1, …, K

J = α||x − x̂||² + ||sg(h) − e_k||² + β||h − sg(e_k)||²

slide-63
SLIDE 63

Vector-quantised variational autoencoder (VQ-VAE)

  • Quantisation in VQ-VAE: z = e_k where k = argmin_j ||h − e_j||²
  • For backpropagation we need ∂J/∂h
  • Chain rule: ∂J/∂h = (∂z/∂h)(∂J/∂z)
  • What is ∂z/∂h with z = closest(e_1, …, e_K)? Cannot solve directly
  • Idea: If z ≈ h, then we could use ∂J/∂h ≈ ∂J/∂z
  • ||sg(h) − e_k||² + β||h − sg(e_k)||² adds an incentive for z ≈ h

[Diagram: encoder output h → select closest codebook embedding from e_1, …, e_K → z]
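The "use ∂J/∂h ≈ ∂J/∂z" idea on this slide is the straight-through estimator: in the backward pass the gradient at z is copied unchanged to h. A toy numeric sketch with an assumed loss J(z) = ||z − target||² (the codebook and target values are illustrative):

```python
# Straight-through estimator sketch: the quantisation z = e_k has no
# usable gradient, so during backprop the gradient at z is copied
# straight through to h (reasonable to the extent that z ≈ h).

def quantise(h, codebook):
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(codebook, key=lambda e: sq(h, e))

codebook = [[0.0], [1.0], [2.0]]
target = [1.5]
h = [0.9]

z = quantise(h, codebook)                            # forward: z = [1.0]
grad_z = [2 * (zi - t) for zi, t in zip(z, target)]  # dJ/dz for J = ||z - target||^2
grad_h = grad_z                                      # straight-through: dJ/dh := dJ/dz
assert grad_h == [-1.0]
```

The copied gradient is only a sensible update direction for h when h sits near its selected embedding, which is exactly what the codebook and commitment terms in the VQ-VAE loss encourage.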

slide-64
SLIDE 64

Vector-quantised variational autoencoder (VQ-VAE)

  • So, why not just use J = ||x − x̂||²?
  • Then there is no incentive for z ≈ h
  • Why not just add ||h − z||²?
  • We might want to update h and the selected embedding z = e_k at different rates
  • I.e., we might still want h to sometimes pick different embeddings in the codebook so that these get updated (think about how we add noise in the standard STE)
  • Answer to both questions above: it works better

[Diagram: encoder output h → select closest codebook embedding from e_1, …, e_K → z]