SLIDE 1

Unsupervised neural feature learning for speech using weak top-down constraints

Maties Machine Learning (MML), Oct. 2017

Herman Kamper

Stellenbosch University

http://www.kamperh.com/

SLIDE 10

Success in speech recognition

[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]

  • Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
  • An addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text

  • But, there are around 7000 languages spoken in the world today

1 / 18


SLIDE 16

Why learn without labels?

  • Get insight into human language acquisition [Räsänen and Rasilo, ’15]

  • Language acquisition in robots [Roy, ’99]; [Renkens and Van hamme, ’15]
  • Analysis of audio for unwritten languages [Besacier et al., ’14]
  • New insights and models for speech processing [Jansen et al., ’13]

3 / 18

SLIDE 20

Unsupervised term discovery (UTD)

[Park and Glass, TASLP’08]

4 / 18

SLIDE 25

Example: Query-by-example search

Spoken query:

A useful speech system that does not require any transcribed speech

[Jansen and Van Durme, IS’12; Saeb et al., IS’17; Settle et al., IS’17]

5 / 18
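As a rough illustration of how such query-by-example search can work (a toy sketch, not the talk's actual system): rank candidate utterances by their dynamic time warping (DTW) alignment cost to the spoken query's feature frames. The function names `dtw_cost` and `rank_utterances` are illustrative.

```python
import numpy as np

def dtw_cost(x, y):
    """Length-normalised dynamic time warping (DTW) cost between two
    sequences of feature frames, x of shape (n, d) and y of shape (m, d)."""
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between frames of x and frames of y.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

def rank_utterances(query, utterances):
    """Indices of candidate utterances, best match first."""
    costs = [dtw_cost(query, u) for u in utterances]
    return list(np.argsort(costs))
```

The better the frame-level features, the more reliably the true matches end up at the top of this ranking, which is why representation learning matters for this task.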

SLIDE 30

Unsupervised speech processing: Two problems

  • 1. Unsupervised frame-level representation learning:

[Diagram: speech frames passed through the feature extractor fa(·); its output features feed a downstream model]

  • 2. Unsupervised segmentation and clustering:

How do we discover meaningful units in unlabelled speech?

6 / 18

SLIDE 32

Unsupervised frame-level representation learning:

The Correspondence Autoencoder

Micha Elsner, Daniel Renshaw, Aren Jansen, Sharon Goldwater

SLIDE 35

Supervised representation learning using DNNs

[Diagram: deep neural network phone classifier]

  • Input: speech frame(s), e.g. MFCCs or filterbanks
  • Output: predicted phone states (ay, ey, k, v, . . . )
  • Feature extractor fa(·) learned from data; phone classifier learned jointly

Unsupervised modelling: no phone-class targets to train the network on

8 / 18

SLIDE 38

Autoencoder (AE) neural network

Input speech frame Reconstruct input

[Badino et al., ICASSP’14]

  • Completely unsupervised
  • But purely bottom-up
  • Can we use top-down information?
  • Idea: Unsupervised term discovery

9 / 18
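The autoencoder idea above can be sketched in a few lines of numpy (a toy illustration, not the model from the talk; the data, dimensions, and learning rate are made up): a single hidden layer is trained to reconstruct its input frames, and that hidden layer plays the role of the feature extractor fa(·).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "speech frames": 200 frames of 13-dimensional MFCC-like features.
X = rng.normal(size=(200, 13))

# One-hidden-layer autoencoder: encode to 6 dims, decode back to 13.
d_in, d_hid = X.shape[1], 6
W1 = rng.normal(scale=0.1, size=(d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.1, size=(d_hid, d_in)); b2 = np.zeros(d_in)

def forward(X):
    H = np.tanh(X @ W1 + b1)   # hidden code: the learned features fa(x)
    Y = H @ W2 + b2            # linear reconstruction of the input
    return H, Y

def loss(X):
    _, Y = forward(X)
    return float(np.mean((Y - X) ** 2))

lr = 0.05
first = loss(X)
for _ in range(300):
    H, Y = forward(X)
    G = 2 * (Y - X) / X.size            # dLoss/dY
    gW2 = H.T @ G; gb2 = G.sum(0)
    GH = (G @ W2.T) * (1 - H ** 2)      # backprop through tanh
    gW1 = X.T @ GH; gb1 = GH.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
final = loss(X)
```

Training needs only the frames themselves (no labels), which is exactly the "completely unsupervised but purely bottom-up" property the slide points out.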

SLIDE 40

Unsupervised term discovery (UTD)

Can we use these discovered word pairs to give weak top-down supervision?

10 / 18

SLIDE 43

Weak top-down supervision: Align frames

[Jansen et al., ICASSP’13]

11 / 18
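A hypothetical sketch of this alignment step: run DTW over the feature frames of the two discovered segments, then backtrack through the accumulated-cost matrix to recover aligned frame-index pairs, which then serve as weak supervision pairs.

```python
import numpy as np

def align_frames(x, y):
    """DTW-align two discovered word segments x (n, d) and y (m, d) and
    return the list of aligned frame-index pairs (i, j) along the path."""
    n, m = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from (n, m) to recover the minimum-cost alignment path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Each returned pair (i, j) says "frame i of the one word instance corresponds to frame j of the other", even when the two instances differ in duration.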

SLIDE 44

Autoencoder (AE)

Input speech frame Reconstruct input

12 / 18

SLIDE 47

Correspondence autoencoder (cAE)

[Diagram: input is a frame from one word; the training target is the aligned frame from the other word in the pair; the middle layer is the unsupervised feature extractor fa(·)]

Combine top-down and bottom-up information

13 / 18

SLIDE 48

Correspondence autoencoder (cAE)

[Diagram: a frame from one word, the unsupervised feature extractor, and a frame from the other word in the pair, with audio examples of both word instances]

14 / 18

SLIDE 49

Correspondence autoencoder (cAE)

[Diagram: training pipeline] (1) Train a stacked autoencoder on the speech corpus (pretraining) to initialize the weights; (2) run unsupervised term discovery on the corpus; (3) align the frames of each discovered word pair; (4) train the correspondence autoencoder, giving the unsupervised feature extractor

[Kamper et al., ICASSP’15]

15 / 18
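A toy sketch of the cAE training step, assuming the aligned frame pairs from the earlier steps are already available (the data, dimensions, and learning rate here are made up): the only change from a plain autoencoder is that the network is trained to predict the aligned frame from the *other* word in the pair rather than to reconstruct its own input.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy aligned frame pairs: X[i] and Y[i] stand in for DTW-aligned frames
# from two discovered instances of the same (unknown) word.
base = rng.normal(size=(300, 13))
X = base + 0.3 * rng.normal(size=base.shape)   # instance 1 (noisy)
Y = base + 0.3 * rng.normal(size=base.shape)   # instance 2 (noisy)

d_in, d_hid = 13, 6
W1 = rng.normal(scale=0.1, size=(d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.1, size=(d_hid, d_in)); b2 = np.zeros(d_in)

def forward(X):
    H = np.tanh(X @ W1 + b1)   # hidden code: the cAE features fa(x)
    return H, H @ W2 + b2

def cae_loss(X, Y):
    _, P = forward(X)
    return float(np.mean((P - Y) ** 2))   # target is the OTHER word's frame

lr = 0.05
loss_before = cae_loss(X, Y)
for _ in range(300):
    H, P = forward(X)
    G = 2 * (P - Y) / Y.size
    gW2 = H.T @ G; gb2 = G.sum(0)
    GH = (G @ W2.T) * (1 - H ** 2)
    gW1 = X.T @ GH; gb1 = GH.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
loss_after = cae_loss(X, Y)
```

Because nuisance variation (speaker, channel, noise) differs between the two instances while the word identity is shared, this objective pushes the hidden code to keep what the pair has in common, combining the top-down pair constraint with the bottom-up pretraining.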

SLIDE 50

Evaluation: Query-by-example search

Spoken query:

[Jansen and Van Durme, IS’12; Saeb et al., IS’17; Settle et al., IS’17]

16 / 18

SLIDE 52

Evaluation: Isolated word query-by-example

[Bar chart: average precision (0.0–0.5) for Autoencoder, UBM-GMM, TopUBM, and cAE features]

Extended: [Renshaw et al., IS’15] and [Yuan et al., IS’16]

17 / 18

SLIDE 53

Summary and conclusion

  • Introduced correspondence autoencoder (cAE) for unsupervised frame-level representation learning
  • Uses top-down information from unsupervised term discovery system
  • Uses bottom-up initialization on large speech corpus
  • Unsupervised neural network model that combines top-down and bottom-up information results in large intrinsic improvements
  • Links with language acquisition research
  • Future: More analysis; different domains; practical search systems

18 / 18

SLIDE 54

http://www.kamperh.com/
https://github.com/kamperh

SLIDE 72

Evaluation of features: same-different task

  • Word tokens in the evaluation set: “apple”, “pie”, “grape”, “apple”, “apple”, “like”, . . .
  • Take one token (e.g. “apple”) as the query; treat the remaining tokens as the terms to search
  • Compute the DTW distance di between the query and each term (d1, d2, d3, d4, . . . , dN)
  • If di < threshold, predict “same”; otherwise predict “different”
  • Checking these predictions against the true word labels, across all thresholds, measures how well the features discriminate words
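The same-different evaluation above can be sketched as follows (toy data and illustrative names, not the actual evaluation code): compute DTW distances between all token pairs, rank pairs by increasing distance, and summarize performance over all possible thresholds with average precision.

```python
import numpy as np

def dtw_cost(x, y):
    """Length-normalised DTW cost between frame sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

def average_precision(distances, same):
    """AP over pairs ranked by increasing distance; `same` marks pairs
    whose two tokens share the same word label."""
    order = np.argsort(distances)
    same = np.asarray(same)[order]
    hits = np.cumsum(same)
    precision = hits / np.arange(1, len(same) + 1)
    return float(np.sum(precision * same) / np.sum(same))

# Toy word tokens (sequences of 1-D frames) with their true labels.
tokens = {
    "apple1": np.array([[0.0], [1.0], [2.0]]),
    "apple2": np.array([[0.0], [0.0], [1.0], [2.0]]),
    "pie1":   np.array([[5.0], [6.0]]),
}
labels = {"apple1": "apple", "apple2": "apple", "pie1": "pie"}

names = list(tokens)
dists, same = [], []
for a in range(len(names)):
    for b in range(a + 1, len(names)):
        dists.append(dtw_cost(tokens[names[a]], tokens[names[b]]))
        same.append(labels[names[a]] == labels[names[b]])
ap = average_precision(dists, same)
```

Sweeping the decision threshold is implicit in the ranking: average precision rewards features that place same-word pairs ahead of different-word pairs at every cutoff, without committing to any single threshold.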
SLIDE 73

Maties Machine Learning (MML)

  • Send “subscribe mml” in subject line to sympa@sympa.sun.ac.za
  • Mailing list: mml@sympa.sun.ac.za
  • Bring together machine learning researchers from across Stellenbosch University

  • Format: Short (in)formal talks every second Friday over lunch
  • Focus on machine learning research
  • If you want to give a talk, or have any ideas, please let us know!
  • kamperh@sun.ac.za, wbrink@sun.ac.za