Unsupervised neural feature learning for speech using weak top-down constraints - PowerPoint PPT Presentation
Unsupervised neural feature learning for speech using weak top-down constraints
Maties Machine Learning (MML), Oct. 2017
Herman Kamper
Stellenbosch University
http://www.kamperh.com/
Success in speech recognition
[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]
- Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
- An addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text
- But, there are around 7000 languages spoken in the world today
1 / 18
Why learn without labels?
- Get insight into human language acquisition [Räsänen and Rasilo, '15]
- Language acquisition in robots [Roy, '99]; [Renkens and Van hamme, '15]
- Analysis of audio for unwritten languages [Besacier et al., '14]
- New insights and models for speech processing [Jansen et al., '13]
3 / 18
Unsupervised term discovery (UTD)
[Park and Glass, TASLP’08] 4 / 18
Example: Query-by-example search
Spoken query:
A useful speech system that does not require any transcribed speech
[Jansen and Van Durme, IS’12; Saeb et al., IS’17; Settle et al., IS’17]
5 / 18
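Query-by-example search of this kind can be prototyped with dynamic time warping (DTW) over frame sequences: the corpus entries closest to the spoken query under the DTW distance are returned first. A minimal pure-Python sketch, assuming Euclidean frame distances and toy feature vectors (illustrative only; real systems search over segments within utterances and use far richer features):

```python
def frame_dist(x, y):
    """Euclidean distance between two feature frames (lists of floats)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def dtw_cost(seq1, seq2):
    """DTW alignment cost between two frame sequences, normalized by path length."""
    n, m = len(seq1), len(seq2)
    INF = float("inf")
    cost = [[INF] * m for _ in range(n)]   # best accumulated cost to cell (i, j)
    steps = [[0] * m for _ in range(n)]    # length of the best path to (i, j)
    cost[0][0], steps[0][0] = frame_dist(seq1[0], seq2[0]), 1
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best, best_steps = INF, 0
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, best_steps = cost[pi][pj], steps[pi][pj]
            cost[i][j] = best + frame_dist(seq1[i], seq2[j])
            steps[i][j] = best_steps + 1
    return cost[n - 1][m - 1] / steps[n - 1][m - 1]

def rank_by_query(query, corpus):
    """Rank corpus utterance indices by DTW distance to the spoken query."""
    return [idx for _, idx in
            sorted((dtw_cost(query, utt), idx) for idx, utt in enumerate(corpus))]
```

Since both the query and the corpus are raw audio features, nothing here needs a transcription, which is what makes the approach usable without labels.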
Unsupervised speech processing: Two problems
- 1. Unsupervised frame-level representation learning: [Diagram: speech frames mapped through a learned feature extractor f_a(·) into a "cool model"]
- 2. Unsupervised segmentation and clustering: How do we discover meaningful units in unlabelled speech?
6 / 18
Unsupervised frame-level representation learning:
The Correspondence Autoencoder
Micha Elsner, Daniel Renshaw, Aren Jansen, Sharon Goldwater
Supervised representation learning using DNNs
[Diagram: input speech frame(s), e.g. MFCCs or filterbanks; a feature extractor f_a(·) learned from data; a jointly learned phone classifier predicting phone states such as "ay", "ey", "k", "v"]
Unsupervised modelling: no phone class targets to train the network on
8 / 18
Autoencoder (AE) neural network
[Diagram: an input speech frame is encoded and the network is trained to reconstruct its input]
[Badino et al., ICASSP'14]
- Completely unsupervised
- But purely bottom-up
- Can we use top-down information?
- Idea: Unsupervised term discovery
9 / 18
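The AE idea can be sketched with a tiny numpy network (illustrative data and dimensions, not the architecture from the talk): the network is trained to reconstruct its own input frame, and the hidden activations then serve as purely bottom-up features.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(frames, hidden_dim=8, lr=0.1, epochs=200):
    """One-hidden-layer autoencoder: tanh encoder, linear decoder, MSE loss."""
    d = frames.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, hidden_dim)); b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0.0, 0.1, (hidden_dim, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = np.tanh(frames @ W1 + b1)        # encoder: the feature extractor
        out = h @ W2 + b2                    # decoder reconstructs the input
        err = out - frames
        losses.append(float(np.mean(err ** 2)))
        # Plain gradient descent on the reconstruction error
        g_out = 2.0 * err / len(frames)
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)
        W2 -= lr * (h.T @ g_out); b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (frames.T @ g_h); b1 -= lr * g_h.sum(axis=0)

    def encode(x):
        return np.tanh(x @ W1 + b1)          # learned frame-level features
    return encode, losses

# Toy "speech frames": noisy points on a 1-D curve in 5-D space
t = rng.uniform(-1.0, 1.0, (64, 1))
frames = np.hstack([t, 2 * t, -t, 0.5 * t, t]) + rng.normal(0.0, 0.01, (64, 5))
encode, losses = train_autoencoder(frames)
```

The reconstruction loss falls as training proceeds, but the training signal comes only from the frames themselves, which is the "purely bottom-up" limitation the slide points out.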
Unsupervised term discovery (UTD)
Can we use these discovered word pairs to give weak top-down supervision?
10 / 18
Weak top-down supervision: Align frames
[Jansen et al., ICASSP’13] 11 / 18
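The frame-alignment step can be sketched in pure Python: DTW finds a minimum-cost warping path between the two discovered word segments, and each (i, j) step on the path pairs a frame of one word with a frame of the other. A sketch assuming Euclidean frame distances and toy features:

```python
def frame_dist(x, y):
    """Euclidean distance between two feature frames."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def dtw_align(seq_a, seq_b):
    """Return the minimum-cost DTW path as a list of (i, j) frame-index pairs."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]   # predecessor of each cell
    cost[0][0] = frame_dist(seq_a[0], seq_b[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best, arg = INF, None
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, arg = cost[pi][pj], (pi, pj)
            cost[i][j] = best + frame_dist(seq_a[i], seq_b[j])
            back[i][j] = arg
    path, ij = [], (n - 1, m - 1)           # trace back from the final cell
    while ij is not None:
        path.append(ij)
        ij = back[ij[0]][ij[1]]
    return path[::-1]

# Aligned frame pairs become (input, target) training pairs downstream
word_a = [[0.0], [1.0], [2.0]]
word_b = [[0.0], [2.0]]
pairs = [(word_a[i], word_b[j]) for i, j in dtw_align(word_a, word_b)]
```

The resulting frame pairs are the weak top-down supervision: no transcriptions were used, only the discovered word pairs.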
Autoencoder (AE)
[Diagram: an input speech frame is reconstructed at the output]
12 / 18
Correspondence autoencoder (cAE)
[Diagram: the input is a frame from one word; the network, whose hidden layers give the unsupervised feature extractor f_a(·), reconstructs the aligned frame from the other word in the pair]
Combine top-down and bottom-up information
13 / 18
Correspondence autoencoder (cAE)
[Diagram with audio examples: a frame from one word is mapped through the unsupervised feature extractor to the aligned frame from the other word in the pair]
14 / 18
Correspondence autoencoder (cAE)
Training pipeline:
(1) Train a stacked autoencoder on the speech corpus (pretraining)
(2) Run unsupervised term discovery on the corpus to find word pairs
(3) Align the frames of each discovered word pair
(4) Initialize with the pretrained weights and train the correspondence autoencoder, giving the unsupervised feature extractor
[Kamper et al., ICASSP'15]
15 / 18
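The cAE objective can be sketched with a tiny numpy network (illustrative data and dimensions, not the talk's exact setup): unlike a plain autoencoder, the target is not the input frame but the DTW-aligned frame from the other word in the discovered pair.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_cae(inputs, targets, hidden_dim=8, lr=0.1, epochs=200):
    """One-hidden-layer net mapping a frame of one word to the aligned
    frame of its discovered pair (tanh hidden layer, MSE loss)."""
    d = inputs.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, hidden_dim)); b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0.0, 0.1, (hidden_dim, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = np.tanh(inputs @ W1 + b1)
        out = h @ W2 + b2
        err = out - targets                  # reconstruct the *other* frame
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / len(inputs)
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)
        W2 -= lr * (h.T @ g_out); b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (inputs.T @ g_h); b1 -= lr * g_h.sum(axis=0)

    def encode(x):
        return np.tanh(x @ W1 + b1)          # the unsupervised feature extractor
    return encode, losses

# Toy aligned pairs: two noisy realizations of the same underlying frames,
# standing in for DTW-aligned frames from two instances of one word
t = rng.uniform(-1.0, 1.0, (64, 1))
clean = np.hstack([t, -t, 2 * t, t, 0.5 * t])
x_a = clean + rng.normal(0.0, 0.05, clean.shape)
x_b = clean + rng.normal(0.0, 0.05, clean.shape)
encode, losses = train_cae(x_a, x_b)
```

Because the noise in the two frames is independent, the network is pushed to keep what the pair shares (here the underlying curve; in speech, word identity) and to discard nuisance variation.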
Evaluation: Query-by-example search
Spoken query:
[Jansen and Van Durme, IS'12; Saeb et al., IS'17; Settle et al., IS'17]
16 / 18
Evaluation: Isolated word query-by-example
[Bar chart: average precision (0.0 to 0.5) for Autoencoder, UBM-GMM, TopUBM and cAE features]
Extended: [Renshaw et al., IS'15] and [Yuan et al., IS'16]
17 / 18
Summary and conclusion
- Introduced the correspondence autoencoder (cAE) for unsupervised frame-level representation learning
- Uses top-down information from an unsupervised term discovery system
- Uses bottom-up initialization on a large speech corpus
- An unsupervised neural network model that combines top-down and bottom-up information results in large intrinsic improvements
- Links with language acquisition research
- Future: More analysis; different domains; practical search systems
18 / 18
http://www.kamperh.com/ https://github.com/kamperh
Evaluation of features: same-different task
- Collect a set of spoken word tokens, e.g. "apple", "pie", "grape", "apple", "apple", "like"
- Treat each token in turn as the query (e.g. "apple"), and the remaining tokens ("pie", "grape", "apple", "apple", "like") as the terms to search
- Compute the DTW distance d_i between the query and every search term, giving d_1, d_2, d_3, ..., d_N
- If d_i < threshold, predict "same word"; otherwise predict "different" - each prediction is then correct (✓) or incorrect (✗) given the true word identities
- Sweeping the threshold over all distances summarizes feature quality as an average precision
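The threshold sweep in the same-different task can be summarized without choosing any single threshold: rank all word pairs by increasing DTW distance and compute the average precision of retrieving the true same-word pairs. A minimal sketch (ties in distance are not handled specially; names are illustrative):

```python
def average_precision(distances, same_labels):
    """Average precision for retrieving same-word pairs when all word pairs
    are ranked by increasing DTW distance (1 = same word, 0 = different)."""
    ranked = sorted(zip(distances, same_labels))
    total_same = sum(same_labels)
    hits, ap = 0, 0.0
    for rank, (_, is_same) in enumerate(ranked, start=1):
        if is_same:
            hits += 1
            ap += hits / rank        # precision at each correct retrieval
    return ap / total_same

# Perfectly separated distances give an average precision of 1.0
ap_good = average_precision([0.1, 0.2, 0.9, 1.0], [1, 1, 0, 0])
```

Better frame-level features shrink the DTW distances between same-word pairs relative to different-word pairs, which is exactly what this score rewards.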
Maties Machine Learning (MML)
- Send “subscribe mml” in subject line to sympa@sympa.sun.ac.za
- Mailing list: mml@sympa.sun.ac.za
- Bring together machine learning researchers from across Stellenbosch University
- Format: Short (in)formal talks every second Friday over lunch
- Focus on machine learning research
- If you want to give a talk, or have any ideas, please let us know!
- kamperh@sun.ac.za, wbrink@sun.ac.za