

SLIDE 1

Unsupervised speech processing using acoustic word embeddings

Herman Kamper

School of Informatics, University of Edinburgh → TTI at Chicago

MLSLP 2016: Spotlight invited talk


SLIDE 4

Unsupervised speech processing

  • Speech recognition applications are becoming widespread
  • Google Voice Search already supports more than 50 languages:

English, Spanish, German, . . . , Afrikaans, Zulu

  • But there are roughly 7000 languages spoken in the world!
  • Audio data are becoming available, even for languages spoken by only a few speakers, but these data are generally unlabelled
  • Goal: Unsupervised learning of linguistic structure directly from raw speech audio, in order to develop zero-resource speech technology


SLIDE 6

Motivation for unsupervised speech processing

Criticism:

  • There is always some labelled data to start with (e.g. from a related language)
  • With a small set of labelled data, this becomes a semi-supervised problem

Reasons for focusing on the purely unsupervised case:

  • Modelling infant language acquisition [Räsänen, 2012]
  • Language acquisition in robotics [Renkens and Van hamme, 2015]
  • Practical use of zero-resource technology: allow linguists to analyze and investigate unwritten languages [Besacier et al., 2014]
  • New insights and models for speech processing: e.g. unsupervised methods can improve supervised systems [Jansen et al., 2012]


SLIDE 9

Unsupervised segmentation and clustering

Full-coverage segmentation:


SLIDE 12

Segmental modelling for full-coverage segmentation

Previous models use explicit subword discovery directly on speech features, e.g. [Lee et al., 2015]. Our approach instead uses whole-word segmental representations, i.e. acoustic word embeddings [Kamper et al., IS'15; Kamper et al., TASLP'16].


SLIDE 15

Acoustic word embeddings

[Figure: variable-length segments Y1 and Y2 are mapped by fe(·) to embeddings xi ∈ Rd in d-dimensional space.]

Dynamic programming alignment has quadratic complexity in the segment lengths, while comparing fixed-dimensional embeddings takes linear time. Embeddings also make standard clustering methods directly applicable.
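To make the complexity claim concrete, here is a minimal numpy sketch (illustrative only, not the talk's implementation): aligning two variable-length feature sequences with dynamic time warping fills a T1 × T2 table, while comparing two fixed-dimensional embeddings is a single O(d) distance computation. The mean-pooled "embeddings" at the end are a deliberately crude stand-in for fe(·).

```python
import numpy as np

def dtw_cost(Y1, Y2):
    """Dynamic time warping alignment cost between two (T, d) feature
    sequences; the table fill is O(T1 * T2)."""
    T1, T2 = len(Y1), len(Y2)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            local = np.linalg.norm(Y1[i - 1] - Y2[j - 1])
            D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

def cosine_distance(x1, x2):
    """Distance between two fixed-dimensional embeddings; O(d)."""
    cos = np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
    return (1.0 - cos) / 2.0  # in [0, 1], as on the triplet-loss slide

rng = np.random.default_rng(0)
Y1 = rng.normal(size=(60, 13))   # two variable-length "MFCC" sequences
Y2 = rng.normal(size=(75, 13))
d_dtw = dtw_cost(Y1, Y2)         # quadratic-time comparison
x1, x2 = Y1.mean(axis=0), Y2.mean(axis=0)  # crude stand-in embeddings
d_emb = cosine_distance(x1, x2)  # linear-time comparison
```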


SLIDE 22

An unsupervised segmental Bayesian model

[Figure: the speech waveform is parametrized by fa(·) into acoustic frames y1:M; hypothesized word segments are mapped by fe(·) to embeddings xi = fe(yt1:t2); a Bayesian Gaussian mixture model assigns each embedding a probability p(xi|h−), coupling acoustic modelling and word segmentation.]
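The acoustic-modelling component can be sketched as follows. This is a hedged toy version, not the talk's model: spherical Gaussian components with a known variance and a zero-mean Gaussian prior on each component mean, with collapsed Gibbs sampling of the assignments. The p(xi|h−) term on the slide corresponds to the posterior predictive computed after removing point i; all hyperparameter values below are illustrative.

```python
import numpy as np

def gibbs_bgmm(X, K=5, alpha=1.0, sigma2=1.0, sigma2_0=25.0, iters=50, seed=1):
    """Collapsed Gibbs sampling for a toy Bayesian GMM over embeddings X (N, d):
    spherical components with known variance sigma2 and a N(0, sigma2_0 I)
    prior on each component mean. Returns component assignments z."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    z = rng.integers(K, size=N)
    for _ in range(iters):
        for i in range(N):
            z[i] = -1  # remove point i: the "h-" in p(x_i | h-)
            logp = np.empty(K)
            for k in range(K):
                mask = z == k
                n_k = mask.sum()
                # Posterior over the component mean given its current members
                prec = 1.0 / sigma2_0 + n_k / sigma2
                mu_post = (X[mask].sum(axis=0) / sigma2) / prec
                var_pred = sigma2 + 1.0 / prec  # posterior predictive variance
                # log [ (n_k + alpha/K) * N(x_i; mu_post, var_pred I) ]
                logp[k] = (np.log(n_k + alpha / K)
                           - 0.5 * d * np.log(2 * np.pi * var_pred)
                           - 0.5 * np.sum((X[i] - mu_post) ** 2) / var_pred)
            p = np.exp(logp - logp.max())
            z[i] = rng.choice(K, p=p / p.sum())
    return z

# Two well-separated synthetic "embedding" clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-10, 1, size=(20, 3)),
               rng.normal(10, 1, size=(20, 3))])
z = gibbs_bgmm(X)
```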


SLIDE 24

Applied to a small-vocabulary task

[Figure: discovered cluster IDs (33, 12, 47, 60, 66, 27, 83, 51, 92, 38, 63, 14, 89, 24, 85) plotted against ground truth word types: "one" to "nine", "oh" and "zero".]


SLIDE 26

Applied to a large-vocabulary task

[Figure: token, type and boundary F-scores (%, 10 to 70) for ZRSBaselineUTD (SI), UTDGraphCC (SI), SyllableSegOsc+ (SD), BayesSegMinDur-MFCC (SD) and BayesSegMinDur-cAE (SI).]

ZRSBaselineUTD: [Versteegh et al., 2015]; UTDGraphCC: [Lyzinski et al., 2015]; SyllableSegOsc+: [Räsänen et al., 2015]

SLIDE 27

Acoustic word embeddings

[Figure repeated: segments Y1 and Y2 mapped by fe(·) to embeddings xi ∈ Rd.]



SLIDE 30

Acoustic word embeddings

Useful for more than just unsupervised modelling:

  • Segmental conditional random field ASR [Maas et al., 2012] (figure: word hypotheses with features, e.g. "Andrew, f1=0" and "ran, f1=1")
  • Whole-word lattice rescoring [Bengio and Heigold, 2014]
  • Query-by-example search, e.g. [Chen et al., 2015] for "Okay Google"


SLIDE 45

Word classification CNN

Fully supervised approach [Bengio and Heigold, 2014]

[Figure: input segment Yi is passed through ×nconv convolution and max pooling layers and ×nfull fully connected layers to give the embedding xi = fe(Yi); a final softmax predicts the word type wi as a one-hot target 0 0 0 · · · 1 · · · 0 0.]
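The architecture on the slide can be sketched as an untrained numpy forward pass (the dimensions, filter widths and layer counts below are hypothetical, chosen only to show the data flow): nconv 1-D convolutions over time, max pooling over time, nfull fully connected layers giving the embedding xi = fe(Yi), and a softmax over word types.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def conv1d(Y, W):
    """Valid 1-D convolution over time: Y is (T, d_in), W is (width, d_in, d_out)."""
    width = W.shape[0]
    T = Y.shape[0] - width + 1
    return np.stack([np.tensordot(Y[t:t + width], W, axes=([0, 1], [0, 1]))
                     for t in range(T)])

def word_cnn(Y, conv_weights, full_weights, W_out):
    """Forward pass: convolutions -> max pool over time -> fully connected
    layers (giving the embedding x_i = f_e(Y_i)) -> softmax over word types."""
    h = Y
    for W in conv_weights:       # x n_conv
        h = relu(conv1d(h, W))
    h = h.max(axis=0)            # max pooling over time
    for W in full_weights:       # x n_full
        h = relu(W @ h)
    x_i = h                      # the acoustic word embedding
    logits = W_out @ x_i
    p = np.exp(logits - logits.max())
    return x_i, p / p.sum()      # embedding and p(w_i | Y_i)

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 13))                  # 100 frames of 13-dim features
conv_ws = [rng.normal(size=(9, 13, 32)) * 0.1,  # n_conv = 2 (illustrative)
           rng.normal(size=(5, 32, 32)) * 0.1]
full_ws = [rng.normal(size=(64, 32)) * 0.1]     # n_full = 1 (illustrative)
W_out = rng.normal(size=(1000, 64)) * 0.1       # 1000 word types (illustrative)
x, p = word_cnn(Y, conv_ws, full_ws, W_out)
```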


SLIDE 49

Word similarity Siamese CNN

Weak supervision we sometimes have [Thiollière et al., 2015]: known word pairs Strain = {(m, n) : (Ym, Yn) are of the same type}. Use the idea of Siamese networks [Bromley et al., 1993].

[Figure: Y1 and Y2 pass through the same network, giving x1 = fe(Y1) and x2 = fe(Y2), compared with a distance loss l(x1, x2).]

SLIDE 50

Triplet margin-based loss

Margin-based triplet hinge loss [Mikolov et al., 2013]:

ltriplets = max{0, m + dcos(x1, x2) − dcos(x1, x3)}

where dcos(x1, x2) = (1 − cos(x1, x2)) / 2 is the cosine distance between x1 and x2, and m is a margin parameter. The pair (x1, x2) is of the same word type, while (x1, x3) is of different types.
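The loss above transcribes directly into code; only the margin value below is an arbitrary illustrative choice:

```python
import numpy as np

def cosine_distance(x1, x2):
    """d_cos(x1, x2) = (1 - cos(x1, x2)) / 2, in [0, 1]."""
    cos = np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
    return (1.0 - cos) / 2.0

def triplet_loss(x1, x2, x3, m=0.15):
    """l_triplets = max{0, m + d_cos(x1, x2) - d_cos(x1, x3)}:
    (x1, x2) are embeddings of the same word type, (x1, x3) of different
    types. The margin m = 0.15 is an illustrative value, not the paper's."""
    return max(0.0, m + cosine_distance(x1, x2) - cosine_distance(x1, x3))

anchor = np.array([1.0, 0.0])
same = np.array([1.0, 0.1])          # nearly the same direction
diff = np.array([0.0, 1.0])          # orthogonal
loss_ok = triplet_loss(anchor, same, diff)   # 0: separated by more than the margin
loss_bad = triplet_loss(anchor, diff, same)  # positive: pulls the "same" pair closer
```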


SLIDE 52

Evaluation of acoustic word embeddings

[Figure: average precision (0.0 to 0.6) for the downsampling, reference vector, word classifier CNN and Siamese CNN embeddings.]

But the Siamese CNN still uses weak supervision. There is still work to be done for the purely unsupervised case, e.g. [Chung et al., IS'16].
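For reference, the "downsampling" baseline in the comparison is commonly implemented along these lines (a sketch in the spirit of Levin et al. [2013], not necessarily the exact variant evaluated here): keep k uniformly spaced frames of the segment and concatenate them into a fixed k·d-dimensional vector.

```python
import numpy as np

def downsample_embed(Y, k=10):
    """Embed a variable-length (T, d) segment as a fixed k*d vector by
    keeping k frames at uniformly spaced positions and flattening."""
    T = Y.shape[0]
    idx = np.linspace(0, T - 1, k).round().astype(int)
    return Y[idx].flatten()

rng = np.random.default_rng(0)
seg_a = rng.normal(size=(30, 13))    # two segments of different durations
seg_b = rng.normal(size=(120, 13))
x_a = downsample_embed(seg_a)
x_b = downsample_embed(seg_b)
# Both segments now live in the same 130-dimensional space.
```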


SLIDE 56

Looking forward

  • Much to be done in zero-resource speech processing
  • Core issues: evaluation; what do we want to discover?
  • Do these models allow us to model language acquisition in human infants?
  • Can these models be used for language acquisition in robotic applications?
  • Extensions to multiple modalities

SLIDE 57

Take-aways

  • Unsupervised, or zero-resource, speech processing is an important and cool problem
  • Segmental acoustic word embeddings are a sensible way to approach unsupervised segmentation and clustering, and are cool in general
  • It is interesting to look at speech problems from a different perspective: it allows you to play around with cool models, and gain new insights


SLIDE 60

Poster: Better features using the correspondence autoencoder

Two problems in zero-resource speech processing:

  1. Unsupervised segmentation and clustering
  2. Unsupervised frame-level representation learning: the feature extractor fa(·)

SLIDE 61

Code: https://github.com/kamperh

SLIDE 62

References I

  • Abdel-Hamid, O., Deng, L., Yu, D., and Jiang, H. (2013). Deep segmental neural networks for speech recognition. In Proc. Interspeech.
  • Bengio, S. and Heigold, G. (2014). Word embeddings for speech recognition. In Proc. Interspeech.
  • Besacier, L., Barnard, E., Karpov, A., and Schultz, T. (2014). Automatic speech recognition for under-resourced languages: A survey. Speech Commun., 56:85–100.
  • Bromley, J., Bentz, J. W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Säckinger, E., and Shah, R. (1993). Signature verification using a 'Siamese' time delay neural network. Int. J. Pattern Rec., 7(4):669–688.
SLIDE 63

References II

  • Chen, G., Parada, C., and Sainath, T. N. (2015). Query-by-example keyword spotting using long short-term memory networks. In Proc. ICASSP.
  • Chung, Y.-A., Wu, C.-C., Shen, C.-H., and Lee, H.-Y. (2016). Unsupervised learning of audio segment representations using sequence-to-sequence recurrent neural networks. In Proc. Interspeech.
  • Jansen, A. et al. (2013). A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition. In Proc. ICASSP.

SLIDE 64

References III

  • Kamper, H., Goldwater, S. J., and Jansen, A. (2015). Fully unsupervised small-vocabulary speech recognition using a segmental Bayesian model. In Proc. Interspeech.
  • Kamper, H., Jansen, A., and Goldwater, S. J. (2016a). Unsupervised word segmentation and lexicon discovery using acoustic word embeddings. IEEE/ACM Trans. Audio, Speech, Language Process., 24(4):669–679.
  • Kamper, H., Wang, W., and Livescu, K. (2016b). Deep convolutional acoustic word embeddings using word-pair side information. In Proc. ICASSP.

SLIDE 65

References IV

  • Lee, C.-y., O'Donnell, T., and Glass, J. R. (2015). Unsupervised lexicon discovery from acoustic input. Trans. ACL, 3:389–403.
  • Levin, K., Henry, K., Jansen, A., and Livescu, K. (2013). Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings. In Proc. ASRU.
  • Levin, K., Jansen, A., and Van Durme, B. (2015). Segmental acoustic indexing for zero resource keyword search. In Proc. ICASSP.
  • Lyzinski, V., Sell, G., and Jansen, A. (2015). An evaluation of graph clustering methods for unsupervised term discovery. In Proc. Interspeech.

SLIDE 66

References V

  • Maas, A. L., Miller, S. D., O'Neil, T. M., Ng, A. Y., and Nguyen, P. (2012). Word-level acoustic modeling with convolutional vector regression. In Proc. ICML Workshop Representation Learn.
  • Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Räsänen, O. J. (2012). Computational modeling of phonetic and lexical learning in early language acquisition: Existing models and future directions. Speech Commun., 54:975–997.

SLIDE 67

References VI

  • Räsänen, O. J., Doyle, G., and Frank, M. C. (2015). Unsupervised word discovery from speech using automatic segmentation into syllable-like units. In Proc. Interspeech.
  • Renkens, V. and Van hamme, H. (2015). Mutually exclusive grounding for weakly supervised non-negative matrix factorisation. In Proc. Interspeech.
  • Thiollière, R., Dunbar, E., Synnaeve, G., Versteegh, M., and Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. In Proc. Interspeech.

SLIDE 68

References VII

  • Versteegh, M., Thiollière, R., Schatz, T., Cao, X. N., Anguera, X., Jansen, A., and Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. In Proc. Interspeech.