
Deep convolutional acoustic word embeddings using word-pair side information

Herman Kamper1, Weiran Wang2, Karen Livescu2

1CSTR and ILCC, School of Informatics, University of Edinburgh, UK 2Toyota Technological Institute at Chicago, USA

ICASSP 2016


Introduction

◮ Most speech processing systems rely on deep architectures to classify speech frames into subword units (HMM triphone states).

◮ This requires a pronunciation dictionary for breaking words into subwords, and in many cases still makes frame-level independence assumptions.

◮ Some studies have started to reconsider whole words as the basic modelling unit [Heigold et al., 2012; Chen et al., 2015].


Segmental automatic speech recognition

Segmental conditional random field ASR [Maas et al., 2012]:

[Figure: segmental CRF word lattice with segment-level features, e.g. hypotheses "Andrew" (f1 = 0) and "ran" (f1 = 1).]

Whole-word lattice rescoring [Bengio and Heigold, 2014]:


Segmental query-by-example search

From [Levin et al., 2015]:

[Fig. 1. Diagram of the S-RAILS audio search system: audio segments and the query audio are embedded using LapEig; the segment embeddings are stored in an NN index, which is searched with the query embedding to produce the query result(s).]

[Chen et al., 2015]: Similar scheme for “Okay Google” using LSTMs. In this work, we also use a query-related task for evaluation.


Acoustic word embedding problem

Map each variable-duration speech segment Yi to an embedding xi = f(Yi) ∈ R^d in a fixed d-dimensional space, such that the embeddings f(Y1) and f(Y2) of segments of the same word type lie close together.


Reference vector method [Levin et al., 2013]

To embed a segment y_t1:t2, compute its distances Dist1, Dist2, . . . , Distm to each segment in a reference set Yref, giving a distance vector in R^m. Dimensionality reduction with a projection P ∈ R^(m×d) then yields the embedding xi = f(y_t1:t2) ∈ R^d in a fixed d-dimensional space.
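A rough sketch of this pipeline, not the authors' implementation: a minimal DTW gives the distances to the reference set, and an illustrative random projection stands in for the learned dimensionality reduction of Levin et al.

```python
import numpy as np

def dtw_distance(a, b):
    """Minimal dynamic time warping distance between two sequences of
    feature frames a (T1 x d) and b (T2 x d), with Euclidean frame cost
    and length normalization."""
    T1, T2 = len(a), len(b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2] / (T1 + T2)

def reference_vector_embedding(segment, ref_set, P):
    """Embed a variable-length segment as its vector of distances to the
    m reference segments, followed by a linear projection P (m x d).
    P is a stand-in for the learned dimensionality reduction."""
    dists = np.array([dtw_distance(segment, ref) for ref in ref_set])  # in R^m
    return dists @ P                                                   # in R^d

rng = np.random.default_rng(0)
ref_set = [rng.standard_normal((rng.integers(20, 40), 13)) for _ in range(10)]
P = rng.standard_normal((10, 4))  # m = 10 references -> d = 4 embedding
x = reference_vector_embedding(rng.standard_normal((30, 13)), ref_set, P)
print(x.shape)  # (4,)
```

Any two segments, whatever their durations, now map to vectors of the same fixed dimensionality.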

Word classification CNN [Bengio and Heigold, 2014]

An input segment Yi is passed through convolutional and max-pooling layers (×nconv), fully connected layers (×nfull), and a final softmax over word types, trained against the one-hot word label wi = [0 0 0 · · · 1 · · · 0 0]. The embedding xi = f(Yi) is taken from a layer below the softmax.

Supervision and side information

◮ The word classifier CNN assumes a corpus of labelled word segments.

◮ In some cases these might not be available.

◮ A weaker form of supervision we sometimes have (e.g. [Thiollière et al., 2015]) is known word pairs: Strain = {(m, n) : (Ym, Yn) are of the same type}

◮ This also aligns with the query / word discrimination task: do two speech segments contain instances of the same word? (We don't care about word identity.)

Can we use this weak supervision (sometimes called side information) to train an acoustic word embedding function f?
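The structure of Strain can be sketched as follows. In practice the pairs might come from an unsupervised term discovery system; here, purely for illustration, they are derived from a hypothetical list of type labels, and only pair membership (not the identities) would be used for training.

```python
from collections import defaultdict
from itertools import combinations

def same_word_pairs(labels):
    """Build S_train = {(m, n) : segments m and n are of the same type}
    from a list of type labels. Only the pairing itself is the side
    information; the labels are never used directly."""
    by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        by_type[label].append(idx)
    pairs = []
    for indices in by_type.values():
        pairs.extend(combinations(indices, 2))
    return pairs

labels = ["apple", "pie", "apple", "grape", "apple"]
print(same_word_pairs(labels))  # [(0, 2), (0, 4), (2, 4)]
```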

Word similarity Siamese CNN

Use idea of Siamese networks [Bromley et al., 1993].

Two networks with tied weights map the two segments of a word pair to embeddings x1 = f(Y1) and x2 = f(Y2); a distance-based loss l(x1, x2) is then computed between the embeddings.


Loss functions

The coscos2 loss [Synnaeve et al., 2014]:

l_coscos2(x1, x2) = (1 − cos(x1, x2)) / 2   if same
l_coscos2(x1, x2) = cos^2(x1, x2)           if different

Margin-based hinge loss [Mikolov, 2013]:

l_cos hinge = max {0, m + d_cos(x1, x2) − d_cos(x1, x3)}

where d_cos(x1, x2) = (1 − cos(x1, x2)) / 2 is the cosine distance between x1 and x2, and m is a margin parameter. Pair (x1, x2) are of the same word type, while (x1, x3) are of different types.
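Both losses can be written directly from these definitions. A minimal sketch; the margin value m = 0.15 is illustrative, not the value used in the paper.

```python
import numpy as np

def cos_sim(x1, x2):
    """Cosine similarity between two embedding vectors."""
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))

def coscos2_loss(x1, x2, same):
    """The coscos2 loss: (1 - cos)/2 pulls same-word pairs together,
    while cos^2 pushes different-word pairs towards orthogonality."""
    return (1 - cos_sim(x1, x2)) / 2 if same else cos_sim(x1, x2) ** 2

def cos_hinge_loss(x1, x2, x3, m=0.15):
    """Margin-based hinge loss: (x1, x2) are the same word, (x1, x3) are
    different words; the loss is zero once the different pair is at least
    margin m further apart (in cosine distance) than the same pair."""
    d = lambda a, b: (1 - cos_sim(a, b)) / 2  # cosine distance in [0, 1]
    return max(0.0, m + d(x1, x2) - d(x1, x3))
```

For identical same-pair embeddings and opposite different-pair embeddings, both losses are zero, as intended.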


Embedding evaluation: the same-different task

Proposed in [Carlin et al., 2011] and also used in [Levin et al., 2013].

Each word token (e.g. "apple") is treated in turn as a query, with the remaining tokens ("pie", "grape", "apple", "apple", "like", . . . ) as the terms to search. For every pair, the cosine distance di between the embeddings is computed, and the pair is predicted as the same word if di < threshold, otherwise as different. Sweeping the threshold and comparing predictions to the true labels yields a precision-recall curve, summarized as average precision (AP).
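One way to compute this metric, as a sketch: given precomputed pair distances and same/different labels, rank the pairs by distance and take the average of the precisions at the ranks where true same-word pairs are retrieved (the usual average precision definition).

```python
import numpy as np

def average_precision(distances, same_labels):
    """Same-different evaluation: rank all segment pairs by distance
    (smallest first) and compute the average precision of retrieving
    same-word pairs, i.e. the area under the precision-recall curve
    traced out by sweeping the decision threshold."""
    order = np.argsort(distances)
    labels = np.asarray(same_labels)[order]
    tp = np.cumsum(labels)                          # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)  # precision at each rank
    # Average the precision over ranks where a true same pair appears.
    return float(np.sum(precision * labels) / np.sum(labels))

# Toy example: three pairs; the single same-word pair has the smallest
# distance, so every threshold that retrieves it has precision 1.
print(average_precision([0.1, 0.7, 0.4], [True, False, False]))  # 1.0
```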

Experimental setup

◮ Speech from Switchboard is used for evaluation.

◮ Training set: 10k word tokens; sampled 100k training word pairs.

◮ Test set for same-different evaluation: 11k word tokens, 60.7M pairs, 3% produced by the same speaker.

◮ A comparable development set was used.


Network architectures: Word classifier CNN

39-dimensional padded MFCCs, npad = 200
→ 1-D convolution: 96 ReLU filters over 9 frames
→ Max-pooling over 3 units
→ 1-D convolution: 96 ReLU filters over 8 units
→ Max-pooling over 3 units
→ Fully connected: 1024 ReLU
→ Linear bottleneck (optional)
→ Softmax: 1061 classes
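The actual implementation used Theano; the following numpy sketch with untrained random weights only checks that the layer shapes above compose, and reads the embedding from the 1024-unit hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(X, W):
    """1-D convolution over time: X is (T, d_in), W is (n_filters, width, d_in).
    Returns (T - width + 1, n_filters) after ReLU."""
    width = W.shape[1]
    windows = np.stack([X[t:t + width].ravel()
                        for t in range(len(X) - width + 1)])
    return np.maximum(0.0, windows @ W.reshape(W.shape[0], -1).T)

def maxpool(X, size):
    """Non-overlapping max-pooling over time, dropping any remainder frames."""
    T = (len(X) // size) * size
    return X[:T].reshape(-1, size, X.shape[1]).max(axis=1)

# Illustrative (untrained) weights with the layer shapes from the slide.
W1 = rng.standard_normal((96, 9, 39)) * 0.01   # 96 filters over 9 frames
W2 = rng.standard_normal((96, 8, 96)) * 0.01   # 96 filters over 8 units
X = rng.standard_normal((200, 39))             # npad = 200 padded MFCC frames

h = maxpool(conv1d_relu(X, W1), 3)
h = maxpool(conv1d_relu(h, W2), 3)
h = h.ravel()                                  # flatten before the dense layers
Wf = rng.standard_normal((h.size, 1024)) * 0.01
x_embed = np.maximum(0.0, h @ Wf)              # 1024-unit ReLU layer = embedding
logits = x_embed @ (rng.standard_normal((1024, 1061)) * 0.01)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the 1061 word classes
print(x_embed.shape, probs.shape)
```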


Network architectures: Siamese CNN

Two towers with tied weights, each:
39-dimensional padded MFCCs, npad = 200
→ 1-D convolution: 96 ReLU filters over 9 frames
→ Max-pooling over 3 units
→ 1-D convolution: 96 ReLU filters over 8 units
→ Max-pooling over 3 units
→ Fully connected: 2048 ReLU
→ 1024 Linear
The two outputs x1 and x2 feed the distance-based loss l(x1, x2).


Results

Representation                                      Dim    AP
DTW:
  MFCCs with CMVN                                    39   0.214
  Correspondence autoencoder [Kamper et al., 2015]  100   0.469
Acoustic word embeddings:
  Reference vector approach [Levin et al., 2013]     50   0.365
  Word classifier CNN                              1061   0.532 ± 0.014
  Word classifier CNN                                50   0.474 ± 0.012
  Siamese CNN, l_coscos2 loss                      1024   0.342 ± 0.026
  Siamese CNN, l_cos hinge loss                    1024   0.549 ± 0.011
  Siamese CNN, l_cos hinge loss                      50   0.504 ± 0.011
  LDA on l_cos hinge, d = 1024                      100   0.545 ± 0.011

Varying dimensionalities on development data

[Plot: average precision (AP) against dimensionality of the acoustic embedding (log scale, 10 to 3000), for the word classifier CNN, the Siamese CNN with l_coscos2, the Siamese CNN with l_cos hinge, and the Siamese CNN with LDA.]


Summary and conclusion

◮ Introduced the Siamese CNN for obtaining acoustic word embeddings, and evaluated different cost functions.

◮ Evaluated using a word discrimination task, and showed similar performance to the word classifier CNN.

◮ For smaller dimensionalities, the Siamese CNN outperformed the classifier CNN.

◮ Self-criticism: evaluated on a small dataset (low-resource setting).

◮ Future work: sequence models, using embeddings for search and ASR.


Code

Neural networks (Theano): https://github.com/kamperh/couscous
Complete recipe: https://github.com/kamperh/recipe_swbd_wordembeds