SLIDE 1 Unsupervised speech processing using acoustic word embeddings
Herman Kamper
School of Informatics, University of Edinburgh → TTI at Chicago
MLSLP 2016: Spotlight invited talk
SLIDE 2 Unsupervised speech processing
- Speech recognition applications are becoming widespread
- Google Voice Search already supports more than 50 languages:
English, Spanish, German, . . . , Afrikaans, Zulu
- But there are roughly 7000 languages spoken in the world!
- Audio data are becoming available, even for languages spoken by
only a few speakers, but these data are generally unlabelled
- Goal: Unsupervised learning of linguistic structure directly from raw
speech audio, in order to develop zero-resource speech technology
SLIDE 5 Motivation for unsupervised speech processing
Criticism:
- There is always some labelled data to start with (e.g. from a related language)
- With a small set of labelled data, it becomes a semi-supervised problem
Reasons for focusing on the purely unsupervised case:
- Modelling infant language acquisition [Räsänen, 2012]
- Language acquisition in robotics [Renkens and Van hamme, 2015]
- Practical use of zero-resource technology: allow linguists to analyze
and investigate unwritten languages [Besacier et al., 2014]
- New insights and models for speech processing: e.g. unsupervised
methods can improve supervised systems [Jansen et al., 2012]
SLIDE 7 Unsupervised segmentation and clustering
Full-coverage segmentation:
SLIDE 10 Segmental modelling for full-coverage segmentation
Previous models use explicit subword discovery directly on speech features, e.g. [Lee et al., 2015].
Our approach uses whole-word segmental representations, i.e. acoustic word embeddings
[Kamper et al., IS'15; Kamper et al., TASLP'16]
SLIDE 13 Acoustic word embeddings
A variable-length speech segment Y is mapped to a fixed-dimensional embedding xi = fe(Y) ∈ Rd in d-dimensional space.
[Figure: segments Y1 and Y2 mapped to embeddings fe(Y1) and fe(Y2)]
Dynamic programming alignment of two segments has quadratic complexity, while comparing two embeddings takes linear time, and standard clustering methods can be applied directly in the embedding space.
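To make the complexity contrast on this slide concrete, here is a minimal pure-Python sketch; the toy sequences in the test and the simple downsampling embedding function are illustrative assumptions, not the talk's exact setup. DTW alignment of segments of lengths n and m costs O(nm), while comparing two fixed d-dimensional embeddings costs O(d):

```python
import math

def dtw_cost(a, b):
    """Dynamic time warping alignment cost between two variable-length
    sequences of feature vectors: O(len(a) * len(b))."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = math.dist(a[i - 1], b[j - 1])
            D[i][j] = dist + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def downsample_embed(segment, k=4):
    """A simple acoustic word embedding: keep k frames sampled uniformly
    in time and concatenate them, giving a fixed dimension k * d."""
    idx = [round(i * (len(segment) - 1) / (k - 1)) for i in range(k)]
    return [x for i in idx for x in segment[i]]

def cosine_distance(x, y):
    """O(d) comparison in the embedding space."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)
```

Two tokens of the same word type (similar trajectories at different speaking rates) should come out close under both measures, but only the embedding comparison avoids the quadratic alignment.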
SLIDE 16 An unsupervised segmental Bayesian model
[Figure: the model, built up in stages]
- Speech waveform → acoustic frames y1:M through frame-level feature extraction fa(·)
- Each hypothesized word segment yt1:t2 is mapped to an embedding xi = fe(yt1:t2)
- Acoustic modelling: a Bayesian Gaussian mixture model gives p(xi | h−)
- Word segmentation over the hypothesized segments
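The acoustic modelling step can be sketched as the component-assignment step of a collapsed Gibbs sampler for a Bayesian GMM. This is a minimal illustration under simplifying assumptions (spherical components with fixed variance, a zero-mean Gaussian prior on the component means, symmetric Dirichlet mixture weights with parameter alpha), not the talk's actual model or sampler:

```python
import math

def log_predictive(x, count, dim_sums, var=1.0, prior_var=5.0):
    """Log posterior predictive N(x; mu_p, var + var_p) of a spherical
    Gaussian component with fixed variance `var`, a zero-mean Gaussian
    prior with variance `prior_var` on its mean, `count` currently
    assigned embeddings and per-dimension sums `dim_sums`."""
    post_var = 1.0 / (1.0 / prior_var + count / var)
    pred_var = var + post_var
    logp = 0.0
    for xd, sd in zip(x, dim_sums):
        post_mean = post_var * sd / var
        logp += (-0.5 * math.log(2 * math.pi * pred_var)
                 - (xd - post_mean) ** 2 / (2 * pred_var))
    return logp

def assignment_probs(x, counts, sums, alpha=1.0):
    """One collapsed Gibbs step: p(z_i = k | x_i, h-) for each component,
    proportional to (N_k + alpha/K) * p(x_i | other data in component k)."""
    K = len(counts)
    logs = [math.log(counts[k] + alpha / K)
            + log_predictive(x, counts[k], sums[k]) for k in range(K)]
    mx = max(logs)                       # log-sum-exp normalization
    unnorm = [math.exp(l - mx) for l in logs]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]
```

An embedding that sits near one component's posterior mean should be assigned to that component with probability close to one.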
SLIDE 23 Applied to a small-vocabulary task
[Figure: discovered cluster IDs (33, 12, 47, 60, 66, 27, 83, 51, 92, 38, 63, 14, 89, 24, 85) mapped to the ground-truth word types zero, two, three, four, five, six, seven, eight, nine]
SLIDE 25 Applied to a large-vocabulary task
[Figure: word token, type and boundary F-scores (%), on a scale from 10 to 70, for ZRSBaselineUTD (SI), UTDGraphCC (SI), SyllableSegOsc+ (SD), BayesSegMinDur-MFCC (SD) and BayesSegMinDur-cAE (SI)]
ZRSBaselineUTD: [Versteegh et al., 2015]; UTDGraphCC: [Lyzinski et al., 2015]; SyllableSegOsc+: [Räsänen et al., 2015]
SLIDE 27 Acoustic word embeddings
[Figure repeated: segments Y1 and Y2 mapped to embeddings fe(Y1) and fe(Y2), xi ∈ Rd in d-dimensional space]
SLIDE 28 Acoustic word embeddings
Useful for more than just unsupervised modelling:
- Segmental conditional random field ASR [Maas et al., 2012], e.g. segment features (Andrew, f1=0), (ran, f1=1)
- Whole-word lattice rescoring [Bengio and Heigold, 2014]
- Query-by-example search, e.g. [Chen et al., 2015] for "Okay Google"
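In the query-by-example setting, the core idea reduces to nearest-neighbour search in embedding space: embed the spoken query and every indexed segment once, then rank by vector distance. A minimal sketch with illustrative function names, not the LSTM-based system of [Chen et al., 2015]:

```python
import math

def cosine_distance(x, y):
    """Cosine distance between two embedding vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in x)) *
                        math.sqrt(sum(b * b for b in y)))

def query_by_example(query_emb, segment_embs, top_n=3):
    """Rank indexed segment embeddings by distance to the query
    embedding; the nearest segments are the hypothesized matches."""
    ranked = sorted(range(len(segment_embs)),
                    key=lambda i: cosine_distance(query_emb, segment_embs[i]))
    return ranked[:top_n]
```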
SLIDE 31 Word classification CNN
Fully supervised approach [Bengio and Heigold, 2014]
[Figure: the CNN built up layer by layer: the word segment Yi passes through ×nconv convolution and max-pooling layers and ×nfull fully connected layers, giving the embedding xi = fe(Yi), followed by a softmax over the one-hot word label wi = (0 0 0 · · · 1 · · · 0 0)]
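A toy forward pass sketching the architecture on this slide, with illustrative weights and two simplifications: max pooling is taken over all of time, and the fully connected stack is collapsed into a single softmax layer. This is not the actual model of [Bengio and Heigold, 2014]; it only shows how the network turns a variable-length segment into a fixed-dimensional embedding and then word-label scores:

```python
import math

def conv1d(frames, filt):
    """Valid 1-D convolution over time with ReLU: the filter spans
    `width` consecutive frames across all feature dimensions."""
    width = len(filt)  # filt is a width x dim weight matrix
    out = []
    for t in range(len(frames) - width + 1):
        s = 0.0
        for w in range(width):
            s += sum(a * b for a, b in zip(filt[w], frames[t + w]))
        out.append(max(0.0, s))
    return out

def softmax(z):
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    Z = sum(e)
    return [v / Z for v in e]

def word_cnn(frames, filters, out_weights):
    """Toy forward pass: convolution plus max-over-time pooling gives a
    fixed-dimensional embedding x = fe(Y) regardless of input length;
    a softmax layer then scores the word labels."""
    embedding = [max(conv1d(frames, f)) for f in filters]  # one value per filter
    logits = [sum(w * x for w, x in zip(row, embedding)) for row in out_weights]
    return embedding, softmax(logits)
```

Segments of different lengths yield embeddings of the same dimensionality, which is the property the word classifier (and later the Siamese variant) relies on.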
SLIDE 46 Word similarity Siamese CNN
Weak supervision we sometimes have [Thiollière et al., 2015]: known word pairs Strain = {(m, n) : (Ym, Yn) are of the same type}
Use the idea of Siamese networks [Bromley et al., 1993]
[Figure: two tied networks map Y1 to x1 = fe(Y1) and Y2 to x2 = fe(Y2), trained with a distance-based loss l(x1, x2)]
SLIDE 50 Triplet margin-based loss
Margin-based triplet hinge loss [Mikolov et al., 2013]:
ltriplets = max{0, m + dcos(x1, x2) − dcos(x1, x3)}
where dcos(x1, x2) = (1 − cos(x1, x2))/2 is the cosine distance between x1 and x2, and m is a margin parameter. Pair (x1, x2) is of the same type; pair (x1, x3) is of different types.
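The loss on this slide written out directly (a minimal sketch; the default margin value and any test vectors are illustrative, not values from the talk):

```python
import math

def cosine_distance(x, y):
    """d_cos(x, y) = (1 - cos(x, y)) / 2, which lies in [0, 1]."""
    cos = sum(a * b for a, b in zip(x, y)) / (
        math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))
    return (1.0 - cos) / 2.0

def triplet_loss(x1, x2, x3, margin=0.15):
    """max{0, m + d_cos(x1, x2) - d_cos(x1, x3)}: the loss is zero once
    the different-type pair (x1, x3) is at least a margin m farther
    apart than the same-type pair (x1, x2)."""
    return max(0.0, margin + cosine_distance(x1, x2) - cosine_distance(x1, x3))
```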
SLIDE 51 Evaluation of acoustic word embeddings
[Figure: average precision (0.0 to 0.6) of downsampling, reference vector, word classifier CNN and Siamese CNN embeddings]
But the Siamese CNN still uses weak supervision; there is still work to be done for the purely unsupervised case, e.g. [Chung et al., IS'16].
SLIDE 53 Looking forward
- Much to be done in zero-resource speech processing
- Core issues: evaluation; what do we want to discover?
- Do these models allow us to model language acquisition in human infants?
- Can these models be used for language acquisition in robotic applications?
- Extensions to multiple modalities
SLIDE 57 Take-aways
- Unsupervised, or zero-resource, speech processing is an important and cool problem
- Segmental acoustic word embeddings are a sensible way to approach unsupervised segmentation and clustering, and are cool in general
- It is interesting to look at speech problems from a different perspective: it allows you to play around with cool models and gain new insights
SLIDE 58 Poster: Better features using the correspondence autoencoder
Two problems in zero-resource speech processing:
- 1. Unsupervised segmentation and clustering
- 2. Unsupervised frame-level representation learning: fa(·)
[Figure: the correspondence autoencoder ("cool model")]
SLIDE 61
Code: https://github.com/kamperh
SLIDE 62 References I
- Abdel-Hamid, O., Deng, L., Yu, D., and Jiang, H. (2013).
Deep segmental neural networks for speech recognition. In Proc. Interspeech.
- Bengio, S. and Heigold, G. (2014).
Word embeddings for speech recognition. In Proc. Interspeech.
- Besacier, L., Barnard, E., Karpov, A., and Schultz, T. (2014).
Automatic speech recognition for under-resourced languages: A survey. Speech Commun., 56:85–100.
- Bromley, J., Bentz, J. W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Säckinger, E., and Shah, R. (1993). Signature verification using a ‘Siamese’ time delay neural network. Int. J. Pattern Rec., 7(4):669–688.
SLIDE 63 References II
- Chen, G., Parada, C., and Sainath, T. N. (2015).
Query-by-example keyword spotting using long short-term memory networks. In Proc. ICASSP.
- Chung, Y.-A., Wu, C.-C., Shen, C.-H., and Lee, H.-Y. (2016). Unsupervised learning of audio segment representations using sequence-to-sequence recurrent neural networks. In Proc. Interspeech.
- Jansen, A. et al. (2013).
A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition. In Proc. ICASSP.
SLIDE 64 References III
- Kamper, H., Goldwater, S. J., and Jansen, A. (2015).
Fully unsupervised small-vocabulary speech recognition using a segmental Bayesian model. In Proc. Interspeech.
- Kamper, H., Jansen, A., and Goldwater, S. J. (2016a).
Unsupervised word segmentation and lexicon discovery using acoustic word embeddings. IEEE/ACM Trans. Audio, Speech, Language Process., 24(4):669–679.
- Kamper, H., Wang, W., and Livescu, K. (2016b).
Deep convolutional acoustic word embeddings using word-pair side information. In Proc. ICASSP.
SLIDE 65 References IV
- Lee, C.-y., O’Donnell, T., and Glass, J. R. (2015). Unsupervised lexicon discovery from acoustic input. Trans. ACL, 3:389–403.
- Levin, K., Henry, K., Jansen, A., and Livescu, K. (2013).
Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings. In Proc. ASRU.
- Levin, K., Jansen, A., and Van Durme, B. (2015).
Segmental acoustic indexing for zero resource keyword search. In Proc. ICASSP.
- Lyzinski, V., Sell, G., and Jansen, A. (2015).
An evaluation of graph clustering methods for unsupervised term discovery. In Proc. Interspeech.
SLIDE 66 References V
- Maas, A. L., Miller, S. D., O’Neil, T. M., Ng, A. Y., and Nguyen, P.
(2012). Word-level acoustic modeling with convolutional vector regression. In Proc. ICML Workshop Representation Learn.
- Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Räsänen, O. J. (2012). Computational modeling of phonetic and lexical learning in early language acquisition: Existing models and future directions. Speech Commun., 54:975–997.
SLIDE 67 References VI
- Räsänen, O. J., Doyle, G., and Frank, M. C. (2015). Unsupervised word discovery from speech using automatic segmentation into syllable-like units. In Proc. Interspeech.
- Renkens, V. and Van hamme, H. (2015).
Mutually exclusive grounding for weakly supervised non-negative matrix factorisation. In Proc. Interspeech.
- Thiollière, R., Dunbar, E., Synnaeve, G., Versteegh, M., and Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. In Proc. Interspeech.
SLIDE 68 References VII
- Versteegh, M., Thiollière, R., Schatz, T., Cao, X. N., Anguera, X., Jansen, A., and Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. In Proc. Interspeech.