 
Unsupervised speech processing using acoustic word embeddings
Herman Kamper
School of Informatics, University of Edinburgh → TTI at Chicago
MLSLP 2016: Spotlight invited talk
Unsupervised speech processing
• Speech recognition applications are becoming widespread
• Google Voice Search already supports more than 50 languages: English, Spanish, German, ..., Afrikaans, Zulu
• But there are roughly 7000 languages spoken in the world!
• Audio data are becoming available, even for languages spoken by only a few speakers, but are generally unlabelled
• Goal: Unsupervised learning of linguistic structure directly from raw speech audio, in order to develop zero-resource speech technology
1 / 17
Motivation for unsupervised speech processing
Criticism:
• Always some labelled data to start with (e.g. related language)
• Small set of labelled data: semi-supervised problem
Reasons for focusing on purely unsupervised case:
• Modelling infant language acquisition [Räsänen, 2012]
• Language acquisition in robotics [Renkens and Van hamme, 2015]
• Practical use of zero-resource technology: Allow linguists to analyze and investigate unwritten languages [Besacier et al., 2014]
• New insights and models for speech processing: E.g. unsupervised methods can improve supervised systems [Jansen et al., 2012]
2 / 17
Unsupervised segmentation and clustering
Full-coverage segmentation: [figure]
3 / 17
Segmental modelling for full-coverage segmentation
Previous models use explicit subword discovery directly on speech features, e.g. [Lee et al., 2015].
Our approach uses whole-word segmental representations, i.e. acoustic word embeddings [Kamper et al., IS’15; Kamper et al., TASLP’16].
4 / 17
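Acoustic word embeddings
$x_i \in \mathbb{R}^d$ in $d$-dimensional space [figure: segments $Y_1$ and $Y_2$ mapped to embeddings $f_e(Y_1)$ and $f_e(Y_2)$]
Dynamic programming alignment has quadratic complexity in the segment lengths, while comparing two embeddings takes time linear in $d$. Embeddings can also be used with standard clustering methods.
5 / 17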
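The complexity contrast on this slide can be made concrete with a small sketch. It assumes dynamic time warping (DTW) as the dynamic programming alignment, and it uses a hypothetical downsampling function `embed_downsample` as a stand-in for $f_e$; neither the function names nor the choice of downsampling come from the slides.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two frames (lists of floats)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def dtw_cost(a, b):
    """DTW alignment cost between two frame sequences.

    Fills a len(a) x len(b) table, so the work is quadratic in segment length.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = euclidean(a[i - 1], b[j - 1])
            table[i][j] = step + min(
                table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]
            )
    return table[n][m]


def embed_downsample(frames, k=3):
    """Hypothetical f_e (an assumption): keep k equally spaced frames, flattened.

    Every segment, whatever its length, maps to a vector of k * dim values.
    Requires k >= 2.
    """
    n = len(frames)
    picks = [(i * (n - 1)) // (k - 1) for i in range(k)]
    return [value for i in picks for value in frames[i]]


def cosine_distance(x, y):
    """Cosine distance between two embeddings: a single pass, linear in d."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return 1.0 - dot / norm
```

For two segments containing the same frame pattern at different speaking rates, both `dtw_cost` and `cosine_distance(embed_downsample(...), embed_downsample(...))` come out close to zero, but only the second avoids the quadratic table.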
An unsupervised segmental Bayesian model
[Figure: model diagram; labels from top to bottom]
• $p(x_i \mid h^-)$: Bayesian Gaussian mixture model
• Word segmentation
• Acoustic modelling
• Embeddings $x_i = f_e(y_{t_1:t_2})$, computed by $f_e(\cdot)$
• Acoustic frames $y_{1:M}$, computed by $f_a(\cdot)$
• Speech waveform
6 / 17
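The interplay between embedding candidate segments and scoring them under a cluster model can be sketched as follows. This is a heavily simplified illustration, not the model from the talk: it assumes one-dimensional frames, a hypothetical mean-of-frames `embed` in place of $f_e$, fixed cluster means with a squared-distance score standing in for $\log p(x_i \mid h^-)$, and a single best segmentation found by dynamic programming. All function names are hypothetical.

```python
def embed(segment):
    """Hypothetical f_e (an assumption): the mean of the frames in a segment."""
    return sum(segment) / len(segment)


def segment_score(segment, cluster_means):
    """Stand-in for log p(x_i | h^-): minus the squared distance to the closest mean."""
    x = embed(segment)
    return -min((x - mu) ** 2 for mu in cluster_means)


def best_segmentation(frames, cluster_means, max_len=4):
    """Return the (start, end) spans of the highest-scoring full-coverage segmentation.

    best[t] is the best total score of any segmentation of frames[:t];
    back[t] is where the last segment of that segmentation starts.
    """
    n = len(frames)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for t in range(1, n + 1):
        for s in range(max(0, t - max_len), t):
            score = best[s] + segment_score(frames[s:t], cluster_means)
            if score > best[t]:
                best[t], back[t] = score, s
    spans = []
    t = n
    while t > 0:
        spans.append((back[t], t))
        t = back[t]
    return spans[::-1]
```

With frames `[0.0, 0.0, 0.0, 5.0, 5.0, 5.0]` and cluster means `[0.0, 5.0]`, the sketch places a boundary between the third and fourth frames. Every frame ends up in exactly one segment, which is what makes the segmentation full-coverage.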
Applied to a small-vocabulary task
[Figure: ground truth type (one, two, three, four, five, six, seven, eight, nine, oh, zero) plotted against cluster ID (33, 12, 47, 60, 66, 27, 83, 51, 92, 38, 63, 14, 89, 24, 85)]
7 / 17
Applied to a large-vocabulary task
[Figure: bar chart of F-score (%), on a scale from 0 to 70, for token, type and boundary scores. Systems compared: ZRSBaselineUTD (SI), UTDGraphCC (SI), SyllableSegOsc+ (SD), BayesSegMinDur-MFCC (SD), BayesSegMinDur-cAE (SI)]
ZRSBaselineUTD: [Versteegh et al., 2015]; UTDGraphCC: [Lyzinski et al., 2015]; SyllableSegOsc+: [Räsänen et al., 2015]
8 / 17
Acoustic word embeddings
$x_i \in \mathbb{R}^d$ in $d$-dimensional space [figure: segments $Y_1$ and $Y_2$ mapped to embeddings $f_e(Y_1)$ and $f_e(Y_2)$]
9 / 17
Acoustic word embeddings
Useful for more than just unsupervised modelling
• Segmental conditional random field ASR [Maas et al., 2012] [figure: ran, $f_1 = 1$; Andrew, $f_1 = 0$]
• Whole-word lattice rescoring [Bengio and Heigold, 2014]
• Query-by-example search, e.g. [Chen et al., 2015] for “Okay Google” [figure]
10 / 17
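Word classification CNN
Fully supervised approach [Bengio and Heigold, 2014]
[Figure: network diagram, from input to output: speech segment $Y_i$ → convolution → max pooling (× n conv) → fully connected (× n full) → embedding $x_i = f_e(Y_i)$ → softmax → one-hot word label $w_i$ = (0 0 0 ··· 1 ··· 0 0)]
11 / 17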
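The layer stack in this diagram can be sketched as a forward pass in plain Python. This is an illustration under assumptions rather than the architecture of [Bengio and Heigold, 2014]: it uses one convolution layer and one fully connected layer, a ReLU non-linearity, max pooling over the whole time axis, and reads the embedding from the layer just before the softmax. The parameter names (`filters`, `W_embed`, `W_out`, and so on) are hypothetical.

```python
import math


def conv1d_relu(frames, filters):
    """Slide each filter over time; a filter spans `width` consecutive frames.

    frames: list of frames (each a list of floats).
    filters: list of filters, each shaped [width][dim].
    Returns one activation vector (one value per filter) for each time step.
    """
    width = len(filters[0])
    out = []
    for t in range(len(frames) - width + 1):
        window = frames[t:t + width]
        out.append([
            max(0.0, sum(w * v
                         for w_row, f_row in zip(filt, window)
                         for w, v in zip(w_row, f_row)))
            for filt in filters
        ])
    return out


def max_over_time(activations):
    """Max pooling over the time axis: variable length in, fixed size out."""
    return [max(column) for column in zip(*activations)]


def dense(x, weights, bias):
    """Fully connected layer without a non-linearity."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax over word types."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def classify(frames, params):
    """Return (embedding x_i, word-type probabilities) for one speech segment."""
    pooled = max_over_time(conv1d_relu(frames, params["filters"]))
    x = dense(pooled, params["W_embed"], params["b_embed"])  # x_i = f_e(Y_i)
    probs = softmax(dense(x, params["W_out"], params["b_out"]))
    return x, probs


def cross_entropy(probs, word_index):
    """Loss against a one-hot label w_i with a 1 at position word_index."""
    return -math.log(probs[word_index])
```

Because of the pooling step, segments of different durations all yield an embedding of the same dimensionality, which is what allows the layer before the softmax to serve as $f_e$.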
Word similarity Siamese CNN
Weak supervision that we sometimes have [Thiollière et al., 2015]: known word pairs
$S_{\text{train}} = \{(m, n) : (Y_m, Y_n) \text{ are of the same type}\}$
Use the idea of Siamese networks [Bromley et al., 1993]
[Figure: segments $Y_1$ and $Y_2$ pass through the same network to give $x_1 = f_e(Y_1)$ and $x_2 = f_e(Y_2)$, which are compared using a distance $l(x_1, x_2)$]
12 / 17
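A minimal sketch of the Siamese setup follows. The slide specifies only a distance $l(x_1, x_2)$ between the two embeddings, so the specific choices here are assumptions: cosine distance for $l$, a contrastive-style hinge loss, and a toy `f_e` (mean frame followed by a linear map) in place of the CNN. The helper names are hypothetical.

```python
import math
from itertools import combinations


def same_type_pairs(labels):
    """Build S_train = {(m, n) : segments m and n are of the same type}."""
    return [(m, n) for m, n in combinations(range(len(labels)), 2)
            if labels[m] == labels[n]]


def f_e(frames, W):
    """Toy embedding function (an assumption): mean frame, then a linear map W."""
    dim = len(frames[0])
    mean = [sum(frame[k] for frame in frames) / len(frames) for k in range(dim)]
    return [sum(w * v for w, v in zip(row, mean)) for row in W]


def cosine_distance(x1, x2):
    """One possible choice for the distance l(x_1, x_2)."""
    dot = sum(p * q for p, q in zip(x1, x2))
    norm = math.sqrt(sum(p * p for p in x1)) * math.sqrt(sum(q * q for q in x2))
    return 1.0 - dot / norm


def siamese_distance(Y1, Y2, W):
    """Both branches call the same f_e with the same W: the weights are tied."""
    return cosine_distance(f_e(Y1, W), f_e(Y2, W))


def contrastive_loss(distance, same_type, margin=0.5):
    """Pull same-type pairs together; push other pairs at least `margin` apart."""
    if same_type:
        return distance
    return max(0.0, margin - distance)
```

Training would adjust the shared parameters `W` to lower this loss over pairs drawn from `same_type_pairs`; pairs of different types have to be obtained some other way (for example by sampling), since the weak supervision only identifies same-type pairs.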