SLIDE 1 Unsupervised speech processing using acoustic word embeddings
Herman Kamper
School of Informatics, University of Edinburgh → TTI at Chicago
MLSLP 2016: Spotlight invited talk
SLIDE 2 Unsupervised speech processing
- Speech recognition applications are becoming widespread
- Google Voice Search already supports more than 50 languages:
English, Spanish, German, . . . , Afrikaans, Zulu
- But there are roughly 7000 languages spoken in the world!
- Audio data are becoming available, even for languages spoken by
only a few speakers, but these data are generally unlabelled
- Goal: Unsupervised learning of linguistic structure directly from raw
speech audio, in order to develop zero-resource speech technology
SLIDE 5 Motivation for unsupervised speech processing
Criticism:
- There is always some labelled data to start with (e.g. from a related language)
- With a small set of labelled data, it becomes a semi-supervised problem
Reasons for focusing on the purely unsupervised case:
- Modelling infant language acquisition [Räsänen, 2012]
- Language acquisition in robotics [Renkens and Van hamme, 2015]
- Practical use of zero-resource technology: allow linguists to analyze
and investigate unwritten languages [Besacier et al., 2014]
- New insights and models for speech processing: e.g. unsupervised
methods can improve supervised systems [Jansen et al., 2012]
SLIDE 7 Unsupervised segmentation and clustering
Full-coverage segmentation:
SLIDE 10 Segmental modelling for full-coverage segmentation
Previous models use explicit subword discovery directly on speech features, e.g. [Lee et al., 2015].
Our approach uses whole-word segmental representations, i.e. acoustic word embeddings
[Kamper et al., IS'15; Kamper et al., TASLP'16]
SLIDE 13 Acoustic word embeddings
A variable-length speech segment Y is mapped to a fixed-dimensional embedding xi = fe(Y) ∈ Rd in d-dimensional space.
[Figure: segments Y1 and Y2 mapped to embeddings fe(Y1) and fe(Y2)]
Dynamic programming alignment of two segments has quadratic complexity, while comparing two embeddings takes linear time, and standard clustering methods can be applied directly in the embedding space.
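To make the complexity contrast on this slide concrete, here is a minimal pure-Python sketch; the toy sequences in the test and the simple downsampling embedding function are illustrative assumptions, not the talk's exact setup. DTW alignment of segments of lengths n and m costs O(nm), while comparing two fixed d-dimensional embeddings costs O(d):

```python
import math

def dtw_cost(a, b):
    """Dynamic time warping alignment cost between two variable-length
    sequences of feature vectors: O(len(a) * len(b))."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = math.dist(a[i - 1], b[j - 1])
            D[i][j] = dist + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def downsample_embed(segment, k=4):
    """A simple acoustic word embedding: keep k frames sampled uniformly
    in time and concatenate them, giving a fixed dimension k * d."""
    idx = [round(i * (len(segment) - 1) / (k - 1)) for i in range(k)]
    return [x for i in idx for x in segment[i]]

def cosine_distance(x, y):
    """O(d) comparison in the embedding space."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)
```

Two tokens of the same word type (similar trajectories at different speaking rates) should come out close under both measures, but only the embedding comparison avoids the quadratic alignment.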
SLIDE 16 An unsupervised segmental Bayesian model
[Figure: the model, built up in stages]
- Speech waveform → acoustic frames y1:M through frame-level feature extraction fa(·)
- Each hypothesized word segment yt1:t2 is mapped to an embedding xi = fe(yt1:t2)
- Acoustic modelling: a Bayesian Gaussian mixture model gives p(xi | h−)
- Word segmentation over the hypothesized segments
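The acoustic modelling step can be sketched as the component-assignment step of a collapsed Gibbs sampler for a Bayesian GMM. This is a minimal illustration under simplifying assumptions (spherical components with fixed variance, a zero-mean Gaussian prior on the component means, symmetric Dirichlet mixture weights with parameter alpha), not the talk's actual model or sampler:

```python
import math

def log_predictive(x, count, dim_sums, var=1.0, prior_var=5.0):
    """Log posterior predictive N(x; mu_p, var + var_p) of a spherical
    Gaussian component with fixed variance `var`, a zero-mean Gaussian
    prior with variance `prior_var` on its mean, `count` currently
    assigned embeddings and per-dimension sums `dim_sums`."""
    post_var = 1.0 / (1.0 / prior_var + count / var)
    pred_var = var + post_var
    logp = 0.0
    for xd, sd in zip(x, dim_sums):
        post_mean = post_var * sd / var
        logp += (-0.5 * math.log(2 * math.pi * pred_var)
                 - (xd - post_mean) ** 2 / (2 * pred_var))
    return logp

def assignment_probs(x, counts, sums, alpha=1.0):
    """One collapsed Gibbs step: p(z_i = k | x_i, h-) for each component,
    proportional to (N_k + alpha/K) * p(x_i | other data in component k)."""
    K = len(counts)
    logs = [math.log(counts[k] + alpha / K)
            + log_predictive(x, counts[k], sums[k]) for k in range(K)]
    mx = max(logs)                       # log-sum-exp normalization
    unnorm = [math.exp(l - mx) for l in logs]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]
```

An embedding that sits near one component's posterior mean should be assigned to that component with probability close to one.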
SLIDE 23 Applied to a small-vocabulary task
[Figure: discovered cluster IDs (33, 12, 47, 60, 66, 27, 83, 51, 92, 38, 63, 14, 89, 24, 85) mapped to the ground-truth word types zero, two, three, four, five, six, seven, eight, nine]
SLIDE 25 Applied to a large-vocabulary task
[Figure: word token, type and boundary F-scores (%), on a scale from 10 to 70, for ZRSBaselineUTD (SI), UTDGraphCC (SI), SyllableSegOsc+ (SD), BayesSegMinDur-MFCC (SD) and BayesSegMinDur-cAE (SI)]
ZRSBaselineUTD: [Versteegh et al., 2015]; UTDGraphCC: [Lyzinski et al., 2015]; SyllableSegOsc+: [Räsänen et al., 2015]
SLIDE 27 Acoustic word embeddings
[Figure repeated: segments Y1 and Y2 mapped to embeddings fe(Y1) and fe(Y2), xi ∈ Rd in d-dimensional space]
SLIDE 28 Acoustic word embeddings
Useful for more than just unsupervised modelling:
- Segmental conditional random field ASR [Maas et al., 2012], e.g. segment features (Andrew, f1=0), (ran, f1=1)
- Whole-word lattice rescoring [Bengio and Heigold, 2014]
- Query-by-example search, e.g. [Chen et al., 2015] for "Okay Google"
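In the query-by-example setting, the core idea reduces to nearest-neighbour search in embedding space: embed the spoken query and every indexed segment once, then rank by vector distance. A minimal sketch with illustrative function names, not the LSTM-based system of [Chen et al., 2015]:

```python
import math

def cosine_distance(x, y):
    """Cosine distance between two embedding vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in x)) *
                        math.sqrt(sum(b * b for b in y)))

def query_by_example(query_emb, segment_embs, top_n=3):
    """Rank indexed segment embeddings by distance to the query
    embedding; the nearest segments are the hypothesized matches."""
    ranked = sorted(range(len(segment_embs)),
                    key=lambda i: cosine_distance(query_emb, segment_embs[i]))
    return ranked[:top_n]
```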
SLIDE 31 Word classification CNN
Fully supervised approach [Bengio and Heigold, 2014]
[Figure: the CNN built up layer by layer: the word segment Yi passes through ×nconv convolution and max-pooling layers and ×nfull fully connected layers, giving the embedding xi = fe(Yi), followed by a softmax over the one-hot word label wi = (0 0 0 · · · 1 · · · 0 0)]
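A toy forward pass sketching the architecture on this slide, with illustrative weights and two simplifications: max pooling is taken over all of time, and the fully connected stack is collapsed into a single softmax layer. This is not the actual model of [Bengio and Heigold, 2014]; it only shows how the network turns a variable-length segment into a fixed-dimensional embedding and then word-label scores:

```python
import math

def conv1d(frames, filt):
    """Valid 1-D convolution over time with ReLU: the filter spans
    `width` consecutive frames across all feature dimensions."""
    width = len(filt)  # filt is a width x dim weight matrix
    out = []
    for t in range(len(frames) - width + 1):
        s = 0.0
        for w in range(width):
            s += sum(a * b for a, b in zip(filt[w], frames[t + w]))
        out.append(max(0.0, s))
    return out

def softmax(z):
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    Z = sum(e)
    return [v / Z for v in e]

def word_cnn(frames, filters, out_weights):
    """Toy forward pass: convolution plus max-over-time pooling gives a
    fixed-dimensional embedding x = fe(Y) regardless of input length;
    a softmax layer then scores the word labels."""
    embedding = [max(conv1d(frames, f)) for f in filters]  # one value per filter
    logits = [sum(w * x for w, x in zip(row, embedding)) for row in out_weights]
    return embedding, softmax(logits)
```

Segments of different lengths yield embeddings of the same dimensionality, which is the property the word classifier (and later the Siamese variant) relies on.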
SLIDE 46 Word similarity Siamese CNN
Weak supervision we sometimes have [Thiollière et al., 2015]: known word pairs Strain = {(m, n) : (Ym, Yn) are of the same type}
Use the idea of Siamese networks [Bromley et al., 1993]
[Figure: two tied networks map Y1 to x1 = fe(Y1) and Y2 to x2 = fe(Y2), trained with a distance-based loss l(x1, x2)]
SLIDE 50 Triplet margin-based loss
Margin-based triplet hinge loss [Mikolov et al., 2013]:
ltriplets = max{0, m + dcos(x1, x2) − dcos(x1, x3)}
where dcos(x1, x2) = (1 − cos(x1, x2))/2 is the cosine distance between x1 and x2, and m is a margin parameter. Pair (x1, x2) is of the same type; pair (x1, x3) is of different types.
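The loss on this slide written out directly (a minimal sketch; the default margin value and any test vectors are illustrative, not values from the talk):

```python
import math

def cosine_distance(x, y):
    """d_cos(x, y) = (1 - cos(x, y)) / 2, which lies in [0, 1]."""
    cos = sum(a * b for a, b in zip(x, y)) / (
        math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))
    return (1.0 - cos) / 2.0

def triplet_loss(x1, x2, x3, margin=0.15):
    """max{0, m + d_cos(x1, x2) - d_cos(x1, x3)}: the loss is zero once
    the different-type pair (x1, x3) is at least a margin m farther
    apart than the same-type pair (x1, x2)."""
    return max(0.0, margin + cosine_distance(x1, x2) - cosine_distance(x1, x3))
```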
SLIDE 51 Evaluation of acoustic word embeddings
[Figure: average precision (0.0 to 0.6) of downsampling, reference vector, word classifier CNN and Siamese CNN embeddings]
But the Siamese CNN still uses weak supervision; there is still work to be done for the purely unsupervised case, e.g. [Chung et al., IS'16].
SLIDE 53 Looking forward
- Much to be done in zero-resource speech processing
- Core issues: evaluation; what do we want to discover?
- Do these models allow us to model language acquisition in human infants?
- Can these models be used for language acquisition in robotic applications?
- Extensions to multiple modalities
SLIDE 57 Take-aways
- Unsupervised, or zero-resource, speech processing is an important and cool problem
- Segmental acoustic word embeddings are a sensible way to approach unsupervised segmentation and clustering, and are cool in general
- It is interesting to look at speech problems from a different perspective: it allows you to play around with cool models and gain new insights
SLIDE 58 Poster: Better features using the correspondence autoencoder
Two problems in zero-resource speech processing:
- 1. Unsupervised segmentation and clustering
- 2. Unsupervised frame-level representation learning: fa(·)
[Figure: the correspondence autoencoder ("cool model")]
SLIDE 61
Code: https://github.com/kamperh
SLIDE 62 References I
- Abdel-Hamid, O., Deng, L., Yu, D., and Jiang, H. (2013).
Deep segmental neural networks for speech recognition. In Proc. Interspeech.
- Bengio, S. and Heigold, G. (2014).
Word embeddings for speech recognition. In Proc. Interspeech.
- Besacier, L., Barnard, E., Karpov, A., and Schultz, T. (2014).
Automatic speech recognition for under-resourced languages: A survey. Speech Commun., 56:85–100.
- Bromley, J., Bentz, J. W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Säckinger, E., and Shah, R. (1993). Signature verification using a ‘Siamese’ time delay neural network. Int. J. Pattern Rec., 7(4):669–688.
SLIDE 63 References II
- Chen, G., Parada, C., and Sainath, T. N. (2015).
Query-by-example keyword spotting using long short-term memory networks. In Proc. ICASSP.
- Chung, Y.-A., Wu, C.-C., Shen, C.-H., and Lee, H.-Y. (2016). Unsupervised learning of audio segment representations using sequence-to-sequence recurrent neural networks. In Proc. Interspeech.
- Jansen, A. et al. (2013).
A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition. In Proc. ICASSP.
SLIDE 64 References III
- Kamper, H., Goldwater, S. J., and Jansen, A. (2015).
Fully unsupervised small-vocabulary speech recognition using a segmental Bayesian model. In Proc. Interspeech.
- Kamper, H., Jansen, A., and Goldwater, S. J. (2016a).
Unsupervised word segmentation and lexicon discovery using acoustic word embeddings. IEEE/ACM Trans. Audio, Speech, Language Process., 24(4):669–679.
- Kamper, H., Wang, W., and Livescu, K. (2016b).
Deep convolutional acoustic word embeddings using word-pair side information. In Proc. ICASSP.
SLIDE 65 References IV
- Lee, C.-y., O’Donnell, T., and Glass, J. R. (2015). Unsupervised lexicon discovery from acoustic input. Trans. ACL, 3:389–403.
- Levin, K., Henry, K., Jansen, A., and Livescu, K. (2013).
Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings. In Proc. ASRU.
- Levin, K., Jansen, A., and Van Durme, B. (2015).
Segmental acoustic indexing for zero resource keyword search. In Proc. ICASSP.
- Lyzinski, V., Sell, G., and Jansen, A. (2015).
An evaluation of graph clustering methods for unsupervised term discovery. In Proc. Interspeech.
SLIDE 66 References V
- Maas, A. L., Miller, S. D., O’Neil, T. M., Ng, A. Y., and Nguyen, P.
(2012). Word-level acoustic modeling with convolutional vector regression. In Proc. ICML Workshop Representation Learn.
- Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Räsänen, O. J. (2012). Computational modeling of phonetic and lexical learning in early language acquisition: Existing models and future directions. Speech Commun., 54:975–997.
SLIDE 67 References VI
- Räsänen, O. J., Doyle, G., and Frank, M. C. (2015). Unsupervised word discovery from speech using automatic segmentation into syllable-like units. In Proc. Interspeech.
- Renkens, V. and Van hamme, H. (2015).
Mutually exclusive grounding for weakly supervised non-negative matrix factorisation. In Proc. Interspeech.
- Thiollière, R., Dunbar, E., Synnaeve, G., Versteegh, M., and Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. In Proc. Interspeech.
SLIDE 68 References VII
- Versteegh, M., Thiollière, R., Schatz, T., Cao, X. N., Anguera, X., Jansen, A., and Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. In Proc. Interspeech.