Unsupervised neural and Bayesian models for zero-resource speech processing
Herman Kamper
University of Edinburgh; TTI at Chicago
MIT CSAIL, 15 Nov. 2016
http://www.kamperh.com
Speech recognition success
[Xiong et al., arXiv'16]
- Google Voice: English, Spanish, German, ..., Zulu (∼50 languages)
- Data: 2000 hours of labelled speech audio; ∼350M words of text
- But: Can we do this for all 7000 languages spoken in the world?
Unsupervised speech processing
Developing unsupervised methods that can learn structure directly from raw speech audio, i.e. zero-resource technology
Criticism: there is always some data available, so this is really a semi-supervised problem
Reasons for studying the purely unsupervised case:
- Modelling infant language acquisition [Räsänen, SpecCom'12]
- Language acquisition in robotics [Renkens and Van hamme, IS'15]
- Analysis of audio for unwritten languages [Besacier et al., SpecCom'14]
- New insights and models for speech processing [Jansen et al., ICASSP'13]
Unsupervised speech processing: Two problems
1. Unsupervised frame-level representation learning: learn a frame-level feature extractor fa(·) from unlabelled speech
   [Figure: speech frames mapped through fa(·) into a downstream model]
2. Unsupervised segmentation and clustering: how do we discover meaningful units in unlabelled speech?
Unsupervised term discovery (UTD)
Find recurring word- or phrase-like patterns directly in unlabelled speech, without any transcriptions
[Park and Glass, TASLP'08]
Full-coverage segmentation and clustering
Segment and cluster the entire speech input into word-like units, rather than only isolated discovered terms
Unsupervised speech processing: Two problems
1. Unsupervised frame-level representation learning
2. Unsupervised segmentation and clustering
We focus on full-coverage segmentation and clustering
Our claim: Unsupervised speech processing benefits from both top-down and bottom-up modelling
Top-down and bottom-up modelling
- Top-down: use knowledge of higher-level units to learn about lower-level parts
- Bottom-up: piece together lower-level parts to get more complex higher-level structures
[Feldman et al., CCSS'09]
Unsupervised frame-level representation learning: The Correspondence Autoencoder
With Micha Elsner, Daniel Renshaw, Aren Jansen and Sharon Goldwater
Supervised representation learning using a DNN
Input: speech frame(s), e.g. MFCCs or filterbanks
Output: predicted phone states (e.g. ay, ey, k, v)
The feature extractor fa(·) is learned from data; the phone classifier is learned jointly
Unsupervised setting: there are no phone class targets to train the network on
Autoencoder (AE) neural network
Input: speech frame; output: reconstruction of the same frame (code sketch below)
[Badino et al., ICASSP'14]
- Completely unsupervised
- But purely bottom-up
- Can we use top-down information?
- Idea: use unsupervised term discovery
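As a concrete reference point, here is a minimal frame-level autoencoder sketch in PyTorch. The layer sizes, tanh activations and 39-dimensional MFCC input are illustrative assumptions, not the exact architecture from the talk:

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Bottom-up AE: encode a speech frame, then reconstruct it."""
    def __init__(self, n_input=39, n_hidden=100, n_bottleneck=39):
        super().__init__()
        # fa(.): the learned frame-level feature extractor
        self.encoder = nn.Sequential(
            nn.Linear(n_input, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, n_bottleneck), nn.Tanh(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_bottleneck, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, n_input),
        )

    def forward(self, frame):
        return self.decoder(self.encoder(frame))

model = FrameAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

frames = torch.randn(256, 39)          # stand-in for a batch of MFCC frames
loss = loss_fn(model(frames), frames)  # target is the input frame itself
loss.backward()
optimizer.step()
```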
Unsupervised term discovery (UTD)
Can we use these discovered word pairs to give weak top-down supervision?
Weak top-down supervision: Align frames
Use dynamic time warping (DTW) to align the frames of the two words in each discovered pair (sketch below)
[Jansen et al., ICASSP'13]
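A minimal sketch of this alignment step, assuming each word token is a (frames × dimensions) feature matrix; this is plain DTW with a Euclidean local cost, not the papers' exact implementation:

```python
import numpy as np

def dtw_align(word_a, word_b):
    """Align two feature matrices of shape (T, d) with DTW; return index pairs."""
    T1, T2 = len(word_a), len(word_b)
    # Local cost: Euclidean distance between every pair of frames
    cost = np.linalg.norm(word_a[:, None, :] - word_b[None, :, :], axis=-1)
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end to recover the aligned frame pairs
    i, j, pairs = T1, T2, []
    while i > 0 and j > 0:
        pairs.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# Each aligned pair (frame in word A, frame in word B) becomes an
# (input, target) training pair for the correspondence autoencoder.
```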
Autoencoder (AE)
Input: speech frame; output: reconstruct the same frame
Correspondence autoencoder (cAE)
Input: frame from one word; output: the aligned frame from the other word in the pair
Unsupervised feature extractor fa(·)
Combines top-down and bottom-up information
Correspondence autoencoder (cAE): training scheme
(1) Train a stacked autoencoder on the speech corpus (pretraining)
(2) Run unsupervised term discovery on the corpus to find word pairs
(3) Align the frames of each word pair
(4) Initialize the cAE weights from the stacked autoencoder and train on the aligned frame pairs, giving the unsupervised feature extractor
[Kamper et al., ICASSP'15]
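A sketch of step (4), under the same illustrative assumptions as the autoencoder above: the network has the same shape as the pretrained AE, but each input frame's training target is now its DTW-aligned partner frame rather than the frame itself:

```python
import torch
import torch.nn as nn

# Illustrative cAE: same encoder/decoder shape as the pretrained AE;
# in practice its weights would be copied from the stacked autoencoder.
cae = nn.Sequential(
    nn.Linear(39, 100), nn.Tanh(),
    nn.Linear(100, 39), nn.Tanh(),  # take a middle layer's output as fa(.)
    nn.Linear(39, 100), nn.Tanh(),
    nn.Linear(100, 39),
)
optimizer = torch.optim.Adam(cae.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-ins for DTW-aligned frame pairs from discovered word pairs
inputs = torch.randn(256, 39)   # frames from one word in each pair
targets = torch.randn(256, 39)  # the aligned frames from the other word

loss = loss_fn(cae(inputs), targets)  # reconstruct the *partner* frame
loss.backward()
optimizer.step()
```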
Intrinsic evaluation: Isolated word query task
[Bar chart: average precision (0.0 to 0.5) for autoencoder, UBM-GMM, TopUBM and cAE features]
Extended: [Renshaw et al., IS'15] and [Yuan et al., IS'16]
Unsupervised segmentation and clustering: The Segmental Bayesian Model
With Aren Jansen and Sharon Goldwater
Full-coverage segmentation and clustering
Segmental modelling for full-coverage segmentation
Previous models perform explicit subword discovery directly on speech features, e.g. [Lee et al., TACL'15]
Our approach uses whole-word segmental representations, i.e. acoustic word embeddings [Kamper et al., IS'15; Kamper et al., TASLP'16]
Acoustic word embeddings
A variable-length segment Y is mapped to a fixed-dimensional embedding x = fe(Y) ∈ R^d
[Figure: segments Y1 and Y2 embedded as points fe(Y1) and fe(Y2) in d-dimensional space]
DTW alignment of two segments has quadratic complexity in segment length, while comparing two embeddings is linear in d, and standard clustering methods can be applied directly to the embeddings.
Unsupervised segmental Bayesian model
[Model diagram: speech waveform → acoustic frames y1:M (frame-level features fa(·)) → embeddings xi = fe(yt1:t2) of hypothesized word segments → Bayesian Gaussian mixture model p(xi | h−)]
The Bayesian Gaussian mixture model performs acoustic modelling (clustering of embeddings), and word segmentation is inferred jointly with it.
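The acoustic modelling component can be illustrated with scikit-learn's off-the-shelf Bayesian Gaussian mixture. This is only a sketch: the talk's model uses collapsed Gibbs sampling over the embeddings, interleaved with resampling the segmentation, whereas this variational stand-in just shows the clustering role of p(xi | h−):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Stand-in for acoustic word embeddings of hypothesized segments
embeddings = np.random.randn(500, 130)  # e.g. 10 frames x 13 MFCCs, flattened

# Bayesian GMM: each mixture component plays the role of a word type;
# the Dirichlet process prior lets the data use fewer than n_components.
bgmm = BayesianGaussianMixture(
    n_components=100,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="spherical",
)
cluster_ids = bgmm.fit_predict(embeddings)

# In the full model, a score like this for a candidate segment's
# embedding is what drives the word segmentation step.
scores = bgmm.score_samples(embeddings)
```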
Acoustic word embeddings: Downsampling
[Figure: frame-level features fa(·) over a segment are uniformly downsampled and flattened into a single vector fe(·); sketch below]
- Simple embedding approach also used in other studies, e.g. [Abdel-Hamid et al., IS'13]
- Consider both MFCCs and cAE features as the frame-level function fa(·)
- The cAE thus combines top-down learned feature representations with segmentation and clustering
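A minimal sketch of this downsampling embedding; keeping 10 frames is an illustrative choice, not the value used in the papers:

```python
import numpy as np

def downsample_embed(frames: np.ndarray, n_keep: int = 10) -> np.ndarray:
    """Embed a variable-length segment of shape (T, d) as a fixed vector.

    Uniformly sample n_keep frames across the segment and flatten, so
    any segment maps to an (n_keep * d)-dimensional embedding.
    """
    T = len(frames)
    idx = np.linspace(0, T - 1, n_keep).round().astype(int)
    return frames[idx].flatten()

segment = np.random.randn(47, 13)  # e.g. 47 frames of 13-d MFCCs
x = downsample_embed(segment)      # fixed 130-d embedding
assert x.shape == (130,)
```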
Evaluation
[Figure: the utterance "yeah i mean" (phonemes y ae | ay m iy n) with its ground truth word- and phoneme-level alignments against the unsupervised cluster-level prediction (Cluster 931, Cluster 477)]
Metrics:
- Unsupervised word error rate (WER), obtained by mapping each discovered cluster to a ground truth word label
- Word token precision, recall, F-score: parsing quality
- Word type precision, recall, F-score: cluster quality
- Word boundary precision, recall, F-score: parsing quality (sketch below)
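As an example of the last metric, a simplified word boundary scorer; the published evaluations use a small tolerance around each boundary (commonly on the order of 20 ms or one frame shift), and the exact protocol differs per benchmark:

```python
def boundary_prf(ref: list, hyp: list, tol: float = 0.02):
    """Word boundary precision/recall/F-score (sketch).

    ref, hyp: boundary times in seconds; a hypothesized boundary is a
    hit if it lies within tol seconds of an unmatched reference boundary.
    """
    matched = set()
    n_hits = 0
    for b in hyp:
        for i, r in enumerate(ref):
            if i not in matched and abs(b - r) <= tol:
                matched.add(i)
                n_hits += 1
                break
    precision = n_hits / len(hyp) if hyp else 0.0
    recall = n_hits / len(ref) if ref else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Example: two of three hypothesized boundaries fall within tolerance
print(boundary_prf(ref=[0.31, 0.62, 0.95], hyp=[0.30, 0.64, 0.80]))
```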
Small-vocabulary segmentation and clustering
[Bar chart: unsupervised WER (%) for the discrete HMM (K = 11) and BayesSeg (K = 100 and K = 11)]
Discrete HMM: [Walter et al., ASRU'13]. BayesSeg: [Kamper et al., TASLP'16].
Small-vocabulary segmentation and clustering
[Figure: mapping between discovered cluster IDs (33, 12, 47, 60, 66, 27, 83, 51, 92, 38, 63, 14, 89, 24, 85) and the ground truth digit types oh, one, two, three, four, five, six, seven, eight, nine, zero]
[Kamper et al., TASLP'16]
Large-vocabulary: English
[Bar chart: word token, type and boundary F-scores (%) for ZRSBaselineUTD (SI), UTDGraphCC (SI), SyllableSegOsc+ (SD), BayesSegMinDur-MFCC (SD) and BayesSegMinDur-cAE (SI)]
ZRSBaselineUTD: [Versteegh et al., IS'15]. UTDGraphCC: [Lyzinski et al., IS'15]. SyllableSegOsc+: [Räsänen et al., IS'15]. BayesSeg: [Kamper et al., arXiv'16].
Large-vocabulary: Xitsonga
[Bar chart: word token, type and boundary F-scores (%) for ZRSBaselineUTD (SI), UTDGraphCC (SI), SyllableSegOsc+ (SD), BayesSegMinDur-MFCC (SD) and BayesSegMinDur-cAE (SI)]
ZRSBaselineUTD: [Versteegh et al., IS'15]. UTDGraphCC: [Lyzinski et al., IS'15]. SyllableSegOsc+: [Räsänen et al., IS'15]. BayesSeg: [Kamper et al., arXiv'16].
The true (less rosy) picture
[Figure: the acoustic word embedding from cluster 33 (→ "one") alongside embeddings close to it that belong to non-word segments]
Bottom-up constraints
- Minimum and maximum duration constraints
- Use unsupervised syllable boundary detection [Räsänen et al., IS'15]
[Figure: example segmentation of a speech waveform into syllable-like units with the oscillator]
Bottom-up constraints
[Model diagram as before: waveform → frames → embeddings → Bayesian Gaussian mixture model]
Performs top-down segmentation while adhering to bottom-up constraints
Effect of using cAE features

              ----- English (%) -----    ----- Xitsonga (%) ----
Embeddings    Cluster  Speaker  Gender   Cluster  Speaker  Gender
MFCC             29.9     55.9    87.6      24.5     43.1    87.1
cAE              30.0     35.7    73.8      33.1     29.3    76.6

Higher cluster purity is better; for speaker and gender purity, lower values are better, since word clusters should capture word identity rather than being organized around the speaker.
Summary and Conclusions
Conclusions
Unsupervised speech processing benefits from both top-down and bottom-up modelling
- Correspondence autoencoder: uses top-down constraints with bottom-up initialization to improve frame-level representations
- Segmental Bayesian model: top-down segmentation taking bottom-up constraints into account
- English and Xitsonga: large-vocabulary multi-speaker data
- cAE in BayesSeg: improves cluster, speaker and gender purity
Extending this work
- Improve cAE using UTD and vice versa (with Sameer Bansal)
- Improve unsupervised acoustic word embeddings [Chung et al., IS’16]
- Simplify BayesSeg so that it can be applied to larger corpora
- Frame-based vs. segmental unsupervised models
- Evaluation: What do we want to discover?
Looking forward
- Building audio analysis tools for field linguists
- Using weak labels, e.g. translations [Bansal et al., arXiv'16] (with Sameer Bansal, Adam Lopez, Sharon Goldwater)
- Language acquisition in humans and robots
- Extending models to multiple modalities (with Shane Settle, Karen Livescu, Greg Shakhnarovich)
Code: https://github.com/kamperh
References
- O. Abdel-Hamid, L. Deng, D. Yu, and H. Jiang, “Deep segmental neural networks for speech
recognition,” in Proc. Interspeech, 2013.
- L. Badino, C. Canevari, L. Fadiga, and G. Metta, “An auto-encoder based approach to
unsupervised learning of subword units,” in Proc. ICASSP, 2014.
- S. Bansal, H. Kamper, S. J. Goldwater, and A. Lopez, “Weakly supervised spoken term
discovery using cross-lingual side information,” arXiv preprint arXiv:1609.06530, 2016.
- L. Besacier, E. Barnard, A. Karpov, and T. Schultz, “Automatic speech recognition for
under-resourced languages: A survey,” Speech Commun., vol. 56, pp. 85–100, 2014.
- Y.-A. Chung, C.-C. Wu, C.-H. Shen, and H.-Y. Lee, “Unsupervised learning of audio segment
representations using sequence-to-sequence recurrent neural networks," in Proc. Interspeech, 2016.
- N. H. Feldman, T. L. Griffiths, and J. L. Morgan, “Learning phonetic categories by learning a
lexicon,” in Proc. CCSS, 2009.
- A. Jansen, S. Thomas, and H. Hermansky, “Weak top-down constraints for unsupervised
acoustic model training,” in Proc. ICASSP, 2013.
- A. Jansen et al., “A summary of the 2012 JHU CLSP workshop on zero resource speech
technologies and models of early language acquisition,” in Proc. ICASSP, 2013.
- H. Kamper, M. Elsner, A. Jansen, and S. J. Goldwater, “Unsupervised neural network based
feature extraction using weak top-down constraints,” in Proc. ICASSP, 2015.
- H. Kamper, A. Jansen, and S. J. Goldwater, “Unsupervised word segmentation and lexicon
discovery using acoustic word embeddings,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 4, pp. 669–679, 2016.
- H. Kamper, S. J. Goldwater, and A. Jansen, “Fully unsupervised small-vocabulary speech
recognition using a segmental Bayesian model,” in Proc. Interspeech, 2015.
- H. Kamper, A. Jansen, and S. J. Goldwater, “A segmental framework for fully-unsupervised
large-vocabulary speech recognition,” arXiv preprint arXiv:1606.06950, 2016.
- C.-y. Lee, T. O’Donnell, and J. R. Glass, “Unsupervised lexicon discovery from acoustic
input,” Trans. ACL, vol. 3, pp. 389–403, 2015.
- K. Levin, K. Henry, A. Jansen, and K. Livescu, “Fixed-dimensional acoustic embeddings of
variable-length segments in low-resource settings,” in Proc. ASRU, 2013.
- V. Lyzinski, G. Sell, and A. Jansen, “An evaluation of graph clustering methods for
unsupervised term discovery,” in Proc. Interspeech, 2015.
- A. S. Park and J. R. Glass, “Unsupervised pattern discovery in speech,” IEEE Trans. Audio,
Speech, Language Process., vol. 16, no. 1, pp. 186–197, 2008.
- O. J. Räsänen, "Computational modeling of phonetic and lexical learning in early language acquisition: Existing models and future directions," Speech Commun., vol. 54, pp. 975–997, 2012.
- O. J. Räsänen, G. Doyle, and M. C. Frank, "Unsupervised word discovery from speech using automatic segmentation into syllable-like units," in Proc. Interspeech, 2015.
- V. Renkens and H. Van hamme, “Mutually exclusive grounding for weakly supervised
non-negative matrix factorisation,” in Proc. Interspeech, 2015.
- D. Renshaw, H. Kamper, A. Jansen, and S. J. Goldwater, "A comparison of neural network methods for unsupervised representation learning on the Zero Resource Speech Challenge," in Proc. Interspeech, 2015.
- M. Versteegh, R. Thiollière, T. Schatz, X. N. Cao, X. Anguera, A. Jansen, and E. Dupoux, "The Zero Resource Speech Challenge 2015," in Proc. Interspeech, 2015.
- O. Walter, T. Korthals, R. Haeb-Umbach, and B. Raj, “A hierarchical system for word
discovery exploiting DTW-based initialization,” in Proc. ASRU, 2013.
- W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig,
“Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016.
- Y. Yuan, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Learning neural network representations
using cross-lingual bottleneck features with word-pair information,” in Proc. Interspeech, 2016.