
Machine Learning in Speech Recognition
Chao Zhang
7 March 2013
Cambridge University Engineering Department
Machine Learning Research and Communication Club, March 2013

Toshiba Presentation
Overview
• Characteristics of the Speech Signal
  – A continuous-valued time series generated by encoding various kinds of excitation with a complex time-varying non-linear filter.
  – various kinds of energy excited by
• Multi-Class Extensions
  – combining binary SVMs
  – multi-class SVMs
• Structured SVMs for Continuous Speech Recognition
  – joint feature spaces for structured modelling
  – large margin training
  – relationship with other models
  – lattice based implementation

Characteristics of the Speech Signal
• A continuous-valued time series generated by encoding various kinds of excitation with a complex time-varying non-linear filter.
  – Continuous-valued: this affects our choice of models, and the numerical computation needs care.
  – Time series: the model needs to be able to represent this, and training and decoding efficiency are often a concern.
  – Speech signals take the form of rapidly varying functions.
• Speech signals produced by humans are usually pre-processed with signal processing methods and then used as the input features to the automatic speech recognition (ASR) system.
  – ASR needs to handle human variability: coarticulation, variation over time (mood, aging, ...), gender, accent, etc.
  – ASR needs to face the difficulties that also exist in other signal processing tasks: channel variations, noise, ...

Resources Available for Building ASR
• Phonetic knowledge characterizing how phones are produced with articulator movements.
  – Some rules need to be verified across a large number of speakers.
  – State-of-the-art ASR often adopts statistical models trained with a large amount of speech data (e.g., 3000 hours, or 1.08G samples).
• Lexical and syntactic knowledge is available for a given language and can aid speech recognition.
  – Out-of-vocabulary words.
  – Ill-formed sentences.

Some Basics of Stochastic ASR
• Continuous speech signals are sampled to discrete waveforms, then compressed to a sequence of individual speech frames according to the short-time stationarity property (10 ∼ 30 ms), assuming the vocal tract is time-invariant within each frame.
• Source-filter model based on the maximum a posteriori criterion,
  $$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} P(\mathbf{w} \mid \mathbf{O}) = \arg\max_{\mathbf{w}} P(\mathbf{O} \mid \mathbf{w})\, P(\mathbf{w}).$$
  The second equality holds because $P(\mathbf{O})$ does not depend on $\mathbf{w}$.
  – $\mathbf{O}$ refers to the input speech frame sequence, and $\mathbf{w}$ refers to the word sequence.
  – $P(\mathbf{w})$ and $P(\mathbf{O} \mid \mathbf{w})$ are called the language model and the acoustic model.
  – $\arg\max_{\mathbf{w}}$ is the decoding step, which searches for the most likely hypothesis.
• Hidden Markov Models (HMMs) are most commonly used under this framework.
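As a small illustration of the MAP rule above, the sketch below picks the hypothesis with the highest combined score in the log domain. The hypothesis strings and all scores are made-up values for illustration, and `map_decode` is a hypothetical helper name; a real decoder searches a very large hypothesis space rather than a short list.

```python
import math

def map_decode(acoustic_logp, lm_logp):
    """Return the hypothesis w maximising log P(O|w) + log P(w)."""
    best_w, best_score = None, -math.inf
    for w, ac in acoustic_logp.items():
        score = ac + lm_logp[w]  # log of P(O|w) * P(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Hypothetical log scores for three candidate word sequences.
acoustic_logp = {
    "recognise speech": -120.0,
    "wreck a nice beach": -118.5,
    "recognise peach": -125.0,
}
lm_logp = {
    "recognise speech": -4.0,
    "wreck a nice beach": -9.0,
    "recognise peach": -7.5,
}

best, score = map_decode(acoustic_logp, lm_logp)
```

In this toy case the second hypothesis has the best acoustic score, but the language model prior makes the first one the MAP choice.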

(Continuous Density) Hidden Markov Models
• The sound of a phonetic unit can often be divided into several states, denoted as $\mathbf{s}$, according to its production procedure. Assume $\mathbf{s}$ is 1st-order Markovian,
  $$P(\mathbf{s}) = \prod_{t=1}^{T} P(q_t = s_t \mid q_{t-1} = s_{t-1}).$$
• It is sensible to regard the phone as being produced by another process associated with $\mathbf{s}$. Let us assume this process depends only on the current state, i.e.,
  $$P(\mathbf{O} \mid \mathbf{s}) = \prod_{t=1}^{T} P(o_t \mid \mathbf{s}) = \prod_{t=1}^{T} P(o_t \mid q_t = s_t).$$
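The two factorisations above can be checked numerically on a toy model. The sketch below is only illustrative: the two-state left-to-right model, the probability values, and the helper names are assumptions, and an explicit initial distribution stands in for the $t = 1$ term $P(q_1 \mid q_0)$.

```python
import numpy as np

def log_path_prob(states, log_init, log_trans):
    """log P(s) under the 1st-order Markov assumption."""
    lp = log_init[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        lp += log_trans[prev, cur]
    return lp

def log_obs_prob_given_path(states, log_emis):
    """log P(O|s), where log_emis[t, j] = log P(o_t | q_t = j)."""
    return sum(log_emis[t, s] for t, s in enumerate(states))

# Hypothetical two-state left-to-right model; zeros become -inf in the log domain.
with np.errstate(divide="ignore"):
    log_init = np.log(np.array([1.0, 0.0]))
    log_trans = np.log(np.array([[0.6, 0.4],
                                 [0.0, 1.0]]))
# Hypothetical per-frame emission probabilities for T = 3 frames.
log_emis = np.log(np.array([[0.5, 0.1],
                            [0.4, 0.2],
                            [0.1, 0.7]]))

path = [0, 0, 1]
lp_path = log_path_prob(path, log_init, log_trans)   # log(1.0 * 0.6 * 0.4)
lp_obs = log_obs_prob_given_path(path, log_emis)     # log(0.5 * 0.4 * 0.7)
```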

(Continuous Density) Hidden Markov Models (Cont.)
• Now we have an HMM, denoted as $\lambda$,
  $$P(\mathbf{O} \mid \lambda) = \sum_{\mathbf{s}} P(\mathbf{O} \mid \mathbf{s}, \lambda)\, P(\mathbf{s} \mid \lambda).$$
• In ASR, we usually use constant (time-independent) transition probabilities between states, denoted as
  $$P(q_t = s_t \mid q_{t-1} = s_{t-1}) = a_{s_{t-1} s_t}.$$
• Modern ASR uses continuous densities to model the observation probabilities. Assuming the frames belonging to a given state are i.i.d., Gaussian mixture models are commonly used because they can approximate any continuous density associated with that state to arbitrary precision, i.e.,
  $$b_j(o_t) = P(o_t \mid q_t = j) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t; \mu_{jm}, \Sigma_{jm}).$$
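The sum over all state sequences is usually not enumerated directly; the forward recursion computes the same quantity in time linear in the number of frames. The following is a minimal sketch under stated assumptions: diagonal covariances, an explicit initial state distribution, and toy parameter values that are all hypothetical. The function names are illustrative, not taken from any toolkit.

```python
import numpy as np

def logsumexp(x, axis=None):
    """Numerically stable log(sum(exp(x))) along an axis."""
    m = np.max(x, axis=axis, keepdims=True)
    m = np.where(np.isfinite(m), m, 0.0)  # guard against all -inf inputs
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

def gmm_log_density(o, weights, means, variances):
    """log b_j(o) for a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D), o: (D,)
    """
    d = o.shape[0]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((o - means) ** 2 / variances, axis=1)
    return logsumexp(np.log(weights) + log_norm + log_exp)

def forward_log_likelihood(obs, log_init, log_trans, gmms):
    """log P(O | lambda) via the forward recursion in the log domain."""
    n_states = len(gmms)
    log_b = np.array([[gmm_log_density(o, *gmms[j]) for j in range(n_states)]
                      for o in obs])
    alpha = log_init + log_b[0]
    for t in range(1, len(obs)):
        # alpha_t(j) = log sum_i exp(alpha_{t-1}(i) + log a_ij) + log b_j(o_t)
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_b[t]
    return logsumexp(alpha)

# Hypothetical 2-state model with 1-D observations.
gmms = [
    (np.array([0.5, 0.5]), np.array([[0.0], [1.0]]), np.array([[1.0], [1.0]])),
    (np.array([1.0]), np.array([[3.0]]), np.array([[0.5]])),
]
log_init = np.log(np.array([0.9, 0.1]))
log_trans = np.log(np.array([[0.7, 0.3],
                             [0.2, 0.8]]))
obs = np.array([[0.1], [0.8], [2.9]])

log_lik = forward_log_likelihood(obs, log_init, log_trans, gmms)
```

Working in the log domain avoids underflow when many per-frame densities are multiplied together, which matters for utterances with hundreds of frames.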

HMM Acoustic Models & Decoding
• A set of acoustic models contains HMMs for every phone (syllable, word, etc.) of the target language.
[Figure: alignment of HMM states to frames 1, 2, ..., T-2, T-1, T for HMM: AA and HMM: ZH, with states AA[2], AA[3], ZH[3], ZH[4].]
• Modern ASR systems use a tuple of concatenated phones rather than a single phone to build an HMM, in order to capture coarticulation within and across words (e.g., triphone: ‘IY’ ‘T’ ‘CH’ ‘IY’ ‘Z’ → ‘sil’+‘IY’-‘T’ ‘IY’+‘T’-‘CH’ . . . )
  – Corresponding states of triphone HMMs with the same central unit are often clustered to avoid data sparseness and reduce system complexity.
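The triphone expansion in the example above can be sketched as follows. The labels follow the slide's own left+centre-right notation. Padding both ends with ‘sil’ is an assumption, since the slide only shows the start of the sequence, and the helper names are hypothetical. Grouping by central unit only illustrates which triphones are candidates for state clustering; it is not a clustering algorithm.

```python
from collections import defaultdict

def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into (left, centre, right) triples."""
    padded = [boundary, *phones, boundary]
    return list(zip(padded, padded[1:], padded[2:]))

def label(tri):
    """Format a triple in the left+centre-right notation used on the slide."""
    left, centre, right = tri
    return f"{left}+{centre}-{right}"

def group_by_centre(triphones):
    """Collect triphone labels sharing the same central unit."""
    groups = defaultdict(set)
    for tri in triphones:
        groups[tri[1]].add(label(tri))
    return dict(groups)

triphones = to_triphones(["IY", "T", "CH", "IY", "Z"])
labels = [label(t) for t in triphones]
groups = group_by_centre(triphones)
```

Here `labels` starts with `sil+IY-T` and `IY+T-CH`, matching the slide, and `groups["IY"]` holds the two triphones whose central unit is ‘IY’.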

(Deep) Neural Networks in ASR
• To our knowledge, DNN applications in ASR (in addition to LM) include 3 aspects:
  – Acoustic models: use the pseudo posteriors from the DNN to obtain the observation probabilities.
  – Tandem feature detectors: extract discriminative neural net features and use them together with the original observations.
  – Speech attribute detectors: use DNNs to extract a set of asynchronous speech attributes.
• The DNNs most commonly used in ASR are deep feedforward NNs (except for LM, where people also use deep recurrent NNs).
• The training approaches in use include:
  – Layer-wise generative pre-training (RBMs, etc.)
  – Layer-wise discriminative pre-training.
  – Normalized random initialization.
  – 2nd-order optimization.
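As one concrete reading of "normalized random initialization", the sketch below assumes the slide refers to the scheme of Glorot and Bengio (2010), where weights are drawn uniformly from $[-\sqrt{6/(n_{in}+n_{out})},\ \sqrt{6/(n_{in}+n_{out})}]$. That attribution is an assumption, and the layer sizes and function name are hypothetical.

```python
import numpy as np

def normalized_init(n_in, n_out, rng):
    """Draw an (n_in, n_out) weight matrix from U[-limit, limit],
    with limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 400, 600  # hypothetical layer sizes
weights = normalized_init(n_in, n_out, rng)
```

With this choice the weight variance is $2/(n_{in}+n_{out})$, which is intended to keep activation and gradient scales roughly comparable across layers.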

DNN-HMM Acoustic Models
• A DNN with phone or tied-state targets (√) is fitted into HMM acoustic models by converting the pseudo posteriors into the observation probabilities,
  $$\ln P(o_t \mid s_t) = \ln P(s_t \mid o_t) - \ln P(s_t) + C,$$
  where $C = \ln P(o_t)$ does not depend on the state $s_t$ and can be treated as a constant during decoding.
• Comparing DNN-HMM acoustic models to GMM-HMM acoustic models,
  – GMMs are trained generatively (an additional pass of discriminative training is needed to make them discriminative), individually, and sequentially.
  – A DNN is trained discriminatively and globally at the frame level (it can also be trained at the sequence level by back-propagating the statistics generated and collected using a sequential criterion).
  – A DNN can take the observations of several concatenated frames as its input directly, utilizing the context information.
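The conversion above amounts to subtracting the log state prior from the log posterior for every frame and state. The sketch below uses made-up posteriors and priors; how the priors are estimated is not stated on the slide, so they are simply passed in as an argument here.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors):
    """ln P(s|o) - ln P(s) for every frame and state; the state-independent
    constant C = ln P(o) is dropped."""
    return log_posteriors - np.log(state_priors)

# Hypothetical DNN outputs for 2 frames and 3 states (each row sums to 1).
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
# Hypothetical state priors.
priors = np.array([0.8, 0.1, 0.1])

scaled = scaled_log_likelihoods(np.log(posteriors), priors)
```

In the first frame the posterior favours state 0, but after dividing by the priors the scaled likelihood favours state 1, which shows why the prior term cannot be ignored when plugging DNN outputs into an HMM.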

Tandem Feature Detectors
• How tandem features are used:
  – Extract neural net features.
  – Combine the neural net features with the original input observations.
  – De-correlate the tandem features and reduce their dimensionality.
  – Use the tandem features rather than the original observations as the input to the diagonal GMM-HMM acoustic models.
• Different kinds of DNN features:
  – DNN output posteriors: phone posteriors and tied-state posteriors.
  – Bottleneck DNN: build a DNN (with either phone or tied-state targets) that has a bottleneck hidden layer; use the linear output of the bottleneck layer as the DNN features.
• GMM-HMM systems with DNN (tied-state posteriors) bottleneck tandem features are reported to have comparable performance to DNN-HMM systems.
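The combine, de-correlate, and reduce steps can be sketched with PCA. PCA is chosen here only as one common option; the slide does not say which transform is used. The feature dimensions are hypothetical, and random numbers stand in for real acoustic and neural net features.

```python
import numpy as np

def tandem_features(acoustic, nn_feats, out_dim):
    """Concatenate per-frame features, then de-correlate and reduce with PCA."""
    combined = np.hstack([acoustic, nn_feats])      # (frames, D1 + D2)
    centred = combined - combined.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:out_dim]     # keep the largest out_dim
    return centred @ eigvecs[:, order]

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(200, 13))  # stand-in for original observations
nn_feats = rng.normal(size=(200, 6))   # stand-in for neural net features
tandem = tandem_features(acoustic, nn_feats, out_dim=10)
```

De-correlation matters because the downstream GMMs use diagonal covariances, which cannot model correlations between feature dimensions.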

Speech Attribute Detectors
• Some researchers claim that the linear-chain structure of HMMs is not suitable for covering speech variations, and that it may ignore some useful knowledge. They therefore propose using detection-based systems.
  – Extract and utilize various features from the speech signals based on prior knowledge from linguistics, signal processing, neuroscience, . . .
  – Use more complex models and system structures.
  – The accuracy of the detectors is a key factor affecting performance.
[Figure: detection-based ASR system. Recoverable labels: Speech Signal; Speech Attribute Detectors; Attribute Lattice, Phone Lattice, and Syllable Lattice (each with Prob.); Phone to Syllable Merger; Syllable to Word Merger; Evidence Verifier; Hypotheses; Refine; Knowledge Source, Models, Data, and Tools.]
• Recent studies have used DNNs to detect articulation-derived speech attributes and reported good results.
