ELEN E6884 - Topics in Signal Processing
Topic: Speech Recognition
Lecture 3

Stanley F. Chen, Michael A. Picheny, and Bhuvana Ramabhadran
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
stanchen@us.ibm.com, picheny@us.ibm.com, bhuvana@us.ibm.com

22 September 2009

Outline of Today's Lecture

■ Recap
■ Gaussian Mixture Models - A
■ Gaussian Mixture Models - B
■ Introduction to Hidden Markov Models

Administrivia

■ Main feedback from last lecture
  • EEs: speed OK
  • CSs: hard to follow
■ Remedy: only one more lecture will have serious signal processing content, so don't worry!
■ Lab 1 due Sept 30 (don't wait until the last minute!)

Where are We?

■ Can extract feature vectors over time (LPC, MFCC, or PLP) that characterize the information in a speech signal in a relatively compact form
■ Can perform simple speech recognition by
  • building templates consisting of sequences of feature vectors extracted from a set of words
  • comparing the feature vectors for a new utterance against all the templates using DTW and picking the best-scoring template
■ Learned about some basic concepts (e.g., graphs, distance measures, shortest paths) that will appear over and over again throughout the course

What are the Pros and Cons of DTW?

Pros

■ Easy to implement and compute
■ Lots of freedom - can model arbitrary time warpings

Cons

■ Distance measures completely heuristic
  • Why Euclidean? Are all dimensions of the feature vector created equal?
■ Warping paths heuristic
  • Too much freedom is not always a good thing for robustness
  • Allowable path moves all hand-derived
■ No guarantees of optimality or convergence

How can we Do Better?

■ Key insight 1: learn as much as possible from data - the distance measure, the weights on the graph, even the graph structure itself (future research)
■ Key insight 2: use well-understood theories and models from probability, statistics, and computer science to describe the data, rather than developing new heuristics with ill-defined mathematical properties
■ Start by modeling the distribution of feature vectors associated with different speech sounds, leading to a particular set of models called Gaussian Mixture Models - a formalization of the concept of the distance measure
■ Then derive models for describing the time evolution of feature vectors for speech sounds and words, called Hidden Markov Models - a generalization of the template idea in DTW

Gaussian Mixture Model Overview

■ Motivation for using Gaussians
■ Univariate Gaussians
■ Multivariate Gaussians
■ Estimating parameters for Gaussian distributions
■ Need for mixtures of Gaussians
■ Estimating parameters for Gaussian mixtures
■ Initialization issues
■ How many Gaussians?

How do we Capture Variability?

Data Models

The Gaussian Distribution

A lot of different types of data are distributed like a "bell-shaped curve". Mathematically, we can represent this by what is called a Gaussian or Normal distribution:

$$N(\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(O-\mu)^2}{2\sigma^2}}$$

$\mu$ is called the mean and $\sigma^2$ is called the variance. The value at a particular point $O$ is called the likelihood. The integral of the above distribution is 1:

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(O-\mu)^2}{2\sigma^2}}\, dO = 1$$

It is often easier to work with the logarithm of the above:

$$-\ln(\sqrt{2\pi}\,\sigma) - \frac{(O-\mu)^2}{2\sigma^2}$$

which looks suspiciously like a weighted Euclidean distance!

Advantages of Gaussian Distributions

■ Central Limit Theorem: sums of large numbers of identically distributed random variables tend to Gaussian
■ The sums and differences of Gaussian random variables are also Gaussian
■ If $X$ is distributed as $N(\mu, \sigma)$ then $aX + b$ is distributed as $N(a\mu + b, a\sigma)$, i.e., with variance $(a\sigma)^2$

Gaussians in Two Dimensions

$$N(\mu_1, \mu_2, \sigma_1, \sigma_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\, e^{-\frac{1}{2(1-r^2)}\left[\frac{(O_1-\mu_1)^2}{\sigma_1^2} - \frac{2r(O_1-\mu_1)(O_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(O_2-\mu_2)^2}{\sigma_2^2}\right]}$$

If $r = 0$ we can write the above as the product of two univariate Gaussians:

$$\frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(O_1-\mu_1)^2}{2\sigma_1^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(O_2-\mu_2)^2}{2\sigma_2^2}}$$
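To make the "weighted Euclidean distance" connection concrete, here is a minimal sketch (function name and values are our own illustration, not part of the course materials) that evaluates the univariate Gaussian log-likelihood directly from the formula above:

```python
import numpy as np

def log_gaussian(o, mu, sigma):
    """Log-likelihood ln N(o; mu, sigma) from the formula above:
    -ln(sqrt(2*pi)*sigma) - (o - mu)^2 / (2*sigma^2)."""
    return -np.log(np.sqrt(2 * np.pi) * sigma) - (o - mu) ** 2 / (2 * sigma ** 2)

# The second term is a squared distance to the mean, weighted by 1/sigma^2;
# the first term penalizes Gaussians with large variance.
print(log_gaussian(1.0, 0.0, 1.0))  # -1.4189 (= -0.5*ln(2*pi) - 0.5)
```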

Returning to the two-dimensional case: if we write the following matrix:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & r\sigma_1\sigma_2 \\ r\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

then, using the notation of linear algebra, we can write

$$N(\mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(O-\mu)^T \Sigma^{-1} (O-\mu)}$$

where $O = (O_1, O_2)$ and $\mu = (\mu_1, \mu_2)$. More generally, $\mu$ and $\Sigma$ can have arbitrary numbers of components, in which case the above is called a multivariate Gaussian. We can write the logarithm of the multivariate Gaussian likelihood as:

$$-\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(O-\mu)^T \Sigma^{-1}(O-\mu)$$

For most problems we will encounter in speech recognition, we will assume that $\Sigma$ is diagonal, so we may write the above as:

$$-\frac{n}{2}\ln(2\pi) - \sum_{i=1}^{n} \ln \sigma_i - \frac{1}{2}\sum_{i=1}^{n} \frac{(O_i - \mu_i)^2}{\sigma_i^2}$$

Again, note the similarity to a weighted Euclidean distance.

Estimating Gaussians

Given a set of observations $O_1, O_2, \ldots, O_N$, it can be shown that $\mu$ and $\Sigma$ can be estimated as:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} O_i \quad\text{and}\quad \Sigma = \frac{1}{N}\sum_{i=1}^{N} (O_i - \mu)^T(O_i - \mu)$$

How do we actually derive these formulas?
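As a quick sanity check on these formulas, here is a small sketch of our own (using numpy; the sample sizes and parameter values are made up) that estimates $\mu$ and the diagonal of $\Sigma$ from data:

```python
import numpy as np

rng = np.random.default_rng(0)
# N observations of an n-dimensional feature vector (here n = 2)
O = rng.normal(loc=[1.0, -2.0], scale=[0.5, 2.0], size=(10000, 2))

mu = O.mean(axis=0)                 # mu = (1/N) sum_i O_i
var = ((O - mu) ** 2).mean(axis=0)  # diagonal of Sigma = (1/N) sum_i (O_i - mu)^2

print(mu)   # close to [1.0, -2.0]
print(var)  # close to [0.25, 4.0]
```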

Maximum-Likelihood Estimation

For simplicity, we will assume a univariate Gaussian. We can write the likelihood of a string of observations $O_1^N = O_1, O_2, \ldots, O_N$ as the product of the individual likelihoods:

$$L(O_1^N|\mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(O_i-\mu)^2}{2\sigma^2}}$$

It is much easier to work with $\mathcal{L} = \ln L$:

$$\mathcal{L}(O_1^N|\mu, \sigma) = -\frac{N}{2}\ln 2\pi\sigma^2 - \frac{1}{2}\sum_{i=1}^{N} \frac{(O_i-\mu)^2}{\sigma^2}$$

To find $\mu$ and $\sigma$ we can take the partial derivatives of the above expressions:

$$\frac{\partial \mathcal{L}(O_1^N|\mu, \sigma)}{\partial \mu} = \sum_{i=1}^{N} \frac{O_i - \mu}{\sigma^2} \qquad (1)$$

$$\frac{\partial \mathcal{L}(O_1^N|\mu, \sigma)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2}\sum_{i=1}^{N} \frac{(O_i - \mu)^2}{\sigma^4} \qquad (2)$$

By setting the above terms equal to zero and solving for $\mu$ and $\sigma$, we obtain the classic formulas for estimating the means and variances. Since we are setting the parameters based on maximizing the likelihood of the observations, this process is called Maximum-Likelihood Estimation, or just ML estimation.

Problems with Gaussian Assumption

What can we do? Well, in this case, we can try modeling this with two Gaussians:

$$L(O) = p_1 \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(O-\mu_1)^2}{2\sigma_1^2}} + p_2 \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(O-\mu_2)^2}{2\sigma_2^2}}$$

where $p_1 + p_2 = 1$.

More generally, we can use an arbitrary number of Gaussians:

$$\sum_i p_i \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{(O-\mu_i)^2}{2\sigma_i^2}}$$

This is generally referred to as a Mixture of Gaussians, a Gaussian Mixture Model, or a GMM. Essentially any distribution of interest can be modeled with GMMs.
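As an illustration (a sketch of our own; the parameter values are made up), evaluating a two-component mixture density is just a weighted sum of the individual Gaussian likelihoods:

```python
import numpy as np

def gaussian(o, mu, sigma):
    return np.exp(-(o - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def gmm_likelihood(o, p, mu, sigma):
    """L(o) = sum_i p_i * N(o; mu_i, sigma_i), with sum_i p_i = 1."""
    return sum(p_i * gaussian(o, m_i, s_i) for p_i, m_i, s_i in zip(p, mu, sigma))

# A bimodal density that a single Gaussian would model poorly
print(gmm_likelihood(0.0, p=[0.4, 0.6], mu=[-2.0, 3.0], sigma=[1.0, 2.0]))
```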

Issues with ML Estimation of GMMs

How many Gaussians? (to be discussed later....)

Infinite solutions: for the two-mixture case above, we can write the overall log-likelihood of the data as:

$$\sum_{i=1}^{N} \ln\left[ p_1 \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(O_i-\mu_1)^2}{2\sigma_1^2}} + p_2 \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(O_i-\mu_2)^2}{2\sigma_2^2}} \right]$$

Say we set $\mu_1 = O_1$ (taking $p_1 = p_2 = \frac{1}{2}$). We can then write the above as

$$\ln\left[ \frac{1}{2\sqrt{2\pi}\,\sigma_1} + \frac{1}{2\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(O_1-\mu_2)^2}{2\sigma_2^2}} \right] + \sum_{i=2}^{N} \ldots$$

which clearly goes to $\infty$ as $\sigma_1 \to 0$. Empirically, we can restrict our attention to the finite local maxima of the likelihood function. This can be done by such techniques as flooring the variance and eliminating solutions in which $\mu$ is estimated from essentially a single data point.

Solving the equations: very ugly. Unlike the single-Gaussian case, a closed-form solution does not exist. What are some methods you could imagine?

Estimating Mixtures of Gaussians - Intuition

Can we break down the problem? (Let's focus on the two-Gaussian case for now.)

■ For each data point $O_i$, if we knew which Gaussian it belonged to, we could just compute $\mu_1$, the mean of the first Gaussian, as
$$\mu_1 = \frac{1}{N_1}\sum_{O_i \in G_1} O_i$$
where $G_1$ is the first Gaussian. Similar formulas follow for the other parameters.
■ Well, we don't know which one it belongs to. So let's devise a scheme to divvy each data point up across the Gaussians.
■ First, make some initial reasonable guesses about the parameter values (more on this later).
■ Second, divvy up each data point using the following formula:
$$C(i, j) = \left[ p_j \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{(O_i-\mu_j)^2}{2\sigma_j^2}} \right] / L(O_i)$$
Observe that this will be a number between 0 and 1. This is called the a posteriori probability of Gaussian $j$ producing $O_i$; in the speech recognition literature, it is also called the count.
■ This probability is a measure of how much a data point can be assumed to "belong" to a particular Gaussian, given a set of parameter values for the means and covariances.
■ We then estimate $\mu$ and $\sigma$ using a modified version of the Gaussian estimation equations presented earlier:
$$\mu_j = \frac{1}{C(j)}\sum_{i=1}^{N} O_i\, C(i, j) \quad\text{and}\quad \sigma_j^2 = \frac{1}{C(j)}\sum_{i=1}^{N} (O_i - \mu_j)^2\, C(i, j)$$
where $C(j) = \sum_i C(i, j)$.
■ Use these estimates and repeat the process several times.
■ A typical stopping criterion is to compute the log-likelihood of the data after each iteration and compare it to the value on the previous iteration.
■ The beauty of this estimation process is that it can be shown not only to increase the likelihood but eventually to converge to a local maximum of the likelihood (the E-M algorithm; more details in a later lecture).
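Here is a compact sketch of this procedure for the two-Gaussian case (our own illustration in numpy; the function name, initialization choices, and data are made up, and EM proper is covered in a later lecture):

```python
import numpy as np

def em_two_gaussians(O, iters=50):
    """Iteratively re-estimate a 2-component univariate GMM using the
    counts C(i, j) and the update formulas above."""
    # Initial reasonable guesses (see the Initialization slide)
    p = np.array([0.5, 0.5])
    mu = np.array([O.min(), O.max()])
    sigma = np.array([O.std(), O.std()])
    for _ in range(iters):
        # E-step: counts C(i, j) = p_j N(O_i; mu_j, sigma_j) / L(O_i)
        lik = p * np.exp(-(O[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
            / (np.sqrt(2 * np.pi) * sigma)           # shape (N, 2)
        C = lik / lik.sum(axis=1, keepdims=True)
        # M-step: count-weighted versions of the single-Gaussian estimates
        Cj = C.sum(axis=0)
        mu = (C * O[:, None]).sum(axis=0) / Cj
        sigma = np.sqrt((C * (O[:, None] - mu) ** 2).sum(axis=0) / Cj)
        sigma = np.maximum(sigma, 1e-3)              # floor the variance (see above)
        p = Cj / len(O)
    return p, mu, sigma

rng = np.random.default_rng(1)
O = np.concatenate([rng.normal(-2, 0.5, 400), rng.normal(3, 1.0, 600)])
print(em_two_gaussians(O))  # p ~ [0.4, 0.6], mu ~ [-2, 3], sigma ~ [0.5, 1.0]
```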

Two-mixture GMM Solution

The log-likelihood of the two-mixture case is

$$\sum_{i=1}^{N} \ln\left[ p_1 \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(O_i-\mu_1)^2}{2\sigma_1^2}} + p_2 \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(O_i-\mu_2)^2}{2\sigma_2^2}} \right]$$

We can use Lagrange multipliers to satisfy the constraint that $p_1 + p_2 = 1$. We take the derivative with respect to each parameter and set the result equal to zero. For $p_1$:

$$\frac{\partial}{\partial p_1}:\quad \sum_{i=1}^{N} \frac{\frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(O_i-\mu_1)^2}{2\sigma_1^2}}}{\frac{p_1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(O_i-\mu_1)^2}{2\sigma_1^2}} + \frac{p_2}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(O_i-\mu_2)^2}{2\sigma_2^2}}} + \lambda = 0$$

Define

$$C(i, j) = \frac{\frac{p_j}{\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{(O_i-\mu_j)^2}{2\sigma_j^2}}}{\frac{p_1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(O_i-\mu_1)^2}{2\sigma_1^2}} + \frac{p_2}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(O_i-\mu_2)^2}{2\sigma_2^2}}}$$

The above equation is then:

$$\sum_{i=1}^{N} C(i, 1)/p_1 + \lambda = 0$$

So we can write

$$p_1 = -\frac{1}{\lambda}\sum_i C(i, 1) = -\frac{1}{\lambda}\, C(1), \qquad p_2 = -\frac{1}{\lambda}\sum_i C(i, 2) = -\frac{1}{\lambda}\, C(2)$$

Since $p_1 + p_2 = 1$ we can write

$$-\frac{1}{\lambda}\,(C(1) + C(2)) = 1 \;\Rightarrow\; \lambda = -(C(1) + C(2))$$

and

$$p_1 = C(1)/(C(1) + C(2)); \qquad p_2 = C(2)/(C(1) + C(2))$$

Similarly,

$$\frac{\partial}{\partial \mu_1}:\quad \sum_{i=1}^{N} C(i, 1)(O_i - \mu_1) = 0$$

implies that

$$\mu_1 = \sum_{i=1}^{N} C(i, 1)\, O_i \,/\, C(1)$$

and, with similar manipulation,

$$\sigma_1^2 = \sum_{i=1}^{N} C(i, 1)(O_i - \mu_1)^2 \,/\, C(1)$$

The $n$-dimensional case is derived in the handout from Duda and Hart.

Initialization

How do we come up with initial values of the parameters? One solution:

■ Set all the $p_i$ to $1/N$
■ Pick $N$ data points at random and use them to seed the initial values of the $\mu_i$
■ Set all the initial $\sigma$s to an arbitrary value, or to the global variance of the data

A similar solution: try multiple starting points and pick the one with the highest overall likelihood (why would one do this?).

Splitting:

■ Initial: compute the global mean and variance
■ Repeat: perturb each mean by $\pm\epsilon$ (doubling the number of Gaussians), then run several iterations of the GMM parameter estimation algorithm

Have never seen a comprehensive comparison of the above two schemes!

Number of Gaussians

Method 1 (most common): guess!

Method 2: penalize the likelihood by the number of parameters, using the Bayesian Information Criterion (BIC) [1]. The penalty is expressed in terms of $k$, the number of clusters, $n_i$, the number of data points in cluster $i$, $N$, the total number of data points, and $d$, the dimensionality of the parameter vector. Such penalty terms can be derived by viewing a GMM as a way of coding data for transmission by sending the id of the closest Gaussian. In such a case, the number of bits required for transmission of the data also includes the cost of transmitting the model itself; the bigger the model, the larger the cost.
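The exact penalized-likelihood expression from [1] is not reproduced here; the sketch below (our own illustration) uses the generic BIC form, log-likelihood minus $\frac{d}{2}\ln N$ for $d$ free parameters, and made-up log-likelihood values standing in for the output of EM runs like the one sketched earlier:

```python
import numpy as np

def bic_score(log_likelihood, num_params, N):
    """Generic BIC: log-likelihood penalized by (d/2) ln N,
    where d is the number of free parameters and N the number of points."""
    return log_likelihood - 0.5 * num_params * np.log(N)

# For a k-component univariate GMM: k means + k variances + (k - 1) free weights
def num_gmm_params(k):
    return 3 * k - 1

# Hypothetical final log-likelihoods from EM runs with k = 1..4 components
loglik_by_k = {1: -2510.0, 2: -2240.0, 3: -2225.0, 4: -2218.0}
N = 1000
best_k = max(loglik_by_k,
             key=lambda k: bic_score(loglik_by_k[k], num_gmm_params(k), N))
print(best_k)  # 3: raw likelihood keeps rising with k, but the penalty picks k = 3
```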

References

[1] S. Chen and P. S. Gopalakrishnan (1998), "Clustering via the Bayesian Information Criterion with Applications in Speech Recognition", Proc. ICASSP-98, Vol. 2, pp. 645-648.

Introduction to Hidden Markov Models

■ The issue of weights in DTW
■ Interpretation of the DTW grid as a directed graph
■ Adding transition and output probabilities to the graph gives us an HMM!
■ The three main HMM operations

Another Issue with Dynamic Time Warping

The weights are completely heuristic! Maybe we can learn the weights from data?

■ Take many utterances
■ For each node on the DP path, count the number of times we move up ↑, right →, and diagonally ր
■ Normalize the count for each direction by the total number of times the node was actually visited
■ Take some constant times the reciprocal as the weight

For example, if a particular node was visited 100 times, and after alignment the diagonal path was taken 50 times and the "up" and "right" paths 25 times each, the weights could be set to 2, 4, and 4, respectively, or (more commonly) 1, 2, and 2. The point is that if a particular direction out of a given node is favored, it makes sense for the weight distribution to reflect it. There is no real solution to weight estimation in DTW, but something called a Hidden Markov Model puts the weight-estimation ideas and the GMM concepts into a single consistent probabilistic framework.
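A toy sketch of this counting scheme (our own illustration; the constant and the counts follow the example above):

```python
from collections import Counter

# Observed moves out of one grid node, accumulated over many alignments
counts = Counter({"diagonal": 50, "up": 25, "right": 25})
visits = sum(counts.values())  # 100

# Weight = constant times the reciprocal of each move's relative frequency
constant = 0.5  # choosing 0.5 yields the (1, 2, 2) weights quoted above
weights = {move: constant * visits / c for move, c in counts.items()}
print(weights)  # {'diagonal': 1.0, 'up': 2.0, 'right': 2.0}
```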

DTW and Directed Graphs

Take the following dynamic time warping setup and consider a compact representation of it as a directed graph. Another common DTW structure likewise has a corresponding directed graph. One can represent even more complex DTW structures, though the resulting directed graphs can get quite bizarre looking....

Path Probabilities

Let us now assign probabilities to the transitions in the directed graph, where $a_{ij}$ is the transition probability for going from state $i$ to state $j$. Note that $\sum_j a_{ij} = 1$. We can compute the probability $P$ of an individual path using just the transition probabilities $a_{ij}$.
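For instance (a minimal sketch of our own; the transition matrix is made up), the probability of a particular state sequence is just the product of the $a_{ij}$ along it:

```python
import numpy as np

# Made-up transition matrix a[i][j]; each row sums to 1
a = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])

path = [0, 0, 1, 1, 2]  # a state sequence through the graph
P = np.prod([a[i, j] for i, j in zip(path, path[1:])])
print(P)  # 0.6 * 0.4 * 0.7 * 0.3 = 0.0504
```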

It is also common (just to be confusing!) to reorient the typical DTW picture.

The above only describes the path probability associated with the transitions. We also need to include the likelihoods associated with the observations. As in the GMM discussion previously, let us define the likelihood of producing observation $O_i$ from state $j$ as

$$b_j(O_i) = \sum_m c_{jm} \frac{1}{(2\pi)^{n/2}|\Sigma_{jm}|^{1/2}}\, e^{-\frac{1}{2}(O_i-\mu_{jm})^T \Sigma_{jm}^{-1}(O_i-\mu_{jm})}$$

where the $c_{jm}$ are the mixture weights associated with state $j$. This state likelihood is also called the output probability associated with the state. In this case the likelihood of an entire path through states $s_1, s_2, \ldots$ can be written as the product of the transition and output probabilities along it:

$$\prod_i a_{s_{i-1} s_i}\, b_{s_i}(O_i)$$

The output and transition probabilities define what is called a Hidden Markov Model or HMM. Since the probabilities of moving from state to state only depend on the current and previous state, the model is Markov. Since we only see the observations and have to infer the states after the fact, we add the term Hidden.

One may consider an HMM to be a generative model of speech. One starts at the upper left corner of the trellis and generates observations according to the permissible transitions and output probabilities. Note also that one can not only compute the likelihood of a single path through the HMM, but one can also compute the overall likelihood of producing a string of observations from the HMM as the sum of the likelihoods of the individual paths through the HMM.

HMM - The Three Main Tasks

Given the above formulation, the three main computations associated with an HMM are:

■ Compute the likelihood of generating a string of observations from the HMM (the Forward algorithm)
■ Compute the best path through the HMM (the Viterbi algorithm)
■ Learn the parameters (output and transition probabilities) of the HMM from data (the Baum-Welch, a.k.a. Forward-Backward, algorithm)
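As a preview (a sketch of our own, not the course's reference implementation; the matrices are made up, and the output probabilities are assumed to be precomputed, e.g., by per-state GMMs), the Forward algorithm computes the sum-over-paths likelihood described above with a simple dynamic-programming recursion:

```python
import numpy as np

def forward_likelihood(a, b, pi):
    """Total likelihood of the observations under an HMM.
    a[i, j]: transition probabilities; pi: initial state distribution;
    b[t, j]: output probability b_j(O_t), already evaluated per frame."""
    alpha = pi * b[0]               # alpha[j] = P(O_1, state = j)
    for t in range(1, len(b)):
        alpha = (alpha @ a) * b[t]  # sum over predecessor states, then emit
    return alpha.sum()              # sum over final states = sum over all paths

# Toy 2-state example with made-up numbers
a = np.array([[0.7, 0.3], [0.0, 1.0]])
pi = np.array([1.0, 0.0])
b = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])  # 3 observations
print(forward_likelihood(a, b, pi))
```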

Course Feedback

■ Was this lecture mostly clear or unclear? What was the muddiest topic?
■ Other feedback (pace, content, atmosphere)?