Lecture 10 Discriminative Training, ROVER, and Consensus Michael - PowerPoint PPT Presentation

Lecture 10 Discriminative Training, ROVER, and Consensus Michael Picheny, Bhuvana Ramabhadran, Stanley F . Chen IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen}@us.ibm.com 10 December 2012

General Motivation The primary goal for a speech recognition system is to accurately recognize the words. The modeling and adaptation techniques we have studied till now implicitly address this goal. Today we will focus on techniques explicitly targeted to improving accuracy. 2 / 90

Where Are We? Linear Discriminant Analysis 1 Maximum Mutual Information Estimation 2 ROVER 3 Consensus Decoding 4 3 / 90

Where Are We? Linear Discriminant Analysis 1 LDA - Motivation Eigenvectors and Eigenvalues PCA - Derivation LDA - Derivation Applying LDA to Speech Recognition 4 / 90

Linear Discriminant Analysis - Motivation In a typical HMM using Gaussian Mixture Models we assume diagonal covariances. This assumes that the classes to be discriminated between lie along the coordinate axes: What if that is NOT the case? 5 / 90

Principle Component Analysis-Motivation We are in trouble. First, we can try to rotate the coordinate axes to better lie along the main sources of variation. 6 / 90

Linear Discriminant Analysis - Motivation If the main sources of class variation do NOT lie along the main source of variation we need to find the best directions: 7 / 90

Linear Discriminant Analysis - Computation How do we find the best directions? 8 / 90

Eigenvectors and Eigenvalues A key concept in finding good directions are the eigenvalues and eigenvectors of a matrix. The eigenvalues and eigenvectors of a matrix are defined by the following matrix equation: Ax = λ x For a given matrix A the eigenvectors are defined as those vectors x for which the above statement is true. Each eigenvector has an associated eigenvalue, λ . 10 / 90

Eigenvectors and Eigenvalues - continued To solve this equation, we can rewrite it as ( A − λ I ) x = 0 If x is non-zero, the only way this equation can be solved is if the determinant of the matrix ( A − λ I ) is zero. The determinant of this matrix is a polynomial (called the characteristic polynomial ) p ( λ ) . The roots of this polynomial will be the eigenvalues of A . 11 / 90

Eigenvectors and Eigenvalues - continued For example, let us say � � 2 − 4 A = . − 1 − 1 In such a case, � � 2 − λ − 4 � � p ( λ ) = � � − 1 − 1 − λ � � = ( 2 − λ )( − 1 − λ ) − ( − 4 )( − 1 ) λ 2 − λ − 6 = = ( λ − 3 )( λ + 2 ) Therefore, λ 1 = 3 and λ 2 = − 2 are the eigenvalues of A . 12 / 90

Eigenvectors and Eigenvalues - continued To find the eigenvectors, we simply plug in the eigenvalues into ( A − λ I ) x = 0 and solve for x . For example, for λ 1 = 3 we get � 2 − 3 � � x 1 � 0 � � − 4 = − 1 − 1 − 3 x 2 0 Solving this, we find that x 1 = − 4 x 2 , so all the eigenvector corresponding to λ 1 = 3 is a multiple of [ − 4 1 ] T . Similarly, we find that the eigenvector corresponding to λ 1 = − 2 is a multiple of [ 1 1 ] T . 13 / 90

Principal Component Analysis-Derivation PCA assumes that the directions with "maximum" variance are the "best" directions for discrimination. Do you agree? Problem 1: First consider the problem of "best" representing a set of vectors x 1 , x 2 , . . . , x n by a single vector x 0 . Find x 0 that minimizes the sum of the squared distances from the overall set of vectors. N � | x k − x 0 | 2 J 0 ( x 0 ) = k = 1 15 / 90

Principal Component Analysis-Derivation PCA assumes that the directions with "maximum" variance are the "best" directions for discrimination. Do you agree? Problem 1: First consider the problem of "best" representing a set of vectors x 1 , x 2 , . . . , x n by a single vector x 0 . Find x 0 that minimizes the sum of the squared distances from the overall set of vectors. N � | x k − x 0 | 2 J 0 ( x 0 ) = k = 1 16 / 90

Principal Component Analysis-Derivation It is easy to show that the sample mean, m , minimizes J 0 , where m is given by N m = x 0 = 1 � x k N k = 1 17 / 90

Principal Component Analysis-Derivation Problem 2: Given we have the mean m , how do we find the next single direction that best explains the variation between vectors? Let e be a unit vector in this "best" direction. In such a case, we can express a vector x as x = m + a e 18 / 90

Principal Component Analysis-Derivation For the vectors x k we can find a set of a k s that minimizes the mean square error: N � | x k − ( m + a k e ) | 2 J 1 ( a 1 , a 2 , . . . , a N , e ) = k = 1 If we differentiate the above with respect to a k we get a k = e T ( x k − m ) 19 / 90

Principal Component Analysis-Derivation How do we find the best direction e ? If we substitute the above solution for a k into the formula for the overall mean square error we get after some manipulation: N � J 1 ( e ) = − e T Se + | x k − m | 2 k = 1 where S is called the Scatter matrix and is given by: N � ( x k − m )( x k − m ) T S = k = 1 Notice the scatter matrix just looks like N times the sample covariance matrix of the data. 20 / 90

Principal Component Analysis-Derivation To minimize J 1 we want to maximize e T Se subject to the constraint that | e | = e T e = 1. Using Lagrange multipliers we write u = e T Se − λ e T e Differentiating u w.r.t e and setting to zero we get: 2 Se − 2 λ e = 0 or Se = λ e So to maximize e T Se we want to select the eigenvector of S corresponding to the largest eigenvalue of S . 21 / 90

Principal Component Analysis-Derivation Problem 3: How do we find the best d directions? Express x as d � x = m + a i e i i = 1 In this case, we can write the mean square error as N d � � a ki e i ) − x k | 2 J d = | ( m + k = 1 i = 1 and it is not hard to show that J d is minimized when the vectors e 1 , e 2 , . . . , e d correspond to the d largest eigenvectors of the scatter matrix S . 22 / 90

Linear Discriminant Analysis - Derivation What if the class variation does NOT lie along the directions of maximum data variance? Let us say we have vectors corresponding to c classes of data. We can define a set of scatter matrices as above as � ( x − m i )( x − m i ) T S i = x ∈D i where m i is the mean of class i . In this case we can define the within-class scatter (essentially the average scatter across the classes relative to the mean of each class) as just: c � S W = S i i = 1 24 / 90

Linear Discriminant Analysis - Derivation 25 / 90

Linear Discriminant Analysis - Derivation Another useful scatter matrix is the between class scatter matrix, defined as c � ( m i − m )( m i − m ) T S B = i = 1 26 / 90

Linear Discriminant Analysis - Derivation We would like to determine a set of directions V such that the classes c are maximally discriminable in the new coordinate space given by ˜ x = Vx 27 / 90

Linear Discriminant Analysis - Derivation A reasonable measure of discriminability is the ratio of the volumes represented by the scatter matrices. Since the determinant of a matrix is a measure of the corresponding volume, we can use the ratio of determinants as a measure: J = | S B | | S W | Why is this a good thing? So we want to find a set of directions that maximize this expression. 28 / 90

Linear Discriminant Analysis - Derivation A reasonable measure of discriminability is the ratio of the volumes represented by the scatter matrices. Since the determinant of a matrix is a measure of the corresponding volume, we can use the ratio of determinants as a measure: J = | S B | | S W | Why is this a good thing? So we want to find a set of directions that maximize this expression. 29 / 90

Linear Discriminant Analysis - Derivation With a little bit of manipulation similar to that in PCA, it turns out that the solution are the eigenvectors of the matrix S − 1 W S B which can be generated by most common mathematical packages. 30 / 90

Linear Discriminant Analysis in Speech Recognition The most successful uses of LDA in speech recognition are achieved in an interesting fashion. Speech recognition training data are aligned against the underlying words using the Viterbi alignment algorithm described in Lecture 4. Using this alignment, each cepstral vector is tagged with a different phone or sub-phone. For English this typically results in a set of 156 (52x3) classes. For each time t the cepstral vector x t is spliced together with N / 2 vectors on the left and right to form a “supervector” of N cepstral vectors. ( N is typically 5-9 frames.) Call this “supervector” y t . 32 / 90

Lecture 10 Discriminative Training, ROVER, and Consensus Michael - PowerPoint PPT Presentation

Lecture 10 Discriminative Training, ROVER, and Consensus Michael Picheny, Bhuvana Ramabhadran, Stanley F . Chen IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen}@us.ibm.com 10 December 2012 General

Malaysian Healthy Ageing Society Plenary Lecture Plenary Lecture Plenary Lecture Plenary

CEE 680 Lecture #2 1/22/2020 1 CEE 680 Lecture #2 1/22/2020 2 CEE 680 Lecture #2

Pocket Lecture Pocket Lecture Pocket Lecture Pocket Lecture Listen Audio Notes Progress

Multiphase Modelling in Cancer Helen Byrne Wolfson Centre for Mathematical Biology Mathematical

Previous Lecture Todays Lecture Slides for Lecture 5 ENEL 353: Digital Circuits Fall 2013

Previous Lecture Todays Lecture Slides for Lecture 30 ENEL 353: Digital Circuits Fall

Previous Lecture Todays Lecture Slides for Lecture 28 Completion of divide-by-3 counter

Previous Lecture Todays Lecture Slides for Lecture 12 ENEL 353: Digital Circuits Fall

Previous Lecture Todays Lecture Slides for Lecture 3 ENEL 353: Digital Circuits Fall 2013

Previous Lecture Todays Lecture Slides for Lecture 2 ENEL 353: Digital Circuits Fall 2013

Previous Lecture Todays Lecture Slides for Lecture 35 ENEL 353: Digital Circuits Fall

Lecture Capture Introduction to Lecture Capture Learning Outcomes What will lecture capture

Previous Lecture Todays Lecture Slides for Lecture 32 Completion of a timing analysis

Repetition Automatic Control, Basic Course, Lecture 11 Fredrik Bagge Carlson December 17, 2016

Previous Lecture Todays Lecture Slides for Lecture 26 ENEL 353: Digital Circuits Fall

Previous Lecture Todays Lecture Slides for Lecture 33 ENEL 353: Digital Circuits Fall

Chapter 6. Converter Circuits Where do the boost, 6.1. Circuit manipulations buck-boost,

Physical Information Security Fall 2010 CS461/ECE422 Computer Security I Reading Material

Milestone Meeting #3 - Demo - Michael Knigs Michael Buler Virtual Aachen Project 08 Team

Imperfect Competition, Compensating Differentials and Rent Sharing in the U.S. Labor Market June

OFTOs and Round 3 Chris Veal When to initiate OFTO appointment Options cover a spectrum but

Dialogue Dan Jurafsky Lecture 6: Waveform Synthesis (in Concatenative TTS) IP Notice: many of

One-of-a-kind Research network ECMC Programme Office A Research Network like no other Advancing

Cancellation of the Maternal and Extraction of the Fetal ECG in Noninvasive Recordings Ivaylo

Sambuz

Useful Links

Newsletter

Mail Us