
Acoustic Modeling for Speech Recognition (Berlin Chen, 2003)



  1. Acoustic Modeling for Speech Recognition, Berlin Chen, 2003. References: 1. X. Huang et al., Spoken Language Processing, Chapter 8; 2. The HTK Book (for HTK Version 3.2)

  2. Introduction
     • For a given acoustic observation X = x_1 x_2 ... x_n, the goal of speech recognition is to find the corresponding word sequence W = w_1 w_2 ... w_m (where each w_i belongs to the vocabulary V = {v_1, v_2, ..., v_N}) that has the maximum posterior probability P(W|X):

         \hat{W} = \arg\max_{W} P(W \mid X)
                 = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
                 = \arg\max_{W} P(X \mid W)\, P(W)

     • P(X|W) is the acoustic model, which must account for possible speaker, pronunciation, environment, and context variations; P(W) is the language model, which captures domain, topic, style, etc. (to be discussed later on)
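In practice the decision rule is applied by adding the two log-probability scores of each competing hypothesis. A minimal sketch of that step in C follows; the hypothesis list, score values, and function names are purely illustrative and are not part of the original slides.

    /* Sketch of the MAP decision rule above: scores are assumed to be
       precomputed log-probabilities, so P(X|W)P(W) becomes a sum. */
    #include <stdio.h>

    typedef struct {
        const char *words;    /* candidate word sequence W          */
        double log_acoustic;  /* log P(X|W) from the acoustic model */
        double log_lm;        /* log P(W)  from the language model  */
    } Hypothesis;

    /* Return the index of the hypothesis maximizing log P(X|W) + log P(W). */
    int map_decode(const Hypothesis *hyps, int n)
    {
        int best = 0;
        double best_score = hyps[0].log_acoustic + hyps[0].log_lm;
        for (int i = 1; i < n; i++) {
            double score = hyps[i].log_acoustic + hyps[i].log_lm;
            if (score > best_score) { best_score = score; best = i; }
        }
        return best;
    }

    int main(void)
    {
        Hypothesis hyps[] = {   /* illustrative scores only */
            { "the effect is clear", -120.3, -8.1 },
            { "effect is not clear", -118.9, -10.6 },
        };
        printf("best: %s\n", hyps[map_decode(hyps, 2)].words);
        return 0;
    }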

  3. Review: HMM Modeling
     • Acoustic modeling using HMMs: the cepstral feature vectors extracted from overlapping speech frames (time domain converted to the frequency domain) are modeled [figure not reproduced]
     • Three types of HMM state output probabilities are used (discrete, continuous, and semicontinuous, reviewed on the following slides)

  4. Review: HMM Modeling
     • Discrete HMM (DHMM): b_j(v_k) = P(o_t = v_k | s_t = j)
       – The observations are quantized into a number of symbols
       – The symbols are normally generated by a vector quantizer
       [figure not reproduced: a left-to-right HMM]
       – With multiple codebooks (m is the codebook index):

           b_j(v_k) = \sum_{m=1}^{M} c_{jm}\, p_m(\mathbf{o}_t = v_k \mid s_t = j), \qquad \sum_{m=1}^{M} c_{jm} = 1
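A minimal sketch of evaluating this multi-codebook DHMM output probability in C; the data layout is my own assumption (not HTK's), and each codebook/stream is assumed to produce one VQ label per frame.

    /* b_j(v_k) = sum_m c_jm * p_m(o_t = v_k | s_t = j) for one state j. */
    #define M 2    /* number of codebooks (streams)        */
    #define K 256  /* codebook size (number of VQ symbols) */

    typedef struct {
        double c[M];      /* stream weights c_jm, summing to 1 over m        */
        double p[M][K];   /* per-codebook symbol probabilities p_m(v_k | s=j) */
    } DiscreteState;

    /* Output probability of state j for the VQ labels k[0..M-1] of one frame. */
    double dhmm_output_prob(const DiscreteState *j, const int k[M])
    {
        double b = 0.0;
        for (int m = 0; m < M; m++)
            b += j->c[m] * j->p[m][k[m]];
        return b;
    }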

  5. Review: HMM Modeling
     • Continuous HMM (CHMM)
       – The state observation distribution of the HMM is modeled by a multivariate Gaussian mixture density function (M mixtures):

           b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\, b_{jm}(\mathbf{o}_t)
                             = \sum_{m=1}^{M} c_{jm}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})
                             = \sum_{m=1}^{M} \frac{c_{jm}}{(2\pi)^{L/2}\, |\boldsymbol{\Sigma}_{jm}|^{1/2}}
                               \exp\!\left( -\tfrac{1}{2} (\mathbf{o}_t - \boldsymbol{\mu}_{jm})^{T} \boldsymbol{\Sigma}_{jm}^{-1} (\mathbf{o}_t - \boldsymbol{\mu}_{jm}) \right),
           \qquad \sum_{m=1}^{M} c_{jm} = 1

         where L is the dimensionality of the observation vector o_t
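A minimal sketch of evaluating this state likelihood in C, assuming diagonal covariance matrices (a common simplification); the feature dimension, mixture count, and struct layout are my own choices, not taken from the slides.

    /* b_j(o_t) = sum_m c_jm * N(o_t; mu_jm, Sigma_jm), diagonal Sigma assumed. */
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define DIM 39   /* feature dimension L (e.g., MFCCs plus deltas) */
    #define MIX  8   /* number of mixture components M                */

    typedef struct {
        double c[MIX];         /* mixture weights c_jm                    */
        double mu[MIX][DIM];   /* component means mu_jm                   */
        double var[MIX][DIM];  /* diagonal covariance entries of Sigma_jm */
    } GmmState;

    double chmm_output_prob(const GmmState *s, const double o[DIM])
    {
        double b = 0.0;
        for (int m = 0; m < MIX; m++) {
            double log_n = -0.5 * DIM * log(2.0 * M_PI);   /* (2*pi)^(-L/2) */
            for (int d = 0; d < DIM; d++) {
                double diff = o[d] - s->mu[m][d];
                /* |Sigma|^(-1/2) and the Mahalanobis term, dimension by dimension */
                log_n -= 0.5 * (log(s->var[m][d]) + diff * diff / s->var[m][d]);
            }
            b += s->c[m] * exp(log_n);
        }
        return b;
    }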

  6. Review: HMM Modeling
     • Semicontinuous or tied-mixture HMM (SCHMM)
       – The HMM state mixture density functions are tied together across all the models to form a set of shared kernels (shared Gaussians):

           b_j(\mathbf{o}_t) = \sum_{k=1}^{L} b_j(k)\, f(\mathbf{o}_t \mid v_k)
                             = \sum_{k=1}^{L} b_j(k)\, N(\mathbf{o}_t; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

       – With multiple codebooks:

           b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_m \sum_{k=1}^{L} b_{jm}(k)\, f(\mathbf{o}_t \mid v_{m,k})
                             = \sum_{m=1}^{M} c_m \sum_{k=1}^{L} b_{jm}(k)\, N(\mathbf{o}_t; \boldsymbol{\mu}_{m,k}, \boldsymbol{\Sigma}_{m,k})
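A computational benefit of tying is that the shared kernels only need to be evaluated once per frame; every state then mixes the cached values with its own weights. A minimal sketch of that reuse (array sizes and names are my assumptions, not the slides' or HTK's):

    #define NKERNELS 256   /* number of shared Gaussian kernels L */

    typedef struct {
        double w[NKERNELS];   /* state-specific mixture weights b_j(k) */
    } TiedState;

    /* kernel[k] is assumed to hold N(o_t; mu_k, Sigma_k), evaluated once per
       frame for the shared codebook and then reused by every state j. */
    double schmm_output_prob(const TiedState *j, const double kernel[NKERNELS])
    {
        double b = 0.0;
        for (int k = 0; k < NKERNELS; k++)
            b += j->w[k] * kernel[k];
        return b;
    }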

  7. Review: HMM Modeling
     • Comparison of recognition performance [table/figure not reproduced]

  8. Measures of Speech Recognition Performance
     • Evaluating the performance of speech recognition systems is critical, and the word recognition error rate (WER) is one of the most important measures
     • There are typically three types of word recognition errors
       – Substitution: an incorrect word was substituted for the correct word
       – Deletion: a correct word was omitted from the recognized sentence
       – Insertion: an extra word was added to the recognized sentence
     • How is the minimum error rate determined?

  9. Measures of Speech Recognition Performance
     • Calculate the WER by aligning the correct word string against the recognized word string
       – A maximum substring matching problem
       – Can be handled by dynamic programming
     • Example:
         Correct:    "the effect is clear"
         Recognized: "effect is not clear"
       ("the" was deleted, "not" was inserted; "effect", "is", and "clear" are matched)
       – Error analysis: one deletion and one insertion
       – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

         Word Error Rate (WER)      = (Sub. + Del. + Ins.) / (No. of words in the correct sentence) x 100% = 2/4 = 50%   (might be higher than 100%)
         Word Correction Rate (WCR) = Matched words / (No. of words in the correct sentence) x 100% = 3/4 = 75%
         Word Accuracy Rate (WAR)   = (Matched - Ins. words) / (No. of words in the correct sentence) x 100% = (3 - 1)/4 = 50%   (might be negative)

       – Note that WER + WAR = 100%
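A minimal sketch in C of the three measures, plugged with the alignment counts from the example above (the arithmetic only; the alignment itself is covered on the next slides):

    #include <stdio.h>

    int main(void)
    {
        int sub = 0, del = 1, ins = 1, hit = 3;  /* counts from the alignment   */
        int n_ref = 4;                           /* words in the correct sentence */

        double wer = 100.0 * (sub + del + ins) / n_ref;  /* 50% */
        double wcr = 100.0 * hit / n_ref;                /* 75% */
        double war = 100.0 * (hit - ins) / n_ref;        /* 50% */

        printf("WER %.0f%%  WCR %.0f%%  WAR %.0f%%\n", wer, wcr, war);
        return 0;
    }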

  10. Measures of Speech Recognition Performance
     • A Dynamic Programming Algorithm (Textbook)
       – m denotes the word length of the correct/reference sentence; n denotes the word length of the recognized/test sentence
       – grid[i][j] holds the minimum word error (and the kind of alignment/hit) for the alignment ending at grid point [i, j], with the test words along one axis and the reference words along the other
       [grid figure not reproduced]

  11. Measures of Speech Recognition Performance
     • Algorithm (by Berlin Chen)

       Step 1: Initialization:
         G[0][0] = 0;
         for i = 1,...,n {               // test
           G[i][0] = G[i-1][0] + 1;
           B[i][0] = 1;                  // Insertion (horizontal direction)
         }
         for j = 1,...,m {               // reference
           G[0][j] = G[0][j-1] + 1;
           B[0][j] = 2;                  // Deletion (vertical direction)
         }

       Step 2: Iteration:
         for i = 1,...,n {               // test
           for j = 1,...,m {             // reference
             G[i][j] = min( G[i-1][j] + 1,                          // Insertion
                            G[i][j-1] + 1,                          // Deletion
                            G[i-1][j-1] + 1  (if LR[j] != LT[i]),   // Substitution
                            G[i-1][j-1]      (if LR[j] == LT[i]) )  // Match
             B[i][j] = 1   // Insertion    (horizontal direction)
                       2   // Deletion     (vertical direction)
                       3   // Substitution (diagonal direction)
                       4   // Match        (diagonal direction)
                       according to which term attains the minimum
           } // for j, reference
         } // for i, test

       Note: the penalties for substitution, deletion, and insertion errors are all set to 1 here.

       Step 3: Measure and Backtrace:
         Word Error Rate = G[n][m] / m x 100%
         Word Accuracy Rate = 100% - Word Error Rate
         Optimal backtrace path: B[n][m] -> ... -> B[0][0]
           if B[i][j] == 1       print "LT[i]";        // Insertion, then go left
           else if B[i][j] == 2  print "LR[j]";        // Deletion, then go down
           else                  print "LR[j] LT[i]";  // Hit/Match or Substitution, then go down diagonally
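The pseudocode above translates fairly directly into C. The next two slides show the HTK-style version that also tracks per-cell error counts; the sketch below is a more compact standalone rendering (variable names are mine, not the slide's) that fills G and B as in Steps 1-2 and reports the WER of Step 3 for the running example.

    #include <stdio.h>
    #include <string.h>

    #define MAXW 32

    int main(void)
    {
        const char *ref[]  = { "the", "effect", "is", "clear" };   /* LR, length m */
        const char *test[] = { "effect", "is", "not", "clear" };   /* LT, length n */
        int m = 4, n = 4;
        int G[MAXW + 1][MAXW + 1], B[MAXW + 1][MAXW + 1];

        /* Step 1: initialization */
        G[0][0] = 0; B[0][0] = 0;
        for (int i = 1; i <= n; i++) { G[i][0] = G[i-1][0] + 1; B[i][0] = 1; }  /* ins */
        for (int j = 1; j <= m; j++) { G[0][j] = G[0][j-1] + 1; B[0][j] = 2; }  /* del */

        /* Step 2: iteration (all penalties are 1) */
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int match = (strcmp(ref[j-1], test[i-1]) == 0);
                int diag  = G[i-1][j-1] + (match ? 0 : 1);
                int ins   = G[i-1][j] + 1;
                int del   = G[i][j-1] + 1;
                if (diag <= ins && diag <= del) { G[i][j] = diag; B[i][j] = match ? 4 : 3; }
                else if (ins < del)             { G[i][j] = ins;  B[i][j] = 1; }
                else                            { G[i][j] = del;  B[i][j] = 2; }
            }
        }

        /* Step 3: measure */
        printf("WER = %.0f%%\n", 100.0 * G[n][m] / m);   /* prints WER = 50% */
        return 0;
    }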

  12. Measures of Speech Recognition Performance
     • A Dynamic Programming Algorithm (HTK-style): Initialization

       grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
       grid[0][0].sub = grid[0][0].hit = 0;
       grid[0][0].dir = NIL;

       for (i = 1; i <= n; i++) {          /* test (recognized word sequence) */
         grid[i][0] = grid[i-1][0];
         grid[i][0].dir = HOR;
         grid[i][0].score += insPen;
         grid[i][0].ins++;
       }

       for (j = 1; j <= m; j++) {          /* reference (correct word sequence) */
         grid[0][j] = grid[0][j-1];
         grid[0][j].dir = VERT;
         grid[0][j].score += delPen;
         grid[0][j].del++;
       }

       [grid figure not reproduced: the first row of the grid accumulates insertions (1 Ins., 2 Ins., 3 Ins., ...) along the test axis, and the first column accumulates deletions (1 Del., 2 Del., 3 Del., ...) along the reference axis]

  13. Measures of Speech Recognition Performance
     • Program (iteration part)

       for (i = 1; i <= n; i++) {                /* test */
         gridi = grid[i]; gridi1 = grid[i-1];
         for (j = 1; j <= m; j++) {              /* reference */
           h = gridi1[j].score + insPen;
           d = gridi1[j-1].score;
           if (lRef[j] != lTest[i])
             d += subPen;
           v = gridi[j-1].score + delPen;
           if (d <= h && d <= v) {               /* DIAG = hit or sub */
             gridi[j] = gridi1[j-1];
             gridi[j].score = d;
             gridi[j].dir = DIAG;
             if (lRef[j] == lTest[i]) ++gridi[j].hit;
             else                     ++gridi[j].sub;
           } else if (h < v) {                   /* HOR = ins */
             gridi[j] = gridi1[j];
             gridi[j].score = h;
             gridi[j].dir = HOR;
             ++gridi[j].ins;
           } else {                              /* VERT = del */
             gridi[j] = gridi[j-1];
             gridi[j].score = v;
             gridi[j].dir = VERT;
             ++gridi[j].del;
           }
         } /* for j */
       } /* for i */

     • Example 1 (grid entries are (Ins, Del, Sub, Hit) counts)
         Correct: A C B C C
         Test:    B A B C
       [alignment grid not reproduced]
       Alignment 1 (WER = 60%): Ins B, Hit A, Del C, Hit B, Hit C, Del C
       There is still another optimal alignment!
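The fragments on slides 12 and 13 use a grid of cells with score, direction, and error-count fields, label arrays, and penalty constants that the slides never declare. The declarations below are my own plausible reconstruction (the field names follow the slides; everything else, including the interned-label assumption, is a guess), added so the fragments can be read as compilable C.

    #define MAXN 256   /* maximum test-sentence length      */
    #define MAXM 256   /* maximum reference-sentence length */

    enum Direction { NIL = 0, DIAG, HOR, VERT };   /* backtrace directions */

    typedef struct {
        int score;                /* accumulated alignment cost                  */
        enum Direction dir;       /* how this cell was reached (for backtracing) */
        int ins, del, sub, hit;   /* running error/match counts along the path   */
    } Cell;

    Cell grid[MAXN + 1][MAXM + 1];
    Cell *gridi, *gridi1;

    /* Labels are assumed to be interned ids (unique pointers, much like HTK's
       LabId), so lRef[j] != lTest[i] is a valid word-equality test. */
    typedef const char *LabId;
    LabId lRef[MAXM + 1], lTest[MAXN + 1];

    int insPen = 1, delPen = 1, subPen = 1;   /* unit penalties (cf. slide 11)    */
    int n, m;                                 /* test / reference word counts     */
    int i, j, h, d, v;                        /* loop indices and candidate costs */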

  14. Measures of Speech Recognition Performance
     • Example 2 (grid entries are (Ins, Del, Sub, Hit) counts; the penalties for substitution, deletion, and insertion errors are all set to 1 here)
         Correct: A C B C C
         Test:    B A A C
       [alignment grid not reproduced]
       Alignment 1 (WER = 80%): Ins B, Hit A, Del C, Sub B, Hit C, Del C
       Alignment 2 (WER = 80%): Ins B, Hit A, Sub C, Del B, Hit C, Del C
       Alignment 3 (WER = 80%): Ins B, Hit A, Del C, Sub B, Del C, Hit C
       All three alignments have the same total cost of 4 errors, so the dynamic programming grid admits multiple optimal backtrace paths
