SLIDE 1

Acoustic Modeling for Speech Recognition

References:

  • 1. X. Huang et al., Spoken Language Processing, Chapter 8
  • 2. The HTK Book (for HTK Version 3.2)

Berlin Chen 2003

SLIDE 2

Introduction

  • For the given acoustic observation X = x1, x2, ..., xn, the goal of speech recognition is to find the corresponding word sequence W = w1, w2, ..., wm that has the maximum posterior probability P(W|X):

      Ŵ = argmax_W P(W|X) = argmax_W [ P(W) P(X|W) / P(X) ] = argmax_W P(W) P(X|W)

      where W = w1, w2, ..., wi, ..., wm and wi ∈ V = {v1, v2, ..., vN}

  • P(W): Language Modeling
    – Possible variations: domain, topic, style, etc.
  • P(X|W): Acoustic Modeling
    – Possible variations: speaker, pronunciation, environment, context, etc.

    (To be discussed later on!)
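    A minimal illustrative sketch (not from the slides; the candidate strings and all log-probability values are made up) of how the decomposition above is used: the product P(W)P(X|W) is maximized in the log domain by summing a language-model score and an acoustic-model score for each candidate word sequence.

      #include <stdio.h>

      int main(void)
      {
          /* hypothetical candidates with assumed log-domain scores */
          const char *cand[] = { "the effect is clear", "effect is not clear", "the effect is near" };
          double logPW[]  = { -12.3, -15.1, -14.0 };    /* language model:  log P(W)   */
          double logPXW[] = { -310.5, -305.2, -309.8 }; /* acoustic model:  log P(X|W) */

          int best = 0;                                 /* argmax of log P(W) + log P(X|W) */
          for (int i = 1; i < 3; i++)
              if (logPW[i] + logPXW[i] > logPW[best] + logPXW[best])
                  best = i;

          printf("W_hat = \"%s\"\n", cand[best]);
          return 0;
      }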

SLIDE 3

Review: HMM Modeling

  • Acoustic Modeling using HMMs
  • Three types of HMM state output probabilities are used

  • Time domain: overlapping speech frames
  • Frequency domain: modeling the cepstral feature vectors

SLIDE 4

Review: HMM Modeling

  • Discrete HMM (DHMM): bj(vk)=P(ot=vk|st=j)

    – The observations are quantized into a number of symbols
    – The symbols are normally generated by a vector quantizer
    – With multiple codebooks (m: codebook index):

        b_j(v_k) = Σ_{m=1..M} c_jm · p(o_t = v_k, m | s_t = j),   with  Σ_{m=1..M} c_jm = 1

    (Figure: a left-to-right HMM)
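    A small illustrative sketch of the multiple-codebook discrete output probability above; the weights, the per-codebook distributions, and the observed symbol index are all assumed values, not taken from the slides.

      #include <stdio.h>

      #define M 2   /* number of codebooks                   */
      #define K 4   /* codebook size (number of VQ symbols)  */

      int main(void)
      {
          /* hypothetical parameters for a single state j */
          double c[M]    = { 0.6, 0.4 };                 /* codebook weights c_jm, sum to 1         */
          double p[M][K] = { { 0.10, 0.20, 0.30, 0.40 }, /* P(v_k | codebook m, state j)            */
                             { 0.25, 0.25, 0.25, 0.25 } };
          int k = 2;                                     /* VQ symbol observed in the current frame */

          double b = 0.0;                                /* b_j(v_k) = sum_m c_jm * p_m(v_k)        */
          for (int m = 0; m < M; m++)
              b += c[m] * p[m][k];

          printf("b_j(v_%d) = %f\n", k, b);
          return 0;
      }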

SLIDE 5

Review: HMM Modeling

  • Continuous HMM (CHMM)

    – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

        b_j(x_t) = Σ_{m=1..M} c_jm · b_jm(x_t)
                 = Σ_{m=1..M} c_jm · N(x_t; μ_jm, Σ_jm)
                 = Σ_{m=1..M} c_jm · [ 1 / ( (2π)^(L/2) |Σ_jm|^(1/2) ) ] · exp( −(1/2) (x_t − μ_jm)^T Σ_jm^(−1) (x_t − μ_jm) )

        with  Σ_{m=1..M} c_jm = 1
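    A small illustrative sketch of evaluating this state output density for one feature vector, restricted to diagonal covariances; the dimension, weights, means, variances, and the feature vector itself are assumed values.

      #include <stdio.h>
      #include <math.h>

      #define L 2   /* feature dimension   */
      #define M 2   /* number of mixtures  */

      /* density of a diagonal-covariance Gaussian N(x; mu, diag(var)) */
      static double gauss(const double *x, const double *mu, const double *var)
      {
          const double PI = 3.14159265358979;
          double logp = -0.5 * L * log(2.0 * PI);
          for (int d = 0; d < L; d++) {
              double diff = x[d] - mu[d];
              logp += -0.5 * log(var[d]) - 0.5 * diff * diff / var[d];
          }
          return exp(logp);
      }

      int main(void)
      {
          double c[M]      = { 0.7, 0.3 };                   /* mixture weights c_jm (sum to 1) */
          double mu[M][L]  = { { 0.0, 1.0 }, { 2.0, -1.0 } };
          double var[M][L] = { { 1.0, 0.5 }, { 2.0,  1.0 } };
          double x[L]      = { 0.3, 0.8 };                   /* one cepstral feature vector     */

          double b = 0.0;                                    /* b_j(x) = sum_m c_jm N(x; mu_jm, Sigma_jm) */
          for (int m = 0; m < M; m++)
              b += c[m] * gauss(x, mu[m], var[m]);

          printf("b_j(x) = %e\n", b);
          return 0;
      }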

SLIDE 6

Review: HMM Modeling

  • Semicontinuous or tied-mixture HMM (SCHMM)

    – The HMM state mixture density functions are tied together across all the models to form a set of shared kernels (shared Gaussians):

        b_j(x_t) = Σ_{k=1..L} b_j(k) · f(x_t | v_k) = Σ_{k=1..L} b_j(k) · N(x_t; μ_k, Σ_k)

    – With multiple codebooks:

        b_j(x_t) = Σ_{m=1..M} c_m Σ_{k=1..L_m} b_jm(k) · f_m(x_t | v_k,m) = Σ_{m=1..M} c_m Σ_{k=1..L_m} b_jm(k) · N(x_t; μ_k,m, Σ_k,m)
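    A small illustrative sketch of the tied-mixture idea (all numbers are made up): the shared Gaussian densities f(x|v_k) are evaluated once per frame, and each state contributes only its own weights b_j(k).

      #include <stdio.h>

      #define L 3   /* number of shared Gaussians (codebook size) */

      int main(void)
      {
          /* hypothetical values: f[k] = f(x | v_k) for the current frame,
             computed once and shared by every state in every model       */
          double f[L]  = { 1.2e-3, 4.5e-4, 9.0e-5 };
          double bj[L] = { 0.5, 0.3, 0.2 };    /* state j's tied-mixture weights b_j(k) */

          double b = 0.0;                      /* b_j(x) = sum_k b_j(k) * f(x | v_k)    */
          for (int k = 0; k < L; k++)
              b += bj[k] * f[k];

          printf("b_j(x) = %e\n", b);
          return 0;
      }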

SLIDE 7

Review: HMM Modeling

  • Comparison of Recognition Performance
SLIDE 8

Measures of Speech Recognition Performance

  • Evaluating the performance of speech recognition

systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures

  • There are typically three types of word recognition errors

– Substitution

  • An incorrect word was substituted for the correct word

– Deletion

  • A correct word was omitted in the recognized sentence

– Insertion

  • An extra word was added in the recognized sentence
  • How to determine the minimum error rate?
SLIDE 9

Measures of Speech Recognition Performance

  • Calculate the WER by aligning the correct word string

against the recognized word string

– A maximum substring matching problem – Can be handled by dynamic programming

  • Example:

    Correct   : “the effect is clear”
    Recognized: “effect is not clear”

    – Error analysis: one deletion and one insertion
    – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

        Word Error Rate      = 100% × (Sub. + Del. + Ins.) / (No. of words in the correct sentence) = (0 + 1 + 1)/4 = 50%
        Word Correction Rate = 100% × (matched words) / (No. of words in the correct sentence) = 3/4 = 75%
        Word Accuracy Rate   = 100% × (matched words − Ins.) / (No. of words in the correct sentence) = (3 − 1)/4 = 50%

    – WER + WAR = 100%; WER might be higher than 100%, and WAR might correspondingly be negative
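    The same example, restated in code; the counts are read directly off the alignment above (0 substitutions, 1 deletion, 1 insertion, 3 matched words, 4 reference words).

      #include <stdio.h>

      int main(void)
      {
          int N = 4, sub = 0, del = 1, ins = 1, hit = 3;

          double wer = 100.0 * (sub + del + ins) / N;  /* word error rate      */
          double wcr = 100.0 * hit / N;                /* word correction rate */
          double war = 100.0 * (hit - ins) / N;        /* word accuracy rate   */

          printf("WER = %.0f%%  WCR = %.0f%%  WAR = %.0f%%\n", wer, wcr, war);
          return 0;                                    /* prints 50%, 75%, 50% */
      }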

SLIDE 10

Measures of Speech Recognition Performance

  • A Dynamic Programming Algorithm (Textbook)

    – Notation: n denotes the word length of the recognized/test sentence and m denotes the word length of the correct/reference sentence; each grid point [i, j] stores the minimum word-error alignment between prefixes of the two sentences, built from four kinds of alignment steps (hit/match, substitution, insertion, deletion)

SLIDE 11

Measures of Speech Recognition Performance

  • Algorithm (by Berlin Chen)

    Step 1: Initialization:
      G[0][0] = 0;
      for i = 1, ..., n {                       // test
        G[i][0] = G[i-1][0] + 1;  B[i][0] = 1;  // Insertion (Horizontal Direction)
      }
      for j = 1, ..., m {                       // reference
        G[0][j] = G[0][j-1] + 1;  B[0][j] = 2;  // Deletion (Vertical Direction)
      }

    Step 2: Iteration:
      for i = 1, ..., n {                       // test
        for j = 1, ..., m {                     // reference
          G[i][j] = min { G[i-1][j]   + 1,                       // Insertion    (Horizontal Direction)
                          G[i][j-1]   + 1,                       // Deletion     (Vertical Direction)
                          G[i-1][j-1] + 1  (if LR[j] != LT[i]),  // Substitution (Diagonal Direction)
                          G[i-1][j-1]      (if LR[j] == LT[i]) } // Match        (Diagonal Direction)
          B[i][j] = 1 (Insertion), 2 (Deletion), 3 (Substitution), or 4 (Match), according to the direction chosen
        }
      }

    Step 3: Measure and Backtrace:
      Word Error Rate    = 100% × G[n][m] / m
      Word Accuracy Rate = 100% − Word Error Rate
      Optimal backtrace path: B[n][m] → ... → B[0][0]
        if B[i][j] == 1        print "LT[i]";       // Insertion, then go left
        else if B[i][j] == 2   print "LR[j]";       // Deletion, then go down
        else                   print "LR[j]/LT[i]"; // Hit/Match or Substitution, then go diagonally down

    Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SLIDE 12

Measures of Speech Recognition Performance

  • A Dynamic Programming Algorithm (HTK)

    (Figure: the DP alignment grid, with the recognized/test word sequence i = 1, ..., n along one axis and the correct/reference word sequence j = 1, ..., m along the other; a grid point (i, j) is reached from (i-1, j) by an insertion, from (i, j-1) by a deletion, and from (i-1, j-1) by a hit or a substitution)

    – Initialization

      grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
      grid[0][0].sub = grid[0][0].hit = 0;
      grid[0][0].dir = NIL;
      for (i = 1; i <= n; i++) {          /* test */
          grid[i][0] = grid[i-1][0];
          grid[i][0].dir = HOR;
          grid[i][0].score += InsPen;
          grid[i][0].ins++;
      }
      for (j = 1; j <= m; j++) {          /* reference */
          grid[0][j] = grid[0][j-1];
          grid[0][j].dir = VERT;
          grid[0][j].score += DelPen;
          grid[0][j].del++;
      }

SLIDE 13

Measures of Speech Recognition Performance

  • Program

      for (i = 1; i <= n; i++) {                 /* test */
          gridi = grid[i];  gridi1 = grid[i-1];
          for (j = 1; j <= m; j++) {             /* reference */
              h = gridi1[j].score + insPen;
              d = gridi1[j-1].score;
              if (lRef[j] != lTest[i]) d += subPen;
              v = gridi[j-1].score + delPen;
              if (d <= h && d <= v) {            /* DIAG = hit or sub */
                  gridi[j] = gridi1[j-1];
                  gridi[j].score = d;
                  gridi[j].dir = DIAG;
                  if (lRef[j] == lTest[i]) ++gridi[j].hit; else ++gridi[j].sub;
              } else if (h < v) {                /* HOR = ins */
                  gridi[j] = gridi1[j];
                  gridi[j].score = h;
                  gridi[j].dir = HOR;
                  ++gridi[j].ins;
              } else {                           /* VERT = del */
                  gridi[j] = gridi[j-1];
                  gridi[j].score = v;
                  gridi[j].dir = VERT;
                  ++gridi[j].del;
              }
          } /* for j */
      } /* for i */

  • Example 1 (HTK)

    Correct: “A C B C C”     Test: “B A B C”

    (Figure: the filled DP grid; each cell holds the running (Ins, Del, Sub, Hit) counts of the best alignment reaching it)

    Alignment 1: Ins B, Hit A, Del C, Hit B, Hit C, Del C  →  WER = 60%
    (there is still another optimal alignment!)

SLIDE 14

Measures of Speech Recognition Performance

  • Example 2

    Correct: “A C B C C”     Test: “B A A C”

    (Figure: the filled DP grid; each cell holds the running (Ins, Del, Sub, Hit) counts of the best alignment reaching it)

    Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C  →  WER = 80%
    Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C  →  WER = 80%
    Alignment 3: WER = 80%
    (three different optimal alignments)

    Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SLIDE 15

Measures of Speech Recognition Performance

  • Two common settings of different penalties for

substitution, deletion, and insertion errors

      /* HTK error penalties */
      subPen = 10;  delPen = 7;  insPen = 7;

      /* NIST error penalties */
      subPenNIST = 4;  delPenNIST = 3;  insPenNIST = 3;

SLIDE 16

Choice of Appropriate Units for HMMs

  • Issues for HMM Modeling units

    – Accurate: accurately represent the acoustic realization that appears in different contexts
    – Trainable: have enough data to estimate the parameters of the unit (or HMM model)
    – Generalizable: any new word can be derived from a predefined unit inventory for task-independent speech recognition

SLIDE 17

Choice of Appropriate Units for HMMs

  • Comparison of different units

    – Word:
      • Semantic meaning, capturing within-word coarticulation; can be accurately trained for small-vocabulary speech recognition, but not generalizable
    – Phone:
      • More trainable and generalizable, but less accurate
      • There are only about 50 context-independent phones in English and about 30 in Mandarin Chinese
      • Drawback: the realization of a phoneme is strongly affected by neighboring phonemes (e.g., /t s/ vs. /t r/)
    – Syllable:
      • A compromise between the word and phonetic models; syllables are larger than phones
      • There are only about 1,300 tone-dependent syllables in Chinese and about 50 in Japanese; however, there are over 30,000 syllables in English

    (Phones and syllables are subword units)

SLIDE 18

Choice of Appropriate Units for HMMs

  • Phonetic Structure of Mandarin Syllables

    – Syllables: 1,345;  Base-syllables: 408;  INITIALs: 21;  FINALs: 37;  Phone-like Units/Phones: 33;  Tones: 4+1

SLIDE 19

Variability in the Speech Signals

    (Figure) Sources of variability: linguistic variability, intra-speaker variability, inter-speaker variability, variability caused by the context, variability caused by the environment
    Corresponding modeling approaches: robustness enhancement, speaker-independency / speaker-adaptation / speaker-dependency, context-dependent acoustic modeling, pronunciation variation

SLIDE 20

Variability in the Speech Signals

  • Context Variability

– Context variability at word/sentence level

  • E.g., “Mr. Wright should write to Ms. Wright right away about

his Ford or four door Honda”

  • Same pronunciation but different meaning (Wright , write , right)
  • Phonetically identical and semantically relevant (Ford or, four

door)

– Context variability at phonetic level

  • The acoustic realization of phoneme /ee/ in the words peat and wheel depends on its left and right context
  • Pause or intonation information is needed; the effect is even more important in fast speech or spontaneous conversations, since many phonemes are not fully realized!

SLIDE 21

Variability in the Speech Signals

  • Style Variability (also including intra-speaker and linguistic variability)

    – Isolated speech recognition

  • Users have to pause between each word (a clear boundary

between words)

  • Errors such as “Ford or” and “four door” can be eliminated
  • But unnatural to most people

– Continuous speech recognition

  • Casual, spontaneous, and conversational
  • Higher speaking rate and co-articulation effects
  • Emotional changes also introduce more significant variations

    (Figure: statistics of the speaking rates of the broadcast news speech collected in Taiwan)

SLIDE 22

Variability in the Speech Signals

  • Speaker Variability

– Interspeaker

  • Vocal tract size, length and width of the neck and a range of

physical characteristics

  • E.g., gender, age, dialect, health, education, and personal style

– Intraspeaker

  • The same speaker is often unable to precisely produce the

same utterance

  • The shape of the vocal tract movement and the rate of delivery may vary from utterance to utterance

    – Issues for acoustic modeling

  • Speaker-dependent (SD), speaker-independent (SI)

and speaker-adaptive (SA) modeling

  • Typically an SD system can reduce WER by more than 30% as

compared with a comparable SI one

SLIDE 23

Variability in the Speech Signals

  • Environment Variability

    – The world we live in is full of sounds of varying loudness from different sources
    – Speech recognition in hands-free or mobile environments remains one of the most severe challenges
      • The spectrum of noises varies significantly
    – Noise may also be present from the input device itself, such as microphone and A/D interface noises
    – We can reduce the error rates by using multi-style training or adaptive techniques
    – Environment variability remains one of the most severe challenges facing today’s state-of-the-art speech systems

SLIDE 24

Context Dependency

  • Review: Phone and Phoneme

    – In speech science, the term phoneme is used to denote any of the minimal units of speech sound in a language that can serve to distinguish one word from another
    – The term phone is used to denote a phoneme’s acoustic realization
    – E.g., the English phoneme /t/ has two very different acoustic realizations in the words sat and meter
      • We had better treat them as two different phones when building a spoken language system

SLIDE 25

Context Dependency

  • Why Context Dependency

    – If we make units context dependent, we can significantly improve the recognition accuracy, provided there are enough training data for parameter estimation
    – A context usually refers to the immediate left and/or right neighboring phones
    – Context-dependent (CD) phonemes have been widely used for LVCSR systems

SLIDE 26

Context Dependency

  • Triphone (Intra-word triphone)

– A triphone model is a phonetic model that takes into consideration both the left and right neighboring phones

  • It captures the most important coarticulatory effects

    – Two phones having the same identity but different left and right contexts are considered different triphones
    – Challenging issue: the need to balance trainability and accuracy with a number of parameter-sharing techniques

    Allophones: the different realizations of a phoneme are called allophones → triphones are examples of allophones

SLIDE 27

Context Dependency

  • Modeling inter-word context-dependent phones (like

triphones) is complicated

– Although the juncture effect on word boundaries is one of the most serious coarticulation phenomena in continuous speech recognition

  • E.g., speech /s p iy ch/: the realizations of /s/ and /ch/ depend on the preceding and following words in actual sentences

– Should be taken into consideration with the decoding/search scheme adopted

  • Even with the same left/right context, a phone may have significantly different realizations at different word positions

    – E.g., that rock: /t/ → extinct (deleted);  theatrical: /t/ → /ch/

SLIDE 28

Context Dependency

  • Stress information for context dependency

– Word-level stress (free stress)

  • The stress information: longer duration, higher pitch and more

intensity for stressed vowels

  • E.g., import (n) vs. import (v), content (n) vs. content (v)

    – Sentence-level stress (including contrastive and emphatic stress)

      • Sentence-level stress is very hard to model without incorporating semantic and pragmatic knowledge
      • Contrastive: e.g., “I said import records not export”
      • Emphatic: e.g., “I did have dinner”

Italy Italian

SLIDE 29

Clustered Acoustic-Phonetic Units

  • Triphone modeling assumes that every triphone context is different; actually, many phones have similar effects on the neighboring phones

– /b/ and /p/ (labial stops) (or, /r/ and /w/ (liquids)) have similar effects on the following vowel

  • It is desirable to find instances of similar contexts and

merge them

    – A much more manageable number of models that can be better trained (e.g., /r/+/iy/ and /w/+/iy/)

SLIDE 30

Clustered Acoustic-Phonetic Units

  • Model-based clustering
  • State-based clustering (state-tying)

– Keep the dissimilar states of two models apart while the other corresponding states are merged

SLIDE 31

Clustered Acoustic-Phonetic Units

  • State-tying of triphones
SLIDE 32

Clustered Acoustic-Phonetic Units

  • Two key issues for CD phonetic or subphonetic modeling

– Tying the phones with similar contexts to improve trainability and efficiency

  • Enable better parameter sharing and smoothing

– Mapping the unseen triphones (in the test) into appropriately trained triphones is important

  • Because the number of possible triphones could be very large
  • E.g., English has over 100,000 triphones
SLIDE 33

Clustered Acoustic-Phonetic Units

  • Microsoft’s approach - State-based clustering

    – Apply clustering to the state-dependent output distributions across different phonetic models
    – Each cluster represents a set of similar HMM states and is called a senone
    – A subword model is composed of a sequence of senones

In this example, the tree can be applied to the second state of any /k/ triphone

SLIDE 34

Clustered Acoustic-Phonetic Units

  • Some example questions used in building senone trees
SLIDE 35

Clustered Acoustic-Phonetic Units

  • Comparison of recognition performance for different

acoustic modeling

SLIDE 36

Pronunciation Variation

  • We need to provide alternative pronunciations for words

that may have very different pronunciations

– In continuous speech recognition, we must handle the modification of interword pronunciations and reduced sounds

  • Variation kinds

– Co-articulation (Assimilation) “did you” /d ih jh y ah/, “set you” /s eh ch er/

  • Assimilation: a change in a segment to make it more like a neighboring segment

    – Deletion

  • /t/ and /d/ are often deleted before a consonant
  • A distinction can be drawn between

    – Inter-speaker variation (social)
    – Intra-speaker variation (stylistic)

ㄊㄧㄢ ㄐㄧㄣ ㄐㄧㄢ

今天 兼、間

SLIDE 37

Pronunciation Variation

  • Pronunciation Network (a probabilistic finite state machine)
  • Examples:

    – E.g., the word “that” appears 328 times in one corpus, with 117 different pronunciations among those 328 tokens (only 11% of the tokens use the most frequent pronunciation) (Greenberg, 1998)
    – Cheating experiments show that big performance improvements are achieved if the tuned pronunciations are applied to those in the test data (e.g., Switchboard WER goes from 40% to 8%) (McAllaster et al., 1998)

SLIDE 38

Pronunciation Variation

  • Adaptation of Pronunciations

– Dialect-specific pronunciations – Native vs. non-native pronunciations – Rate-specific pronunciations

  • Side Effect

– Adding more and more variants to the pronunciation lexicon increases size and confusion of the vocabulary

  • Leads to increased ASR WER
SLIDE 39

Characteristics of Mandarin Chinese

  • Four levels of linguistic units
  • A monosyllabic-structure language

– All characters are monosyllabic

  • Most characters are morphemes (詞素)
  • A word is composed of one to several characters
  • Homophones

– Different characters sharing the same syllable

    (Figure: four levels of linguistic units, Initial-Final → Syllable → Character → Word, ranging from phonological significance to semantic significance)

from Ming-yi Tsai

SLIDE 40

Characteristics of Mandarin Chinese

  • Chinese syllable structure

from Ming-yi Tsai

SLIDE 41

Characteristics of Mandarin Chinese

  • Sub-syllable HMM Modeling

– INITIALs

SLIDE 42

Sub-Syllable HMM Modeling

  • Sub-syllable HMM Modeling

– FINALs

    (io (ㄧㄛ), e.g., for 唷, was ignored here)

SLIDE 43

Classification and Regression Trees (CART)

  • A CART is a binary decision tree, with splitting questions

attached to each node

    – Acts like a rule-based system where the classification is carried out by a sequence of decision rules

  • CART provides an easy representation that interprets and predicts the structure of a set of data

    – Handles data with high dimensionality, mixed data types and nonstandard data structures

  • CART also provides an automatic and data-driven framework to construct the decision process based on objective criteria, not subjective criteria

    – E.g., the choice and order of rules

  • CART is a kind of clustering/classification algorithm
SLIDE 44

Classification and Regression Trees

  • Example: height classification

– Assign a person to one of the following five height classes

    T: tall,  t: medium-tall,  M: medium,  s: medium-short,  S: short

SLIDE 45

Classification and Regression Trees

  • Example: height classification (cont.)

    – We can easily predict the height class for any new person with all the measured data (age, occupation, milk-drinking, etc.) but no height information, by traversing the binary tree (based on a set of questions)
    – “No”: right branch, “Yes”: left branch
    – When reaching a leaf node, we can use its attached label as the height class for the new person
    – We can also use the average height in the leaf node to predict the height of the new person

SLIDE 46

CART Construction using Training Samples

  • Steps
  • 1. First, find a set of questions regarding the measured variable
  • E.g., “Is age>12?”, “Is gender=male?”, etc.
  • 2. Then, place all the training samples in the root of the initial tree
  • 3. Choose the best question from the question set to split the root

into two nodes (need some measurement !)

  • 4. Recursively split the most promising node with the best question until the right-sized tree is obtained

  • How to choose the best question?

    – E.g., reduce the uncertainty of the event being decided upon, i.e., find the question which gives the greatest entropy reduction

SLIDE 47

CART Construction using Training Samples

  • Splitting Criteria (for discrete pdf)

– How to find the best question for a node split ?

  • I.e., find the best split for the data samples of the node

    – Assume the training samples have a probability (density) function at each node t

      • E.g., P(ω_i | t) is the percentage of data samples for class i at node t, and Σ_i P(ω_i | t) = 1

SLIDE 48

CART Construction using Training Samples

  • Splitting Criteria (for discrete pdf)

    – Define the weighted entropy for any tree node t:

        H̄_t(Y) = H_t(Y) · P(t),    where  H_t(Y) = − Σ_i P(ω_i | t) · log P(ω_i | t)

      • Y is the random variable for the classification decision
      • P(t) is the prior probability of visiting node t (the ratio of the number of samples in node t to the total number of samples)
      • Entropy: the average amount of information

SLIDE 49

CART Construction using Training Samples

  • Splitting Criteria (for discrete pdf )

    – Entropy reduction for a question q used to split a node t into nodes l and r:

        ΔH_t(q) = H_t(Y) − ( H̄_l(Y) + H̄_r(Y) ) = H_t(Y) − H_t(Y | q)

      • Pick the question with the greatest entropy reduction:

        q* = argmax_q [ ΔH_t(q) ]

SLIDE 50

Review: Fundamentals in Information Theory

  • Three interpretations for quantity of information
  • 1. The amount of uncertainty before seeing an event
  • 2. The amount of surprise when seeing an event
  • 3. The amount of information after seeing an event
  • The definition of information:

      I(x_i) = log ( 1 / P(x_i) ) = − log P(x_i)

    – P(x_i) is the probability of the event x_i, where S = { x_1, x_2, ..., x_i, ... }

  • Entropy: the average amount of information

      H(X) = E[ I(X) ] = E[ − log P(x_i) ] = − Σ_i P(x_i) · log P(x_i)

    – It has its maximum value when the probability (mass) function is a uniform distribution

SLIDE 51

CART Construction using Training Samples

  • Splitting Criteria (for discrete pdf)

    – Example

      Node t:  X = {x_i} = {1, 1, 3, 3, 8, 8, 9, 9},  P(x=1) = P(x=3) = P(x=8) = P(x=9) = 1/4
        H = −4 · (1/4) log2(1/4) = 2

      Split 1:  Y = {1, 1},  P(y=1) = 1
                Z = {3, 3, 8, 8, 9, 9},  P(z=3) = P(z=8) = P(z=9) = 1/3
        H_l = −1 · (1) log2(1) = 0          H̄_l = H_l · P(Node_l) = 0 · (1/4) = 0
        H_r = −3 · (1/3) log2(1/3) ≈ 1.6    H̄_r = H_r · P(Node_r) = 1.6 · (3/4) = 1.2
        H_1 = H̄_l + H̄_r = 1.2

      Split 2:  Y = {1, 1, 3, 3},  P(y=1) = P(y=3) = 1/2
                Z = {8, 8, 9, 9},  P(z=8) = P(z=9) = 1/2
        H_l = −2 · (1/2) log2(1/2) = 1      H̄_l = H_l · P(Node_l) = 1 · (1/2) = 1/2
        H_r = −2 · (1/2) log2(1/2) = 1      H̄_r = H_r · P(Node_r) = 1 · (1/2) = 1/2
        H_2 = H̄_l + H̄_r = 1.0
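    A small sketch that recomputes the two candidate splits of this example; the sample values come from the example, while the helper function is my own.

      #include <stdio.h>
      #include <math.h>

      /* entropy (in bits) of the class labels in x[0..n-1] */
      static double entropy(const int *x, int n)
      {
          double h = 0.0;
          for (int i = 0; i < n; i++) {
              int count = 0, first = 1;
              for (int j = 0; j < n; j++) {
                  if (x[j] == x[i]) { count++; if (j < i) first = 0; }
              }
              if (first) {                       /* count each distinct label once */
                  double p = (double)count / n;
                  h -= p * log2(p);
              }
          }
          return h;
      }

      int main(void)
      {
          int parent[] = { 1, 1, 3, 3, 8, 8, 9, 9 };
          int l1[] = { 1, 1 },       r1[] = { 3, 3, 8, 8, 9, 9 };  /* split 1 */
          int l2[] = { 1, 1, 3, 3 }, r2[] = { 8, 8, 9, 9 };        /* split 2 */

          double H  = entropy(parent, 8);                                   /* 2.0 bits   */
          double H1 = (2.0/8) * entropy(l1, 2) + (6.0/8) * entropy(r1, 6);  /* about 1.2  */
          double H2 = (4.0/8) * entropy(l2, 4) + (4.0/8) * entropy(r2, 4);  /* 1.0        */

          printf("entropy reduction: split 1 = %.2f, split 2 = %.2f  (pick split 2)\n",
                 H - H1, H - H2);
          return 0;
      }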

SLIDE 52

CART Construction using Training Samples

  • Entropy for a tree

    – The sum of the weighted entropies over all terminal nodes:

        H(T) = Σ_{t is terminal} H̄_t(Y)

    – It can be shown that the above tree-growing (splitting) procedure repeatedly reduces the entropy of the tree
    – The resulting tree has a better classification power

SLIDE 53

CART Construction using Training Samples

  • Splitting Criteria (for continuous pdf)

    – The likelihood gain is often used instead of the entropy measure
    – Suppose one split divides the data X into two groups X_1 and X_2, which can be respectively represented by two Gaussian distributions N_1(μ_1, Σ_1) and N_2(μ_2, Σ_2):

        L_1(N_1; X_1) = Σ_{x ∈ X_1} log N_1(x; μ_1, Σ_1)
        L_2(N_2; X_2) = Σ_{x ∈ X_2} log N_2(x; μ_2, Σ_2)

    – Log-likelihood gain at node t for question q:

        ΔL_t(q) = L_1(N_1; X_1) + L_2(N_2; X_2) − L(N; X)

      which can be written in closed form in terms of log|Σ|, log|Σ_1|, log|Σ_2| and the sample counts a and b of X_1 and X_2

    See textbook pp. 179-180 and complete the derivation (due 12/9)
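    A 1-D illustrative sketch of the likelihood gain computed directly from its definition ΔL_t(q) = L_1 + L_2 − L; the samples and the split are made up, and the closed form in terms of the log-determinants and the counts a, b is the exercise above.

      #include <stdio.h>
      #include <math.h>

      /* log-likelihood of x[0..n-1] under the ML-fitted 1-D Gaussian */
      static double gauss_loglik(const double *x, int n)
      {
          const double PI = 3.14159265358979;
          double mean = 0.0, var = 0.0;
          for (int i = 0; i < n; i++) mean += x[i];
          mean /= n;
          for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
          var /= n;

          double ll = 0.0;
          for (int i = 0; i < n; i++)
              ll += -0.5 * log(2.0 * PI * var)
                    - 0.5 * (x[i] - mean) * (x[i] - mean) / var;
          return ll;
      }

      int main(void)
      {
          /* hypothetical samples at node t; one candidate question splits
             them into X1 (first 4 samples, a = 4) and X2 (last 4, b = 4) */
          double x[8] = { 0.1, 0.3, 0.2, 0.4, 2.1, 1.9, 2.2, 2.0 };

          double L  = gauss_loglik(x, 8);
          double L1 = gauss_loglik(x, 4);
          double L2 = gauss_loglik(x + 4, 4);

          printf("Delta L = L1 + L2 - L = %.3f\n", L1 + L2 - L);
          return 0;
      }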