  1. Minimum Bayes-Risk Methods in Automatic Speech Recognition
     Vaibhava Goel – IBM
     William Byrne – Johns Hopkins University
     Pattern Recognition in Speech and Language Processing – Chapter 2

  2. Outline
     • Minimum Bayes-Risk Classification Framework
       – Likelihood Ratio Based Hypothesis Testing
       – Maximum A-Posteriori Probability Classification
       – Previous Studies of Application Sensitive ASR
     • Practical MBR Procedures for ASR
       – Summation over Hidden State Sequences
       – MBR Recognition with N-best Lists
       – MBR Recognition with Lattices

  3. Outline
     • Segmental MBR Procedures
       – Segmental Voting
       – ROVER
       – e-ROVER
     • Experimental Results
       – Parameter Tuning within the MBR Classification Rule
       – Utterance Level MBR Word and Keyword Recognition
       – ROVER and e-ROVER for Multilingual ASR
     • Summary

  4. • Minimum Bayes-Risk Classification Framework
       – Likelihood Ratio Based Hypothesis Testing
       – Maximum A-Posteriori Probability Classification
       – Previous Studies of Application Sensitive ASR
     • Practical MBR Procedures for ASR
       – Summation over Hidden State Sequences
       – MBR Recognition with N-best Lists
       – MBR Recognition with Lattices

  5. Minimum Bayes-Risk Classification Framework
     • Definitions:
       – $A$: acoustic observation sequence
       – $W$: word string
       – $\mathcal{W}_h^A$: the hypothesis space of the observation $A$
       – $\delta : A \rightarrow \mathcal{W}_h^A$: ASR classifier
       – $l(W, W')$: loss function, where $W'$ is a mistranscription of $W$
       – $P(W, A)$: true distribution of speech and language

  6. Minimum Bayes-Risk Classification Framework
     How do we measure classifier performance? Using the Bayes risk:
     $$E_{P(W,A)}[\,l(W, \delta(A))\,] = \sum_{W} \sum_{A} l(W, \delta(A))\, P(W, A) \quad (2.1)$$
     The classifier is chosen to minimize the Bayes risk:
     $$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W} l(W, W')\, P(W \mid A) \quad (2.2)$$
     (Ideally we would use $\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} l(W_c, W')$, where $W_c$ is the correct transcription of $A$, but $W_c$ is unknown.)
     Let $\mathcal{W}_e^A = \{\, W \in \mathcal{W}^A \mid P(W \mid A) > 0 \,\}$ be the subset of $\mathcal{W}^A$ with nonzero posterior probability. Equation 2.2 can then be rewritten as
     $$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W \mid A) \quad (2.4)$$
     Letting $S(W') = \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W \mid A)$, this is simply $\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} S(W')$.
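To make Equation 2.4 concrete, here is a minimal Python sketch of the MBR rule over small explicit spaces. The hypotheses, posteriors, and per-word loss are invented for illustration; they are not from the chapter.

```python
# A minimal sketch of the MBR rule of Equation 2.4 over small explicit
# spaces. The hypotheses, posteriors, and per-word loss are invented.

def mbr_decode(hypothesis_space, evidence, loss):
    """Return the W' in hypothesis_space minimizing
    S(W') = sum over W in the evidence space of loss(W, W') * P(W | A).

    evidence: dict mapping word string W -> posterior P(W | A) > 0
    loss: function (W, W') -> non-negative cost
    """
    def expected_loss(w_prime):
        return sum(p * loss(w, w_prime) for w, p in evidence.items())
    return min(hypothesis_space, key=expected_loss)

# Toy evidence space; the hypothesis space is restricted to the same
# strings (all equal length, so a per-position word loss is safe here).
evidence = {"x y z": 0.40, "a b c": 0.35, "a b d": 0.25}
per_word = lambda w, v: sum(a != b for a, b in zip(w.split(), v.split()))
print(mbr_decode(list(evidence), evidence, per_word))  # -> "a b c"
```

Note the behavior the loss function buys: MAP would pick "x y z" (highest posterior), while MBR prefers "a b c" because it is close to another probable hypothesis and so has lower expected loss.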

  7. Minimum Bayes-Risk Classification Framework
     Since the word strings in $\mathcal{W}_e^A$ serve as the evidence used by the MBR classifier, $\mathcal{W}_e^A$ is referred to as the evidence space and $P(W \mid A)$ as the evidence distribution.
     How is the loss function defined? Two choices lead to two classical methods:
     • loss function $l_{LRT}(X, Y)$ ⇒ classifier $\delta_{LRT}(A)$: likelihood ratio based hypothesis testing
     • loss function $l_{0/1}(X, Y)$ ⇒ classifier $\delta_{MAP}(A)$: maximum a-posteriori classification

  8. Likelihood Ratio Based Hypothesis Testing
     If $\mathcal{W}_e^A = \{H_n, H_a\}$ and $\mathcal{W}_h^A = \{H_n, H_a\}$, define
     $$l_{LRT}(X, Y) = \begin{cases} 0 & \text{if } X = H_n,\, Y = H_n \\ t_1 & \text{if } X = H_a,\, Y = H_n \\ t_2 & \text{if } X = H_n,\, Y = H_a \\ 0 & \text{if } X = H_a,\, Y = H_a \end{cases}$$
     Then
     $$\begin{aligned}
     \delta(A) &= \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W \mid A) \\
     &= \arg\min_{W' \in \{H_n, H_a\}} \big[\, l(H_n, W')\, P(H_n \mid A) + l(H_a, W')\, P(H_a \mid A) \,\big] \\
     &= \arg\min \big\{\, l(H_n, H_n) P(H_n \mid A) + l(H_a, H_n) P(H_a \mid A),\;\; l(H_n, H_a) P(H_n \mid A) + l(H_a, H_a) P(H_a \mid A) \,\big\} \\
     &= \arg\min \big\{\, t_1 P(H_a \mid A),\;\; t_2 P(H_n \mid A) \,\big\} \\
     &= \arg\min \big\{\, t_1 P(A \mid H_a) P(H_a),\;\; t_2 P(A \mid H_n) P(H_n) \,\big\} \\
     &= \begin{cases} H_n & \text{if } t_2\, P(A \mid H_n)\, P(H_n) > t_1\, P(A \mid H_a)\, P(H_a) \\ H_a & \text{otherwise} \end{cases}
     = \begin{cases} H_n & \text{if } \dfrac{P(A \mid H_n)}{P(A \mid H_a)} > \dfrac{t_1 P(H_a)}{t_2 P(H_n)} \\ H_a & \text{otherwise} \end{cases}
     \end{aligned}$$

  9. Likelihood Ratio Based Hypothesis Testing
     Writing $t = \dfrac{t_1 P(H_a)}{t_2 P(H_n)}$, the classifier becomes
     $$\delta_{LRT}(A) = \begin{cases} H_n & \text{if } \dfrac{P(A \mid H_n)}{P(A \mid H_a)} > t \\ H_a & \text{otherwise} \end{cases} \quad (2.6)$$
     where $H_n$ is the null class and $H_a$ is the alternative class. The threshold $t$ is set in an application-specific manner; it determines the balance between false rejection and false acceptance.
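A minimal sketch of the decision rule in Equation 2.6, working in log space. The log-likelihood values and the threshold below are hypothetical; in a real system they would come from the null- and alternative-class acoustic models and from tuning on held-out data.

```python
import math

# Sketch of the LRT classifier of Equation 2.6. Working in log space
# avoids numerical underflow on long observation sequences.

def lrt_classify(log_lik_null, log_lik_alt, t):
    """Return 'H_n' if P(A|H_n) / P(A|H_a) > t, else 'H_a'."""
    return "H_n" if log_lik_null - log_lik_alt > math.log(t) else "H_a"

# Raising t makes the test more reluctant to accept H_n, trading
# false acceptances for false rejections.
print(lrt_classify(-100.0, -103.0, t=10.0))  # ratio = e**3 ~ 20 > 10 -> H_n
```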

  10. Maximum A-Posteriori Probability Classification
      Define
      $$l_{0/1}(W, W') = \begin{cases} 1 & \text{if } W \neq W' \\ 0 & \text{otherwise} \end{cases}$$
      Then
      $$\begin{aligned}
      \delta(A) &= \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W \mid A) \\
      &= \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \neq W'} P(W \mid A) \\
      &= \arg\min_{W' \in \mathcal{W}_h^A} \big( 1 - P(W' \mid A) \big) \\
      &= \arg\max_{W' \in \mathcal{W}_h^A} P(W' \mid A)
      \end{aligned}$$
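A quick self-contained check of this reduction, with invented posteriors: plugging the 0/1 loss into the MBR rule yields exactly the MAP choice.

```python
# Plugging the 0/1 loss into the MBR rule recovers MAP:
# argmin_{W'} sum_{W != W'} P(W|A) = argmax_{W'} P(W'|A).
# The posteriors below are invented for illustration.

posteriors = {"a b c": 0.5, "a b d": 0.3, "x y z": 0.2}
zero_one = lambda w, v: 0.0 if w == v else 1.0

mbr_choice = min(posteriors,
                 key=lambda v: sum(p * zero_one(w, v)
                                   for w, p in posteriors.items()))
map_choice = max(posteriors, key=posteriors.get)
assert mbr_choice == map_choice == "a b c"
```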

  11. Previous Studies of Application Sensitive ASR
      • The use of risk minimization in automatic speech recognition has not been extensive.
      • Early investigations into minimum Bayes-risk training criteria for speech recognizers were performed by Nadas.
      • However, the focus of this chapter is on minimum-risk classification rather than estimation.

  12. Previous Studies of Application Sensitive ASR
      • Stolcke et al. proposed an approximation to a minimum Bayes-risk classifier for generating minimum word error rate hypotheses from recognition N-best lists.
      • Other researchers have proposed posterior probability and confidence based hypothesis selection strategies for word error rate reduction.

  13. • Minimum Bayes-Risk Classification Framework
        – Likelihood Ratio Based Hypothesis Testing
        – Maximum A-Posteriori Probability Classification
        – Previous Studies of Application Sensitive ASR
      • Practical MBR Procedures for ASR
        – Summation over Hidden State Sequences
        – MBR Recognition with N-best Lists
        – MBR Recognition with Lattices

  14. Practical MBR Procedures for ASR
      • Why is the MBR classifier difficult to implement?
        – The evidence and hypothesis spaces in Equation 2.4 tend to be quite large.
        – The problem of large spaces is worsened by the fact that an ASR system often has to process many consecutive utterances.
        – While there are efficient dynamic programming techniques for the MAP recognizer, such methods are not yet available for an MBR recognizer under an arbitrary loss function.

  15. Practical MBR Procedures for ASR
      • How is it implemented?
        – Two implementations:
          • N-best list rescoring (a sketch follows below)
          • Search over a recognition lattice
        – Segment long acoustic data into sentence or phrase length utterances.
        – Restrict the evidence and hypothesis spaces to manageable sets of word strings.
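As a sketch of the N-best rescoring procedure named above: both spaces are restricted to the same N-best list, and the loss is the word-level Levenshtein distance, so minimizing expected loss directly targets word error rate. The N-best strings and posteriors below are invented for illustration.

```python
# Sketch of MBR recognition with N-best lists: the evidence and
# hypothesis spaces are both the N-best list, and the loss is the
# word-level Levenshtein distance, so the expected loss of a
# hypothesis approximates its expected word error count.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution
        prev = cur
    return prev[-1]

def nbest_mbr(nbest):
    """nbest: list of (word string, posterior P(W|A)) pairs."""
    return min((hyp for hyp, _ in nbest),
               key=lambda h: sum(p * edit_distance(w, h) for w, p in nbest))

nbest = [("show me the flights", 0.35),
         ("show me the flight", 0.30),
         ("show me their flights", 0.20),
         ("show these flights", 0.15)]
print(nbest_mbr(nbest))  # -> "show me the flights"
```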

  16. Summation over Hidden State Sequences
      • A computational issue associated with the use of HMMs in the evidence distribution must be addressed.
      • How is the true distribution obtained?
      $$P(W \mid A) = \frac{P(W)\, P(A \mid W)}{P(A)} \quad (2.12)$$
      Here $P(W)$ is approximated using a language model, usually a Markov chain based N-gram model, and $P(A \mid W)$ is approximated using an HMM called the acoustic model.
      Let $S$ be the set of all states in the acoustic HMM $P(A \mid W)$, and let $\chi$ denote the set of all possible state sequences that could generate $A$. The probability $P(A \mid W)$ is computed as
      $$P(A \mid W) = \sum_{X \in \chi} P(A, X \mid W) = \sum_{X \in \chi} P(X \mid W)\, P(A \mid X, W) \quad (2.13)$$
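Equation 2.13 is exactly what the forward algorithm computes, in $O(T \cdot |S|^2)$ time rather than by enumerating the $|S|^T$ state sequences. A minimal NumPy sketch, assuming a toy two-state discrete-output HMM (the parameters are invented; a real acoustic model emits continuous feature vectors):

```python
import numpy as np

# Forward algorithm: computes the sum in Equation 2.13 over all state
# sequences by dynamic programming. The tiny HMM below is invented.

def forward_prob(pi, trans, emit, obs):
    """pi: (S,) initial state probs; trans: (S, S) transition matrix;
    emit: (S, V) emission probs over a discrete alphabet;
    obs: sequence of observation indices. Returns P(A | W)."""
    alpha = pi * emit[:, obs[0]]           # alpha_1(s) = pi_s * b_s(o_1)
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]  # sum over predecessors
    return float(alpha.sum())              # sum over final states

pi = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit = np.array([[0.5, 0.4, 0.1],
                 [0.1, 0.3, 0.6]])
print(forward_prob(pi, trans, emit, obs=[0, 1, 2]))
```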

  17. Summation over Hidden State Sequences
      The summation over all possible hidden state sequences is too expensive. A computationally feasible alternative is to modify Equation 2.4 as
      $$\begin{aligned}
      \delta(A) &= \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W \mid A) \\
      &= \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, \frac{P(W) \sum_{X \in \chi} P(X \mid W)\, P(A \mid X, W)}{P(A)} \\
      &= \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} \sum_{X \in \chi} l(W, W')\, P(W, X, A) \\
      &\approx \arg\min_{(W', X') \in \mathcal{W}_h^A \times \chi^A} \sum_{(W, X) \in \mathcal{W}_e^A \times \chi^A} l\big((W, X), (W', X')\big)\, P(W, X, A)
      \end{aligned}$$
      where $\chi^A$ is a sparse sampling of the most likely state sequences in $\chi$.
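One plausible realization of the sparse set $\chi^A$, not necessarily the chapter's exact construction, is to keep only the single most likely state sequence, found by the Viterbi algorithm. Structurally this is the forward recursion above with a max replacing the sum at each step:

```python
import numpy as np

# Viterbi algorithm: finds the single most likely state sequence,
# a minimal (size-1) sparse sampling of chi. Same toy HMM as above.

def viterbi_path(pi, trans, emit, obs):
    """Return (best state sequence, its joint probability with obs)."""
    T, S = len(obs), len(pi)
    delta = pi * emit[:, obs[0]]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] * trans              # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)              # best predecessor of j
        delta = scores.max(axis=0) * emit[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # trace back pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

pi = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi_path(pi, trans, emit, obs=[0, 1, 2]))
```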

  18. Summation over Hidden State Sequences
      For convenience we write
      $W$ rather than $(W, X)$,
      $\mathcal{W}_h^A$ rather than $\mathcal{W}_h^A \times \chi^A$, and
      $\mathcal{W}_e^A$ rather than $\mathcal{W}_e^A \times \chi^A$,
      so that we have
      $$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W, A) \quad (2.15)$$
