EE E6820: Speech & Audio Processing & Recognition, Lecture 10: ASR: Sequence Recognition



  1. EE E6820: Speech & Audio Processing & Recognition
     Lecture 10: ASR: Sequence Recognition
     1 Signal template matching
     2 Statistical sequence recognition
     3 Acoustic modeling
     4 The Hidden Markov Model (HMM)
     Dan Ellis <dpwe@ee.columbia.edu>
     http://www.ee.columbia.edu/~dpwe/e6820/

  2. Signal template matching 1
     • Framewise comparison of the unknown word and stored templates:
       [Figure: frame-by-frame distance matrix; test word frames (time/frames) on the x-axis vs. reference templates ONE, TWO, THREE, FOUR, FIVE stacked on the y-axis]
       - distance metric?
       - comparison between templates?
       - constraints?
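A minimal sketch of the framewise comparison: Euclidean distances between every test frame and every reference-template frame. The array names, shapes, and the random toy data are illustrative assumptions, not from the slides.

```python
import numpy as np

def frame_distance_matrix(test, ref):
    """Euclidean distance between every test frame and every reference frame.

    test: (n_test_frames, n_dims) feature vectors of the unknown word
    ref:  (n_ref_frames, n_dims) feature vectors of one stored template
    returns d with d[i, j] = ||test[i] - ref[j]||
    """
    diff = test[:, None, :] - ref[None, :, :]      # (n_test, n_ref, n_dims)
    return np.sqrt((diff ** 2).sum(axis=-1))       # (n_test, n_ref)

# toy usage: 50 test frames vs. a 40-frame template, 13-dim features
d = frame_distance_matrix(np.random.randn(50, 13), np.random.randn(40, 13))
```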

  3. Dynamic Time Warp (DTW)
     • Find the lowest-cost constrained path:
       - matrix d(i,j) of distances between input frame f_i and reference frame r_j
       - allowable predecessors & transition costs T_xy
     • Lowest cost to reach (i,j):
       D(i,j) = d(i,j) + min { D(i-1,j) + T_10,
                               D(i,j-1) + T_01,
                               D(i-1,j-1) + T_11 }
       (local match cost plus best predecessor, including the transition cost)
     • Best path recovered via traceback from the final state
       - have to store the best predecessor for (almost) every (i,j)
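A compact sketch of the recurrence above with traceback. Setting T_10 = T_01 = T_11 = 1 is an illustrative choice of warp penalties, not a value from the slides.

```python
import numpy as np

def dtw(d, T10=1.0, T01=1.0, T11=1.0):
    """Dynamic time warp over a local-distance matrix d[i, j].

    d[i, j]: distance between input frame i and reference frame j.
    Returns total path cost and the (i, j) path found by traceback.
    """
    n, m = d.shape
    D = np.full((n, m), np.inf)
    pred = np.zeros((n, m, 2), dtype=int)
    D[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # candidate predecessors with their transition costs
            cands = []
            if i > 0:
                cands.append((D[i - 1, j] + T10, (i - 1, j)))
            if j > 0:
                cands.append((D[i, j - 1] + T01, (i, j - 1)))
            if i > 0 and j > 0:
                cands.append((D[i - 1, j - 1] + T11, (i - 1, j - 1)))
            best, pred[i, j] = min(cands, key=lambda c: c[0])
            D[i, j] = d[i, j] + best
    # traceback from the final state, storing the best predecessor of each cell
    path, (i, j) = [], (n - 1, m - 1)
    while (i, j) != (0, 0):
        path.append((i, j))
        i, j = pred[i, j]
    path.append((0, 0))
    return D[-1, -1], path[::-1]
```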

  4. DTW-based recognition
     • Reference templates for each possible word
     • Isolated word:
       - mark endpoints of the input word
       - calculate scores through each template (+ prune)
       - choose the best
     • Continuous speech:
       - one matrix of template slices; special-case constraints at word ends
       [Figure: stacked reference templates ONE, TWO, THREE, FOUR on the y-axis against input frames on the x-axis]

  5. DTW-based recognition (2)
     + Successfully handles timing variation
     + Able to recognize speech at reasonable cost
     - Distance metric?
       - pseudo-Euclidean space?
     - Warp penalties?
     - How to choose templates?
       - several templates per word?
       - choose 'most representative'?
       - align and average?
     → need a rigorous foundation...

  6. Outline
     1 Signal template matching
     2 Statistical sequence recognition
       - state-based modeling
     3 Acoustic modeling
     4 The Hidden Markov Model (HMM)

  7. Statistical sequence recognition 2
     • DTW limited because it is hard to optimize
       - interpretation of distance, transition costs?
     • Need a theoretical foundation: Probability
     • Formulate as a MAP choice among models:
       M* = argmax_{M_j} p(M_j | X, Θ)
       - X = observed features
       - M_j = word-sequence models
       - Θ = all current parameters

  8. Statistical formulation (2)
     • Can rearrange via Bayes' rule (& drop p(X)):
       M* = argmax_{M_j} p(M_j | X, Θ)
          = argmax_{M_j} p(X | M_j, Θ_A) · p(M_j | Θ_L)
       - p(X | M_j, Θ_A) = likelihood of observations under the model
       - p(M_j | Θ_L) = prior probability of the model
       - Θ_A = acoustics-related model parameters
       - Θ_L = language-related model parameters
     • Questions:
       - what form of model to use for p(X | M_j, Θ_A)?
       - how to find Θ_A (training)?
       - how to solve for M_j (decoding)?
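A minimal sketch of the MAP choice, assuming each candidate model supplies a log-likelihood log p(X | M_j, Θ_A) and the language model a log-prior log p(M_j | Θ_L); the function names and toy scores are illustrative assumptions.

```python
import math

def map_decode(models, acoustic_loglik, language_logprior):
    """Pick M* = argmax_j p(X | M_j, Θ_A) · p(M_j | Θ_L), computed in the log domain.

    models: list of candidate word-sequence models M_j
    acoustic_loglik(m): returns log p(X | m, Θ_A) for the observed features X
    language_logprior(m): returns log p(m | Θ_L)
    """
    return max(models, key=lambda m: acoustic_loglik(m) + language_logprior(m))

# toy usage with made-up scores for two competing hypotheses
scores = {"one two": (-120.0, math.log(0.6)), "want to": (-118.5, math.log(0.1))}
best = map_decode(list(scores), lambda m: scores[m][0], lambda m: scores[m][1])
```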

  9. State-based modeling
     • Assume a discrete-state model for the speech:
       - observations are divided up into time frames
       - model states → observations:
         model M_j states Q_k:           q_1 q_2 q_3 q_4 q_5 q_6 ...
         observed feature vectors X_1^N: x_1 x_2 x_3 x_4 x_5 x_6 ...
     • Probability of observations given the model is:
       p(X_1^N | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)
       - sum over all possible state sequences Q_k
     • How do observations depend on states?
       How do state sequences depend on the model?
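An illustrative brute-force version of that sum for a tiny model: enumerate every state sequence Q_k, weight the per-frame emission likelihoods by the sequence prior p(Q_k | M_j), and add them up. In practice the forward algorithm replaces this enumeration; the toy model parameters below are assumptions for the example.

```python
import itertools
import numpy as np

def seq_likelihood_bruteforce(pi, A, emit_lik):
    """p(X | M) = Σ_Q p(X | Q, M) · p(Q | M), summed over all state sequences Q.

    pi:       (S,) initial state probabilities
    A:        (S, S) transition probabilities, A[i, j] = p(q_{n+1}=j | q_n=i)
    emit_lik: (N, S) emission likelihoods, emit_lik[n, s] = p(x_n | q_n=s)
    """
    N, S = emit_lik.shape
    total = 0.0
    for Q in itertools.product(range(S), repeat=N):   # every possible state sequence
        p_Q = pi[Q[0]] * np.prod([A[Q[n - 1], Q[n]] for n in range(1, N)])
        p_X_given_Q = np.prod([emit_lik[n, Q[n]] for n in range(N)])
        total += p_Q * p_X_given_Q
    return total

# toy 2-state model and 4 frames of made-up emission likelihoods
pi = np.array([1.0, 0.0])
A = np.array([[0.8, 0.2], [0.0, 1.0]])
lik = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
print(seq_likelihood_bruteforce(pi, A, lik))
```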

  10. The speech recognition chain
     • After classification, we still have the problem of classifying the sequences of frames:
       sound → feature calculation → feature vectors
             → acoustic classifier (network weights) → phone probabilities
             → HMM decoder (word models, language model) → phone & word labeling
     • Questions
       - what to use for the acoustic classifier?
       - how to represent 'model' sequences?
       - how to score matches?

  11. Outline
     1 Signal template matching
     2 Statistical sequence recognition
     3 Acoustic modeling
       - defining targets
       - neural networks & Gaussian models
     4 The Hidden Markov Model (HMM)

  12. Acoustic Modeling 3
     • Goal: convert features into probabilities of particular labels:
       i.e. find p(q_n^i | X_n) over some state set { q^i }
       - a conventional statistical classification problem
     • Classifier construction is data-driven
       - assume we can get examples of known good Xs for each of the q^i s
       - calculate model parameters by a standard training scheme
     • Various classifiers can be used
       - GMMs model the distribution under each state
       - neural nets directly estimate posteriors
     • Different classifiers have different properties
       - features, labels limit ultimate performance

  13. Defining classifier targets
     • Choice of { q^i } can make a big difference
       - must support the recognition task
       - must be a practical classification task
     • Hand-labeling is one source...
       - 'experts' mark spectrogram boundaries
     • ...Forced alignment is another
       - 'best guess' with existing classifiers, given the words
     • Result is targets for each training frame:
       [Figure: feature vectors over time with per-frame training targets, e.g. the phone labels g, w, eh, n]

  14. Forced alignment
     • Best labeling given the existing classifier, constrained by the known word sequence
       [Figure: feature vectors → existing classifier → phone posterior probabilities;
        the known word sequence plus a dictionary give the phone sequence (ow th r iy ...);
        constrained alignment of the posteriors to that sequence yields per-frame training
        targets, which feed back into classifier training]
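A sketch of the constrained-alignment step, assuming we already have per-frame phone log-probabilities from the existing classifier and the phone sequence from the dictionary; dynamic programming then finds the best monotonic assignment of frames to that fixed sequence (stay on the current phone or advance by one). All names and the transition scheme here are illustrative assumptions.

```python
import numpy as np

def force_align(frame_logprob, phone_ids):
    """Align frames to a fixed phone sequence (stay or advance one phone per frame).

    frame_logprob: (N, P) log p(q = phone | x_n) from the existing classifier
    phone_ids:     list of K phone indices from the dictionary, in order
    returns: list of N phone indices, one training target per frame
    """
    N, K = frame_logprob.shape[0], len(phone_ids)
    score = np.full((N, K), -np.inf)
    back = np.zeros((N, K), dtype=int)
    score[0, 0] = frame_logprob[0, phone_ids[0]]
    for n in range(1, N):
        for k in range(K):
            stay = score[n - 1, k]
            advance = score[n - 1, k - 1] if k > 0 else -np.inf
            back[n, k] = k if stay >= advance else k - 1
            score[n, k] = max(stay, advance) + frame_logprob[n, phone_ids[k]]
    # traceback: the path must end in the last phone of the sequence
    k, targets = K - 1, []
    for n in range(N - 1, -1, -1):
        targets.append(phone_ids[k])
        k = back[n, k]
    return targets[::-1]
```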

  15. Gaussian Mixture Models vs. Neural Nets
     • GMMs fit the distribution of features under each state:
       - separate 'likelihood' model for each state q_k:
         p(x | q_k) = 1 / ((2π)^{d/2} |Σ_k|^{1/2}) · exp( -1/2 (x - µ_k)^T Σ_k^{-1} (x - µ_k) )
       - can match any distribution given enough data
     • Neural nets estimate posteriors directly:
         p(q_k | x) = F[ Σ_j w_jk · F[ Σ_i w_ij x_i ] ]
       - parameters set to discriminate classes
     • Posteriors & likelihoods related by Bayes' rule:
         p(q_k | x) = p(x | q_k) · Pr(q_k) / Σ_j p(x | q_j) · Pr(q_j)
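A small sketch of both sides of that relationship: a Gaussian likelihood per state and the Bayes'-rule conversion from likelihoods plus priors to posteriors. Using a single Gaussian per state rather than a full mixture, and the toy means, covariances, and priors, are simplifications for the example.

```python
import numpy as np

def gaussian_loglik(x, mu, Sigma):
    """log p(x | q_k) for a d-dimensional Gaussian with mean mu and covariance Sigma."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def posteriors_from_likelihoods(logliks, priors):
    """Bayes' rule: p(q_k | x) = p(x | q_k) Pr(q_k) / Σ_j p(x | q_j) Pr(q_j)."""
    w = np.exp(logliks - np.max(logliks)) * priors   # shift logs for numerical safety
    return w / w.sum()

# toy usage: 2-dim features, two states with assumed parameters
x = np.array([0.5, -0.2])
states = [(np.zeros(2), np.eye(2)), (np.ones(2), 0.5 * np.eye(2))]
logliks = np.array([gaussian_loglik(x, mu, S) for mu, S in states])
print(posteriors_from_likelihoods(logliks, priors=np.array([0.7, 0.3])))
```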

  16. Outline
     1 Signal template matching
     2 Statistical sequence recognition
     3 Acoustic classification
     4 The Hidden Markov Model (HMM)
       - generative Markov models
       - hidden Markov models
       - model fit likelihood
       - HMM examples

  17. Markov models 4
     • A (first-order) Markov model is a finite-state system whose behavior depends only on the current state
     • E.g. a generative Markov model with states S, A, B, C, E and transition probabilities p(q_{n+1} | q_n):
                 q_{n+1}:  S    A    B    C    E
         q_n = S           0    1    0    0    0
         q_n = A           0   .8   .1   .1    0
         q_n = B           0   .1   .8   .1    0
         q_n = C           0   .1   .1   .7   .1
         q_n = E           0    0    0    0    1
       - an example state sequence generated by this model:
         S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
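A short sketch that draws state sequences from the transition matrix above (with E as an absorbing end state), just to make concrete what "generative" means here:

```python
import numpy as np

states = ["S", "A", "B", "C", "E"]
# p(q_{n+1} | q_n): rows = current state, columns = next state (values from the slide)
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # S
    [0.0, 0.8, 0.1, 0.1, 0.0],   # A
    [0.0, 0.1, 0.8, 0.1, 0.0],   # B
    [0.0, 0.1, 0.1, 0.7, 0.1],   # C
    [0.0, 0.0, 0.0, 0.0, 1.0],   # E (absorbing end state)
])

def sample_sequence(rng, max_len=100):
    """Generate one state sequence S ... E by sampling each transition in turn."""
    seq, q = ["S"], 0
    for _ in range(max_len):
        q = rng.choice(len(states), p=A[q])
        seq.append(states[q])
        if states[q] == "E":
            break
    return seq

print(" ".join(sample_sequence(np.random.default_rng(0))))
```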

  18. Hidden Markov models
     • Markov models where the state sequence Q = { q_n } is not directly observable (= 'hidden')
     • But the observations X do depend on Q:
       - x_n is a random variable whose distribution p(x | q) depends on the current state
       [Figure: a state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC over time steps n, two possible observation sequences x_n generated from it, and the per-state emission distributions p(x | q) for q = A, B, C]
     • Can still tell something about the state sequence from the observations...
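Extending the previous sketch, a hidden Markov model adds a per-state emission distribution; here each emitting state gets a 1-D Gaussian. The means and standard deviations are made up for illustration, not taken from the slide.

```python
import numpy as np

# assumed 1-D Gaussian emission parameters (mean, std) for the emitting states
emission = {"A": (1.0, 0.3), "B": (2.0, 0.3), "C": (3.0, 0.3)}

def sample_observations(state_seq, rng):
    """Draw one observation x_n ~ p(x | q_n) for every emitting state in the sequence."""
    return [rng.normal(*emission[q]) for q in state_seq if q in emission]

rng = np.random.default_rng(0)
hidden = list("AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC")   # the state sequence is not observed
x = sample_observations(hidden, rng)                # only x is seen by the recognizer
print(np.round(x, 2))
```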

  19. (Generative) Markov models (2)
     • An HMM is specified by:
       - states q^i
       - transition probabilities a_ij ≡ p(q_n = j | q_{n-1} = i)
       - initial state probabilities π_i ≡ p(q_1 = i)
       - emission distributions b_i(x) ≡ p(x | q = i)
     • Example: a left-to-right model over the states k, a, t (with • as a start/end state)
       transition probabilities a_ij:
                    to:   k    a    t    •
         from •         1.0  0.0  0.0  0.0
         from k         0.9  0.1  0.0  0.0
         from a         0.0  0.9  0.1  0.0
         from t         0.0  0.0  0.9  0.1
       emission distributions b_i(x): one p(x | q) curve per state
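The left-to-right structure above is easy to build programmatically. A sketch, assuming the 0.9/0.1 values from the example; it also shows that a self-loop probability p implies a geometric state duration with mean 1/(1-p) frames (10 here), which is how the self-loops on the next slide provide time dilation.

```python
import numpy as np

def left_to_right_A(n_states, self_loop=0.9):
    """Transition matrix for a left-to-right HMM: stay with prob self_loop, else advance."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = self_loop
        A[i, i + 1] = 1.0 - self_loop
    A[-1, -1] = 1.0          # last state absorbs here; the exit arc is handled separately
    return A

A = left_to_right_A(3)       # states k, a, t
print(A)
print("expected frames per state:", 1.0 / (1.0 - 0.9))
```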

  20. Markov models for speech
     • Speech models M_j
       - typically left-to-right HMMs (sequence constraint)
       - observations & state evolution are conditionally independent of the rest given the (hidden) state q_n
       [Figure: hidden states q_1 ... q_5 each emitting an observation x_1 ... x_5; a word model S → ae_1 → ae_2 → ae_3 → E with a self-loop on each emitting state]
       - self-loops allow time dilation
