
Automatic Speech Recognition (CS753) - PowerPoint PPT Presentation


  1. Automatic Speech Recognition (CS753), Lecture 4: WFSTs in ASR + Basics of Speech Production. Instructor: Preethi Jyothi


  2. Quiz-1 Postmortem
  [Bar chart: counts of correct vs. incorrect answers for questions 1 ((ab)*a), 2a (Digits), and 2b (SOS); axis 0 to 80.]
  Common Mistakes:
  • 2(a): used complete words ("ZERO", etc.) as the output vocabulary rather than letters.
  • 2(b): no self-loops on the start/final state in the "SOS" machine.
  • 2(b): all states marked as final.

  3. Project Proposal
  • Start brainstorming!
  • Discuss potential ideas with me during my office hours (Thu, 5.30 pm to 6.30 pm) or schedule a meeting.
  • Once decided, send me a (plain ASCII) email specifying:
    • Title of the project
    • Full names of all project members
    • A 300-400 word abstract of the proposed project
  • Email due by 11.59 pm on Jan 30th.

  4. Determinization/Minimization: Recap
  • A (W)FST is deterministic if:
    • it has a unique start state
    • no two transitions from a state share the same input label
    • it has no epsilon input labels
    (a minimal determinism check is sketched below)
  • Minimization finds an equivalent deterministic FST with the least number of states (and transitions).
  • For a deterministic weighted automaton, weight pushing + (unweighted) automata minimization leads to a minimal weighted automaton.
  • This is guaranteed to yield a deterministic/minimized WFSA under some technical conditions characterising the automata (e.g. the twins property) and the weight semiring (allowing for weight pushing).
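To make the determinism conditions above concrete, here is a minimal Python sketch (not OpenFst; the arc-list representation and the name is_deterministic are assumptions made for illustration):

```python
# Checks the determinism conditions listed above on a toy WFST represented
# as a list of arcs (src, input_label, output_label, weight, dst).
EPS = "<eps>"

def is_deterministic(arcs, start_states):
    # Condition 1: a unique start state
    if len(start_states) != 1:
        return False
    seen = set()
    for src, ilabel, _olabel, _weight, _dst in arcs:
        # Condition 3: no epsilon input labels
        if ilabel == EPS:
            return False
        # Condition 2: no two arcs leaving the same state share an input label
        if (src, ilabel) in seen:
            return False
        seen.add((src, ilabel))
    return True

# Example: two arcs from state 0 with the same input label 'a' -> not deterministic
arcs = [(0, "a", "x", 0.5, 1), (0, "a", "y", 0.3, 2)]
print(is_deterministic(arcs, start_states={0}))  # False
```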

  5. WFSTs applied to ASR

  6. WFST-based ASR System
  [Pipeline figure: Acoustic Models → Context Transducer → Pronunciation Model → Language Model; the intermediate representations are Acoustic Indices → Triphones → Monophones → Words → Word Sequence.]
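The stages of this pipeline are combined by WFST composition. As a rough illustration, here is a minimal sketch (not the OpenFst implementation; it ignores epsilon handling and reachability pruning, and the arc-list representation is an assumption):

```python
def compose(A, B):
    """Epsilon-free composition of two transducers given as arc lists
    (src, input, output, weight, dst). States of A∘B are pairs (a_state, b_state);
    an arc x:z exists when A has x:y and B has y:z, with weights added."""
    by_input = {}
    for bs, bi, bo, bw, bd in B:
        by_input.setdefault(bi, []).append((bs, bo, bw, bd))
    arcs = []
    for a_src, ai, ao, aw, a_dst in A:
        for b_src, bo, bw, b_dst in by_input.get(ao, []):
            arcs.append(((a_src, b_src), ai, bo, aw + bw, (a_dst, b_dst)))
    return arcs

# Toy example: A maps acoustic index f0 to phone 'd'; B maps 'd' to the word 'dew'
A = [(0, "f0", "d", 0.5, 1)]
B = [(0, "d", "dew", 1.0, 1)]
print(compose(A, B))  # [((0, 0), 'f0', 'dew', 1.5, (1, 1))]
```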

  7. WFST-based ASR System: H (maps acoustic state indices to triphones)
  [Figure: one 3-state HMM FST for each triphone, e.g. a/a_b with arcs f0:a+a+b, f1:ε, f2:ε, f3:ε, f4:ε, f5:ε, f6:ε; only the first arc emits the triphone label, the remaining arcs emit ε. The union of these per-triphone FSTs (a/a_b, b/a_b, ..., x/y_z) followed by closure gives the resulting FST H.]
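A minimal sketch of the union-plus-closure construction just described, using a toy arc-list representation rather than OpenFst; the state numbering, the helper names (hmm_chain, build_H), and the single shared start/final state are assumptions made for illustration:

```python
EPS = "<eps>"

def hmm_chain(triphone, state_ids, offset):
    """Left-to-right chain of arcs for one triphone HMM.
    Arcs: (src, input_label, output_label, dst). Only the first arc
    emits the triphone label; the rest emit epsilon."""
    arcs = []
    for i, f in enumerate(state_ids):
        out = triphone if i == 0 else EPS
        arcs.append((offset + i, f, out, offset + i + 1))
    return arcs, offset, offset + len(state_ids)  # arcs, entry state, exit state

def build_H(hmms):
    """Union of per-triphone chains + closure via a shared state 0 (start and final)."""
    arcs, offset = [], 1
    for triphone, state_ids in hmms.items():
        chain, entry, exit_ = hmm_chain(triphone, state_ids, offset)
        arcs.extend(chain)
        arcs.append((0, EPS, EPS, entry))   # union: the start can enter any chain
        arcs.append((exit_, EPS, EPS, 0))   # closure: loop back for the next triphone
        offset = exit_ + 1
    return arcs

# Example with two toy triphones, three emitting states each
H = build_H({"a/a_b": ["f0", "f1", "f2"], "b/a_b": ["f3", "f4", "f5"]})
for arc in H:
    print(arc)
```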

  8. WFST-based ASR System: C (maps triphones to monophones)
  [Figure: the context-dependency transducer (shown as C^-1) over a two-phone alphabet {x, y}. States track the surrounding context (e.g. (ε,*), (x,x), (x,y), (y,x), (y,y), (x,ε), (y,ε)) and arcs carry labels of the form "monophone : phone/left-context_right-context", e.g. x:x/y_x or y:y/x_y.]
  Figure reproduced from "Weighted Finite State Transducers in Speech Recognition", Mohri et al., 2002
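The relation that C encodes can be illustrated directly on a phone sequence. A minimal sketch (plain Python, not a transducer; the function name and the eps convention at the edges are assumptions) that rewrites monophones into the "phone/left-context_right-context" labels used above:

```python
def to_context_dependent(phones, eps="eps"):
    """Map a monophone sequence to triphone-style labels
    'phone/left-context_right-context', using eps at the sequence edges."""
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else eps
        right = phones[i + 1] if i + 1 < len(phones) else eps
        labels.append(f"{p}/{left}_{right}")
    return labels

print(to_context_dependent(["x", "y", "x"]))
# ['x/eps_y', 'y/x_x', 'x/y_eps']
```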

  9. WFST-based ASR System: L (maps monophones to words)
  [Figure: a fragment of the pronunciation lexicon FST for the words "data" and "dew". For "data": d:data/1, then ey:ε/0.5 or ae:ε/0.5, then dx:ε/0.7 or t:ε/0.3, then ax:ε/1. For "dew": d:dew/1, then uw:ε/1. Only the first arc of each pronunciation emits the word; the weights encode pronunciation probabilities.]
  Figure reproduced from "Weighted Finite State Transducers in Speech Recognition", Mohri et al., 2002
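A minimal sketch (toy arc-list representation, not OpenFst) of how a lexicon transducer like L can be assembled from a pronunciation dictionary: the first phone arc emits the word, the remaining arcs emit epsilon, and here the pronunciation cost is placed on the first arc as a negative log probability. The dictionary entries, weight placement, and state numbering are illustrative assumptions:

```python
import math

EPS = "<eps>"

def build_L(lexicon):
    """lexicon: {word: [(phone_sequence, probability), ...]}.
    Returns arcs (src, input_phone, output_word, weight, dst); state 0 is start and final."""
    arcs, next_state = [], 1
    for word, prons in lexicon.items():
        for phones, prob in prons:
            cost = -math.log(prob)  # tropical-semiring style cost on the first arc
            src = 0
            for i, phone in enumerate(phones):
                out = word if i == 0 else EPS
                w = cost if i == 0 else 0.0
                # last arc returns to state 0 so L accepts word sequences
                dst = 0 if i == len(phones) - 1 else next_state
                arcs.append((src, phone, out, w, dst))
                src = dst
                if dst != 0:
                    next_state += 1
    return arcs

L = build_L({"data": [(["d", "ey", "dx", "ax"], 0.5), (["d", "ae", "t", "ax"], 0.5)],
             "dew":  [(["d", "uw"], 1.0)]})
for arc in L:
    print(arc)
```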

  10. WFST-based ASR System: G (language model over word sequences)
  [Figure: a small word-level grammar acceptor with weighted arcs such as the, birds/0.404, animals/1.789, boy/1.789, are/0.693, were/0.693, is, walking; the weights are negative log probabilities of the word transitions.]
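Such a grammar acceptor can be derived from n-gram probabilities by taking negative logs as arc weights. A minimal bigram sketch follows; the probabilities are illustrative guesses chosen only to roughly reproduce the example weights above, and the state-per-history encoding is an assumption:

```python
import math

def bigram_to_G(bigrams):
    """bigrams: {(history_word, next_word): probability}.
    Returns acceptor arcs (src_state, word, weight, dst_state), identifying
    each state with the most recent word ('<s>' for the start)."""
    arcs = []
    for (history, word), prob in bigrams.items():
        arcs.append((history, word, -math.log(prob), word))
    return arcs

G = bigram_to_G({
    ("<s>", "the"): 1.0,
    ("the", "birds"): 0.668,     # -log(0.668) ~ 0.404
    ("the", "animals"): 0.167,   # -log(0.167) ~ 1.789
    ("the", "boy"): 0.167,
    ("birds", "are"): 0.5,       # -log(0.5) ~ 0.693
    ("birds", "were"): 0.5,
})
for arc in G:
    print(arc)
```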

  11. Constructing the Decoding Graph
  [Pipeline figure, as before: H, C, L and G map acoustic indices to triphones, monophones, words and word sequences.]
  Construct the decoding search graph D = H ∘ C ∘ L ∘ G that maps acoustic states to word sequences.
  Carefully construct D using optimization algorithms: D = min(det(H ∘ det(C ∘ det(L ∘ G)))).
  Decode a test utterance O by aligning the acceptor X (corresponding to O) with H ∘ C ∘ L ∘ G:
  W* = out[π*], where π* is the least-cost path in X ∘ H ∘ C ∘ L ∘ G and out[π] denotes the output label sequence of path π.
  "Weighted Finite State Transducers in Speech Recognition", Mohri et al., Computer Speech & Language, 2002
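A minimal sketch of the decoding step just described: a least-cost (tropical-semiring shortest-path) search over a toy composed graph, returning the output label sequence W*. The graph, costs, and labels below are made up for illustration; real systems use OpenFst-style graphs with beam search over HCLG:

```python
import heapq

def decode(arcs, start, finals):
    """Least-cost path over a weighted transducer given as arcs
    (src, input_label, output_label, weight, dst); returns (cost, output label sequence)."""
    adj = {}
    for src, ilab, olab, w, dst in arcs:
        adj.setdefault(src, []).append((w, olab, dst))
    # Dijkstra over the tropical semiring (min, +); weights assumed non-negative
    heap = [(0.0, start, [])]
    settled = set()
    while heap:
        cost, state, outputs = heapq.heappop(heap)
        if state in settled:
            continue
        settled.add(state)
        if state in finals:
            return cost, [o for o in outputs if o != "<eps>"]
        for w, olab, dst in adj.get(state, []):
            if dst not in settled:
                heapq.heappush(heap, (cost + w, dst, outputs + [olab]))
    return float("inf"), []

# Toy composed graph X ∘ H ∘ C ∘ L ∘ G with two competing word hypotheses
arcs = [(0, "f0", "the", 1.2, 1), (1, "f1", "birds", 0.4, 2), (1, "f1", "boy", 1.8, 2),
        (2, "f2", "are", 0.7, 3)]
print(decode(arcs, start=0, finals={3}))  # (~2.3, ['the', 'birds', 'are'])
```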

  12. Constructing the Decoding Graph
  Decode a test utterance O by aligning the acceptor X (corresponding to O) with H ∘ C ∘ L ∘ G:
  W* = out[π*], where π* is the least-cost path in X ∘ H ∘ C ∘ L ∘ G and out[π] denotes the output label sequence of path π.
  Structure of X (derived from O):
  [Figure: X is a linear chain with one link per acoustic frame; each link carries parallel arcs labelled f0, f1, ..., f500, ..., f1000, weighted by that frame's observation cost, e.g. f0:19.12, f1:12.33, ..., f1000:11.11 on the first link.]
  "Weighted Finite State Transducers in Speech Recognition", Mohri et al., Computer Speech & Language, 2002

  13. Constructing the Decoding Graph
  [Figure, as before: the chain acceptor X with parallel arcs f0, f1, ..., f1000 and their observation costs on each link.]
  • Each f_k maps to a distinct triphone HMM state j
  • Weights of arcs in the i-th chain link correspond to observation probabilities b_j(o_i) (discussed in the next lecture)
  • X is a very large FST which is never explicitly constructed!
  • H ∘ C ∘ L ∘ G is typically traversed dynamically (search algorithms will be covered later in the semester)
  "Weighted Finite State Transducers in Speech Recognition", Mohri et al., Computer Speech & Language, 2002
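For intuition only (as the slide notes, X is never built explicitly in practice; decoders look the observation costs up frame by frame), here is a minimal sketch that materialises such a chain acceptor from a per-frame table of costs; the labels, state count, and cost values are illustrative:

```python
def build_X(obs_costs):
    """obs_costs: list over frames; each frame is a dict {acoustic_state_label: cost}
    (e.g. -log b_j(o_i)). Returns acceptor arcs (src, label, cost, dst) forming a chain."""
    arcs = []
    for i, frame in enumerate(obs_costs):
        for label, cost in frame.items():
            arcs.append((i, label, cost, i + 1))  # one parallel arc per acoustic state
    return arcs  # start state 0, final state len(obs_costs)

# Two frames, three acoustic states (illustrative costs)
X = build_X([{"f0": 19.12, "f1": 12.33, "f1000": 11.11},
             {"f0": 18.52, "f1": 13.45, "f1000": 15.99}])
for arc in X:
    print(arc)
```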

  14. Impact of WFST Optimizations
  40K NAB Evaluation Set '95 (83% word accuracy)

  network                  states        transitions
  G                        1,339,664     3,926,010
  L ∘ G                    8,606,729     11,406,721
  det(L ∘ G)               7,082,404     9,836,629
  C ∘ det(L ∘ G)           7,273,035     10,201,269
  det(H ∘ C ∘ L ∘ G)       18,317,359    21,237,992

  network                  x real-time
  C ∘ L ∘ G                12.5
  C ∘ det(L ∘ G)           1.2
  det(H ∘ C ∘ L ∘ G)       1.0
  push(min(F))             0.7

  Tables from http://www.openfst.org/twiki/pub/FST/FstHltTutorial/tutorial_part3.pdf

  15. Basics of Speech Production

  16. Speech Production
  [Figure: schematic representation of the vocal organs. Schematic from L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, 1993; figure from http://www.phon.ucl.ac.uk/courses/spsci/iss/week6.php]

  17. Sound units
  • Phones are acoustically distinct units of speech
  • Phonemes are abstract linguistic units that impart different meanings in a given language
    • Minimal pair: pan vs. ban
  • Allophones are different acoustic realisations of the same phoneme
  • Phonetics is the study of speech sounds and how they're produced
  • Phonology is the study of patterns of sounds in different languages

  18. Vowels
  • Sounds produced with no obstruction to the flow of air through the vocal tract
  [Figure: IPA vowel quadrilateral. Image from https://en.wikipedia.org/wiki/File:IPA_vowel_chart_2005.png]

  19. Formants of vowels
  • Formants are resonance frequencies of the vocal tract (denoted F1, F2, etc.)
  • F0 denotes the fundamental frequency of the periodic source (vibrating vocal folds)
  • Formant locations specify certain vowel characteristics

  20. Spectrogram
  • A spectrogram is a sequence of spectra stacked together in time, with the amplitude of the frequency components expressed as a heat map
  • Spectrograms of certain vowels: http://www.phon.ucl.ac.uk/courses/spsci/iss/week5.php
  • Praat (http://www.fon.hum.uva.nl/praat/) is a good toolkit to analyse speech signals (plot spectrograms, generate formant/pitch curves, etc.)
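Besides Praat, a spectrogram can also be plotted programmatically. A minimal Python sketch, assuming numpy, scipy, and matplotlib are available; the synthetic signal below merely stands in for a recorded vowel, which you would load from a file instead:

```python
import numpy as np
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

fs = 16000                      # sampling rate in Hz
t = np.arange(0, 1.0, 1 / fs)   # one second of signal
# Synthetic stand-in for a vowel: a 120 Hz "F0" plus two "formant-like" components
x = (np.sin(2 * np.pi * 120 * t)
     + 0.5 * np.sin(2 * np.pi * 700 * t)
     + 0.3 * np.sin(2 * np.pi * 1200 * t))

f, times, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=384)
plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12), shading="auto")  # dB heat map
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.ylim(0, 4000)
plt.title("Spectrogram of a synthetic vowel-like signal")
plt.show()
```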

  21. Consonants (voicing/place/manner)
  • "Consonants are made by restricting or blocking the airflow in some way, and may be voiced or unvoiced." (J&M, Ch. 7)
  • Consonants can be labeled depending on
    • where the constriction is made
    • how the constriction is made

  22. Voiced/Unvoiced Sounds
  • Sounds made with the vocal cords vibrating: voiced
    • E.g. /g/, /d/, etc.
    • All English vowel sounds are voiced
  • Sounds made without vocal cord vibration: voiceless
    • E.g. /k/, /t/, etc.

  23. Place of articulation
  • Bilabial (both lips): [b], [p], [m], etc.
  • Labiodental (lower lip and upper teeth): [f], [v], etc.
  • Interdental (tip of tongue between teeth): [θ] (thought), [ð] (this)

  24. Place of articulation
  • Alveolar (tongue tip on the alveolar ridge): [n], [t], [s], etc.
  • Palatal (tongue up close to the hard palate): [sh], [ch] (palato-alveolar), [y], etc.
  • Velar (tongue near the velum): [k], [g], etc.
  • Glottal (produced at the larynx): [h], glottal stops

  25. Manner of articulation
  • Plosive/Stop (airflow completely blocked, followed by a release): [p], [g], [t], etc.
  • Fricative (constricted airflow): [f], [s], [th], etc.
  • Affricate (stop + fricative): [ch], [jh], etc.
  • Nasal (lowering of the velum): [n], [m], etc.
  See real-time MRI productions of vowels and consonants here: http://sail.usc.edu/span/rtmri_ipa/je_2015.html
