Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753)
Lecture 4: WFSTs in ASR + Basics of Speech Production
Instructor: Preethi Jyothi

Quiz-1 Postmortem
Common mistakes:
• 2(a): Output vocabulary used complete words (“ZERO”, etc.) rather than letters.
• 2(b): No self-loops on the start/final state in the “SOS” machine.
• 2(b): All states marked as final.
[Bar chart: correct vs. incorrect answers for questions 1 ((ab)*a), 2a (Digits) and 2b (SOS); axis from 0 to 80]

Project Proposal
• Start brainstorming!
• Discuss potential ideas with me during my office hours (Thur, 5.30 pm to 6.30 pm) or schedule a meeting
• Once decided, send me a (plain ASCII) email specifying:
  • Title of the project
  • Full names of all project members
  • A 300-400 word abstract of the proposed project
• Email due by 11.59 pm on Jan 30th.

Determinization/Minimization: Recap
• A (W)FST is deterministic if:
  • Unique start state
  • No two transitions from a state share the same input label
  • No epsilon input labels
• Minimization finds an equivalent deterministic FST with the least number of states (and transitions)
• For a deterministic weighted automaton, weight pushing + (unweighted) automata minimization leads to a minimal weighted automaton
• Guaranteed to yield a deterministic/minimized WFSA under some technical conditions characterising the automata (e.g. twins property) and the weight semiring (allowing for weight pushing)
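The three determinism conditions in the recap can be checked mechanically. The following is a minimal sketch, not a toolkit API: the `Arc` tuple, the `<eps>` spelling of the epsilon label and the function name are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical arc representation: source state, input label,
# output label, weight, destination state.
Arc = namedtuple("Arc", "src ilabel olabel weight dst")
EPS = "<eps>"  # assumed spelling of the epsilon label


def is_deterministic(start_states, arcs):
    """Return True if the three conditions from the recap all hold."""
    if len(set(start_states)) != 1:      # unique start state
        return False
    seen = set()
    for arc in arcs:
        if arc.ilabel == EPS:            # no epsilon input labels
            return False
        key = (arc.src, arc.ilabel)
        if key in seen:                  # two arcs from one state share an input label
            return False
        seen.add(key)
    return True


det_fst = [Arc(0, "a", "x", 0.5, 1), Arc(0, "b", "y", 0.5, 2)]
nondet_fst = [Arc(0, "a", "x", 0.5, 1), Arc(0, "a", "y", 0.5, 2)]
eps_fst = [Arc(0, EPS, "x", 0.5, 1)]

print(is_deterministic([0], det_fst))
print(is_deterministic([0], nondet_fst))
print(is_deterministic([0], eps_fst))
```

Only the input side is inspected: two arcs may share an output label without violating determinism as defined above.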

WFSTs applied to ASR

WFST-based ASR System
Acoustic Indices → [Acoustic Models] → Triphones → [Context Transducer] → Monophones → [Pronunciation Model] → Words → [Language Model] → Word Sequence

WFST-based ASR System: Acoustic Models (H)
• One 3-state HMM for each triphone (a/a_b, b/a_b, ..., x/y_z)
• FST union + closure of the per-triphone HMMs gives the resulting FST H
[Figure: HMM transducer for triphone a/a_b, with arc labels f0:a+a+b and f1:ε through f6:ε]

WFST-based ASR System: Context Transducer (C)
• C⁻¹ arc labels: “monophone : phone / left-context_right-context”
[Figure: C⁻¹ over the two phones x and y; states (ε,*), (x,x), (x,y), (y,x), (y,y), (x,ε), (y,ε); arcs such as x:x/ε_y, y:y/x_x, x:x/y_ε]
Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002

WFST-based ASR System: Pronunciation Model (L)
[Figure: pronunciation lexicon transducer over states 0 to 6, with arcs d:data/1, ey:ε/0.5, ae:ε/0.5, t:ε/0.3, dx:ε/0.7, ax:ε/1 for “data” and d:dew/1, uw:ε/1 for “dew”]
Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002
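To make the lexicon figure concrete, here is a small sketch that enumerates every accepting path of the toy transducer and multiplies the arc weights along it. The state numbering and arc connectivity are reconstructed from the arc labels and are an assumption, as is reading the weights as probabilities.

```python
# Arcs of the toy lexicon: (src, phone, word or "" for epsilon, probability, dst).
# The connectivity is an assumption reconstructed from the arc labels.
ARCS = [
    (0, "d", "data", 1.0, 1),
    (1, "ey", "", 0.5, 2),
    (1, "ae", "", 0.5, 2),
    (2, "t", "", 0.3, 3),
    (2, "dx", "", 0.7, 3),
    (3, "ax", "", 1.0, 4),
    (0, "d", "dew", 1.0, 5),
    (5, "uw", "", 1.0, 6),
]
FINAL = {4, 6}  # assumed final states


def pronunciations(state=0, phones=(), words=(), prob=1.0):
    """Yield (word, phone string, probability) for every accepting path."""
    if state in FINAL:
        yield " ".join(words), " ".join(phones), prob
    for src, phone, word, p, dst in ARCS:
        if src == state:
            new_words = words + ((word,) if word else ())
            yield from pronunciations(dst, phones + (phone,), new_words, prob * p)


for word, phones, prob in pronunciations():
    print(f"{word:5s} {phones:12s} {prob:.2f}")
```

Under these assumptions the four pronunciations of "data" share a total probability of 1, which is what one would expect from a normalized lexicon.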

WFST-based ASR System: Language Model (G)
[Figure: word-level acceptor G starting at state 0, with arcs the, birds/0.404, animals/1.789, boy/1.789, are/0.693, were/0.693, is, walking]
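The weights on G look like negative natural-log probabilities, although the slide does not say so. The short sketch below converts them back under that assumption; the grouping of words into a dictionary is only for illustration.

```python
import math

# Arc weights read off the G slide. Interpreting them as negative natural-log
# probabilities is an assumption; the slide does not state the convention.
weights = {"birds": 0.404, "animals": 1.789, "boy": 1.789,
           "are": 0.693, "were": 0.693}

probs = {word: math.exp(-w) for word, w in weights.items()}
for word, p in probs.items():
    print(f"{word:8s} weight={weights[word]:.3f}  prob~{p:.3f}")

# Under this reading, costs add along a path where probabilities multiply.
cost = weights["birds"] + weights["are"]
print(f"cost(birds are) = {cost:.3f}, prob ~ {math.exp(-cost):.3f}")
```

With this reading, 0.693 corresponds to a probability of about 1/2, and the three noun weights correspond to roughly 2/3, 1/6 and 1/6.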

Constructing the Decoding Graph
• Decoding graph: D = H ⚬ C ⚬ L ⚬ G
• Construct the decoding search graph using H ⚬ C ⚬ L ⚬ G, which maps acoustic states to word sequences
• Carefully construct D using optimization algorithms: D = min(det(H ⚬ det(C ⚬ det(L ⚬ G))))
• Decode test utterance O by aligning acceptor X (corresponding to O) with H ⚬ C ⚬ L ⚬ G:
  $W^* = \arg\min_{W = \mathrm{out}[\pi]} \; X \circ H \circ C \circ L \circ G$
  where π is a path in the composed FST and out[π] is the output label sequence of π
“Weighted Finite State Transducers in Speech Recognition”, Mohri et al., Computer Speech & Language, 2002
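Composition followed by a lowest-cost path search can be sketched in a few lines. The code below handles only epsilon-free machines in the tropical semiring (weights add along a path, the minimum is taken across paths); the real H, C, L and G contain ε labels and are built with a dedicated toolkit. The dictionary layout and the two toy machines are assumptions for illustration.

```python
import heapq


def compose(a, b):
    """Compose two epsilon-free WFSTs; arcs are (src, ilabel, olabel, weight, dst)."""
    start = (a["start"], b["start"])
    arcs, stack, seen = [], [start], {start}
    while stack:
        qa, qb = stack.pop()
        for sa, ia, oa, wa, da in a["arcs"]:
            if sa != qa:
                continue
            for sb, ib, ob, wb, db in b["arcs"]:
                if sb == qb and ib == oa:   # output of a must match input of b
                    dst = (da, db)
                    arcs.append(((qa, qb), ia, ob, wa + wb, dst))
                    if dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
    finals = {s for s in seen if s[0] in a["finals"] and s[1] in b["finals"]}
    return {"start": start, "finals": finals, "arcs": arcs}


def shortest_path(fst):
    """Dijkstra over non-negative weights; returns (cost, output labels) or None."""
    heap = [(0.0, 0, fst["start"], ())]
    done, tie = set(), 0
    while heap:
        cost, _, state, out = heapq.heappop(heap)
        if state in done:
            continue
        done.add(state)
        if state in fst["finals"]:
            return cost, list(out)
        for src, _ilabel, olabel, w, dst in fst["arcs"]:
            if src == state and dst not in done:
                tie += 1                    # tie-breaker so states are never compared
                heapq.heappush(heap, (cost + w, tie, dst, out + (olabel,)))
    return None


# Two toy machines standing in for adjacent stages of the cascade.
A = {"start": 0, "finals": {1},
     "arcs": [(0, "a", "x", 1.0, 1), (0, "a", "y", 0.5, 1)]}
B = {"start": 0, "finals": {1},
     "arcs": [(0, "x", "X", 0.2, 1), (0, "y", "Y", 2.0, 1)]}

best = shortest_path(compose(A, B))
print(best)
```

The cheaper first-stage arc (a:y at 0.5) loses here because its continuation in the second machine is expensive, which is why the search has to run over the composed graph rather than stage by stage.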

Constructing the Decoding Graph
Structure of X (derived from O), one column per chain link:

         link 1    link 2    link 3    link 4    …
f0       19.12     18.52     10.578    9.21
f1       12.33     13.45     5.645     14.221
⋮
f500     20.21     10.21     8.123     11.233
f1000    11.11     15.99     5.678     15.638

“Weighted Finite State Transducers in Speech Recognition”, Mohri et al., Computer Speech & Language, 2002

Constructing the Decoding Graph
[Figure: chain-structured acceptor X; every chain link carries one weighted arc per label f0, f1, …, f1000]
• Each f_k maps to a distinct triphone HMM state j
• Weights of arcs in the i-th chain link correspond to observation probabilities b_j(o_i) (discussed in the next lecture)
• X is a very large FST which is never explicitly constructed!
• H ⚬ C ⚬ L ⚬ G is typically traversed dynamically (search algorithms will be covered later in the semester)
“Weighted Finite State Transducers in Speech Recognition”, Mohri et al., Computer Speech & Language, 2002
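One way to picture "never explicitly constructed" is to generate the chain links of X lazily, one frame at a time, and consume them during the search. The sketch below aligns such a stream against a simple left-to-right chain of states. The probabilities, the negative-log cost convention and the absence of transition weights are all simplifying assumptions; this is not the search algorithm used in a real decoder.

```python
import math


def chain_links(frame_probs):
    """Lazily yield one chain link of X per frame as {state label: cost}.

    frame_probs is a hypothetical list of per-frame observation
    probabilities b_j(o_i); costs are negative log probabilities.
    """
    for probs in frame_probs:
        yield {label: -math.log(p) for label, p in probs.items()}


def align_left_to_right(links, states):
    """Lowest cost of aligning the frames to a left-to-right state chain,
    where each state may repeat (self-loop) or advance to the next state."""
    inf = float("inf")
    best = None
    for link in links:
        if best is None:
            new = [inf] * len(states)
            new[0] = link[states[0]]          # alignment must start in the first state
        else:
            new = []
            for j, s in enumerate(states):
                stay = best[j]
                advance = best[j - 1] if j > 0 else inf
                new.append(min(stay, advance) + link[s])
        best = new
    return inf if best is None else best[-1]  # must end in the last state


frame_probs = [
    {"f0": 0.8, "f1": 0.1, "f2": 0.1},
    {"f0": 0.6, "f1": 0.3, "f2": 0.1},
    {"f0": 0.1, "f1": 0.7, "f2": 0.2},
    {"f0": 0.1, "f1": 0.2, "f2": 0.7},
]
cost = align_left_to_right(chain_links(frame_probs), ["f0", "f1", "f2"])
print(f"best cost = {cost:.4f}, probability = {math.exp(-cost):.4f}")
```

Only one chain link is held in memory at any time, so the full acceptor X never has to be materialized.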

Impact of WFST Optimizations
40K NAB Evaluation Set ’95 (83% word accuracy)

network                 states        transitions
G                       1,339,664     3,926,010
L ⚬ G                   8,606,729     11,406,721
det(L ⚬ G)              7,082,404     9,836,629
C ⚬ det(L ⚬ G)          7,273,035     10,201,269
det(H ⚬ C ⚬ L ⚬ G)      18,317,359    21,237,992

network                 x real-time
C ⚬ L ⚬ G               12.5
C ⚬ det(L ⚬ G)          1.2
det(H ⚬ C ⚬ L ⚬ G)      1.0
push(min(F))            0.7

Tables from http://www.openfst.org/twiki/pub/FST/FstHltTutorial/tutorial_part3.pdf

Basics of Speech Production

Speech Production
[Figure: schematic representation of the vocal organs]
Schematic from L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, 1993
Figure from http://www.phon.ucl.ac.uk/courses/spsci/iss/week6.php

Sound units
• Phones are acoustically distinct units of speech
• Phonemes are abstract linguistic units that impart different meanings in a given language
  • Minimal pair: pan vs. ban
• Allophones are different acoustic realisations of the same phoneme
• Phonetics is the study of speech sounds and how they’re produced
• Phonology is the study of patterns of sounds in different languages

Vowels
• Sounds produced with no obstruction to the flow of air through the vocal tract
[Figure: vowel quadrilateral]
Image from https://en.wikipedia.org/wiki/File:IPA_vowel_chart_2005.png

Formants of vowels
• Formants are resonance frequencies of the vocal tract (denoted by F1, F2, etc.)
• F0 denotes the fundamental frequency of the periodic source (vibrating vocal folds)
• Formant locations specify certain vowel characteristics

Spectrogram
• A spectrogram is a sequence of spectra stacked together in time, with the amplitude of the frequency components expressed as a heat map
• Spectrograms of certain vowels: http://www.phon.ucl.ac.uk/courses/spsci/iss/week5.php
• Praat (http://www.fon.hum.uva.nl/praat/) is a good toolkit to analyse speech signals (plot spectrograms, generate formants/pitch curves, etc.)
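The "sequence of spectra stacked together in time" can be sketched with a plain discrete Fourier transform over overlapping, windowed frames. The example below uses only the standard library, so it is slow and meant for illustration; the sample rate, frame length, hop size and test tone are arbitrary choices, not values from the lecture.

```python
import cmath
import math


def spectrogram(signal, frame_len, hop):
    """Return a list of magnitude spectra, one per frame (pure-Python DFT)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]             # Hann window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):          # non-negative frequencies only
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames


sample_rate = 8000                                    # Hz, chosen for illustration
frame_len = 200
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]
spec = spectrogram(tone, frame_len=frame_len, hop=100)

# Locate the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_len
print(len(spec), "frames; strongest component in frame 0 at", peak_hz, "Hz")
```

Each inner list is one column of the heat map; for a vowel, the bands of high magnitude across columns would correspond to the formants.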

Consonants (voicing/place/manner)
• “Consonants are made by restricting or blocking the airflow in some way, and may be voiced or unvoiced.” (J&M, Ch. 7)
• Consonants can be labeled depending on
  • where the constriction is made
  • how the constriction is made

Voiced/Unvoiced Sounds
• Sounds made with vocal cords vibrating: voiced
  • E.g. /g/, /d/, etc.
  • All English vowel sounds are voiced
• Sounds made without vocal cord vibration: voiceless
  • E.g. /k/, /t/, etc.

Place of articulation
• Bilabial (both lips): [b], [p], [m], etc.
• Labiodental (with lower lip and upper teeth): [f], [v], etc.
• Interdental (tip of tongue between teeth): [θ] (thought), [ð] (this)

Place of articulation
• Alveolar (tongue tip on alveolar ridge): [n], [t], [s], etc.
• Palatal (tongue up close to hard palate): [sh], [ch] (palato-alveolar), [y], etc.
• Velar (tongue near velum): [k], [g], etc.
• Glottal (produced at larynx): [h], glottal stops.

Manner of articulation
• Plosive/Stop (airflow completely blocked followed by a release): [p], [g], [t], etc.
• Fricative (constricted airflow): [f], [s], [th], etc.
• Affricate (stop + fricative): [ch], [jh], etc.
• Nasal (lowering velum): [n], [m], etc.
See realtime MRI productions of vowels and consonants here: http://sail.usc.edu/span/rtmri_ipa/je_2015.html
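The voicing/place/manner labels can be collected into a small lookup table. The sketch below covers only a few of the phones named on these slides; where a slide gives a phone under one heading only, the remaining feature values are filled in from standard phonetics, and the table layout and function name are assumptions.

```python
# (voicing, place, manner) for a subset of the phones named on the slides.
PHONES = {
    "p": ("voiceless", "bilabial", "stop"),
    "b": ("voiced", "bilabial", "stop"),
    "t": ("voiceless", "alveolar", "stop"),
    "d": ("voiced", "alveolar", "stop"),
    "k": ("voiceless", "velar", "stop"),
    "g": ("voiced", "velar", "stop"),
    "f": ("voiceless", "labiodental", "fricative"),
    "v": ("voiced", "labiodental", "fricative"),
    "s": ("voiceless", "alveolar", "fricative"),
    "m": ("voiced", "bilabial", "nasal"),
    "n": ("voiced", "alveolar", "nasal"),
}


def voicing_pairs(phones):
    """Return (voiceless, voiced) pairs that share place and manner."""
    pairs = []
    for a, (va, pa, ma) in phones.items():
        for b, (vb, pb, mb) in phones.items():
            if va == "voiceless" and vb == "voiced" and (pa, ma) == (pb, mb):
                pairs.append((a, b))
    return pairs


print(voicing_pairs(PHONES))
```

The pair (p, b) is the contrast behind the minimal pair "pan" vs. "ban": the two phones agree in place and manner and differ only in voicing.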
