[PPT] - WFSTs in ASR & Basics of Speech Production Lecture 6 CS 753 PowerPoint Presentation

SLIDE 1

Instructor: Preethi Jyothi

WFSTs in ASR & Basics of Speech Production

Lecture 6

CS 753

SLIDE 2

Determinization/Minimization: Recap

A (W)FST is deterministic if:
Unique start state
No two transitions from a state share the same input label
No epsilon input labels
Minimization finds an equivalent deterministic FST with the least number of states (and transitions)
For a deterministic weighted automaton, weight pushing + (unweighted) automata minimization leads to a minimal weighted automaton
Determinization and minimization are guaranteed to yield a deterministic/minimized WFSA only under some technical conditions characterising the automata (e.g. twins property) and the weight semiring (allowing for weight pushing)
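
The three determinism conditions in the recap can be checked mechanically. The following is a minimal sketch, assuming a hypothetical `Arc` record and a list-of-arcs representation (neither is an OpenFst type):

```python
from collections import namedtuple

# Hypothetical arc record used only for this illustration.
Arc = namedtuple("Arc", "src ilabel olabel weight dst")

EPS = "<eps>"

def is_deterministic(arcs, start_states):
    """Return True if the FST satisfies the three conditions from the recap."""
    if len(start_states) != 1:          # unique start state
        return False
    seen = set()
    for arc in arcs:
        if arc.ilabel == EPS:           # no epsilon input labels
            return False
        key = (arc.src, arc.ilabel)
        if key in seen:                 # two arcs from one state share an input label
            return False
        seen.add(key)
    return True

# Two arcs leaving state 0 both read "b", so this fragment is non-deterministic.
lexicon = [Arc(0, "b", "bad", 0.0, 1), Arc(0, "b", "bead", 0.0, 7)]
# Once the "b" arc is shared, every state has at most one arc per input label.
shared = [
    Arc(0, "b", EPS, 0.0, 1),
    Arc(1, "a", "bad", 0.0, 2),
    Arc(1, "e", "bead", 0.0, 3),
]

print(is_deterministic(lexicon, [0]))
print(is_deterministic(shared, [0]))
```

Note that epsilon is only forbidden on the input side; the shared arc above still carries an epsilon output.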

SLIDE 3

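[Figure: dictionary WFST with one separate path per word. The first arc of each path outputs the word and the remaining arcs output eps:
bad: b:bad a:eps d:eps
cab: c:cab a:eps b:eps
bead: b:bead e:eps a:eps d:eps
cede: c:cede e:eps d:eps e:eps
decade: d:decade e:eps c:eps a:eps d:eps e:eps]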

Example: Dictionary WFST

SLIDE 4

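[Figure: determinized dictionary WFST. Words with a common first letter share the first arc, and the word is output on the first arc that identifies it:
b:eps, then a:bad d:eps or e:bead a:eps d:eps
c:eps, then a:cab b:eps or e:cede d:eps e:eps
d:decade e:eps c:eps a:eps d:eps e:eps]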

Determinized Dictionary WFST
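
The effect of determinization on this lexicon can be reproduced by sharing common prefixes. This is a sketch of that idea, not the general weighted determinization algorithm; it assumes no word is a prefix of another, and the function names are hypothetical:

```python
WORDS = ["bad", "cab", "bead", "cede", "decade"]

def linear_arcs(words):
    """One separate chain per word; the word is output on the first arc."""
    arcs = []
    for w in words:
        for i, ch in enumerate(w):
            arcs.append((ch, w if i == 0 else "eps"))
    return arcs

def trie_arcs(words):
    """Share common prefixes; output the word on the first arc whose prefix is unique."""
    arcs = {}  # prefix -> output label of the arc that reaches this prefix
    for w in words:
        emitted = False
        for i in range(1, len(w) + 1):
            prefix = w[:i]
            unique = sum(v.startswith(prefix) for v in words) == 1
            if prefix not in arcs:
                arcs[prefix] = w if (unique and not emitted) else "eps"
            emitted = emitted or unique
    return [(prefix[-1], out) for prefix, out in arcs.items()]

print(len(linear_arcs(WORDS)), "arcs before sharing prefixes")
print(len(trie_arcs(WORDS)), "arcs after sharing prefixes")
print([arc for arc in trie_arcs(WORDS) if arc[1] != "eps"])
```

The word labels move from the first arc of each path to the arc where the word becomes unambiguous, which matches the labels a:bad, e:bead, a:cab, e:cede and d:decade in the determinized figure.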

SLIDE 5

[Figure: minimized dictionary WFST with fewer states than the determinized version. Arc labels: b:eps, c:eps, d:decade, a:bad, e:bead, a:cab, e:cede, e:eps, d:eps, a:eps, b:eps, d:eps, c:eps, e:eps, a:eps]

Minimized Dictionary WFST

SLIDE 6

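WFST-based ASR System

Acoustic Indices → [Acoustic Models] → Triphones → [Context Transducer] → Monophones → [Pronunciation Model] → Words → [Language Model] → Word Sequence

(Bracketed items are the system components; the other items are the representations passed between them.)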

SLIDE 7

WFST-based ASR System

H

[Figure: one 3-state HMM for each triphone (a/a_b, b/a_b, ..., x/y_z). Each HMM is written as an FST over acoustic-state labels, with arcs such as f0:a+a+b, f1:ε, f2:ε, f3:ε, f4:ε, f5:ε, f6:ε. FST union + closure of these FSTs gives the resulting FST H]

SLIDE 8

WFST-based ASR System

C

C⁻¹:

[Figure: context-dependency transducer over the phones x and y, with states labelled ε,*; x,ε; x,x; x,y; y,ε; y,x; y,y and arcs labelled x:x/ε_ε, x:x/ε_x, x:x/ε_y, y:y/ε_ε, y:y/ε_x, y:y/ε_y, x:x/x_ε, x:x/x_x, x:x/x_y, y:y/x_ε, y:y/x_x, y:y/x_y, x:x/y_ε, x:x/y_x, x:x/y_y, y:y/y_ε, y:y/y_x, y:y/y_y]

Arc labels: “monophone : phone / left-context_right-context”

Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002

SLIDE 9

WFST-based ASR System

L

[Figure, panel (b): pronunciation lexicon L with one path per word:
data: d:data/1, then ey:ε/0.5 or ae:ε/0.5, then t:ε/0.3 or dx:ε/0.7, then ax:ε/1
dew: d:dew/1, then uw:ε/1]

Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002
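
The arc weights on the "data" path of L determine a probability for each pronunciation variant. A minimal sketch, assuming the weights are probabilities that multiply along a path (the `DATA` structure is a hypothetical transcription of the figure's arc labels):

```python
from itertools import product

# Alternatives at each position of the "data" path, as (phone, weight).
# Treating the weights as probabilities is an assumption of this sketch.
DATA = [
    [("d", 1.0)],
    [("ey", 0.5), ("ae", 0.5)],
    [("t", 0.3), ("dx", 0.7)],
    [("ax", 1.0)],
]

def pronunciations(positions):
    """Enumerate every phone sequence with the product of its arc weights."""
    result = {}
    for choice in product(*positions):
        phones = " ".join(phone for phone, _ in choice)
        prob = 1.0
        for _, weight in choice:
            prob *= weight
        result[phones] = prob
    return result

probs = pronunciations(DATA)
for phones, p in probs.items():
    print(f"{phones}: {p:.2f}")
```

Two binary choices give four pronunciations of "data", and under the stated assumption their probabilities sum to 1.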

SLIDE 10

WFST-based ASR System

G

[Figure: language model G over the words the, birds, animals, boy, are, were, is, walking, with weighted arcs birds/0.404, animals/1.789, boy/1.789, are/0.693, were/0.693]
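
The weights on the arcs of G look like negative natural-log probabilities. That reading is an assumption, not something the slide states; the sketch below checks whether it is plausible. It also assumes, for the last step, that "birds" and "are" lie on one path.

```python
import math

# Arc weights read off the figure of G; the -log interpretation is an assumption.
weights = {"birds": 0.404, "animals": 1.789, "boy": 1.789, "are": 0.693, "were": 0.693}

probs = {word: math.exp(-cost) for word, cost in weights.items()}
for word, p in probs.items():
    print(f"{word}: cost {weights[word]:.3f} -> probability {p:.3f}")

# Under this reading, costs add along a path exactly where probabilities multiply.
path_cost = weights["birds"] + weights["are"]
path_prob = probs["birds"] * probs["are"]
print(f"cost {path_cost:.3f} corresponds to probability {math.exp(-path_cost):.3f}")
```

Under this interpretation the three nouns have probabilities close to 2/3, 1/6 and 1/6, and the two verbs close to 1/2 each, which is consistent with normalized language model probabilities.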

SLIDE 11

Constructing the Decoding Graph

H: Acoustic Models, C: Context Transducer, L: Pronunciation Model, G: Language Model

“Weighted Finite State Transducers in Speech Recognition”, Mohri et al., Computer Speech & Language, 2002

Decoding graph, D = H ⚬ C ⚬ L ⚬ G

Construct decoding search graph using H ⚬ C ⚬ L ⚬ G that maps acoustic states to word sequences
Carefully construct D using optimization algorithms: D = min(det(H ⚬ det(C ⚬ det(L ⚬ G))))
Decode test utterance O by aligning acceptor X (corresponding to O) with H ⚬ C ⚬ L ⚬ G:

W* = arg min_{W = out[π]} X ⚬ H ⚬ C ⚬ L ⚬ G

where π is a path in the composed FST, out[π] is the output label sequence of π, and the minimization is over the weight of π
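
The arg min above is a minimum-weight path search. The following sketch runs Dijkstra's algorithm on a hypothetical toy graph standing in for the composed FST; the states, labels and costs are made up for illustration, and real decoders use pruned search rather than exact search.

```python
import heapq

# Hypothetical toy graph: state -> list of (next_state, output_label, cost).
GRAPH = {
    0: [(1, "the", 0.0)],
    1: [(2, "birds", 0.404), (2, "boy", 1.789)],
    2: [(3, "are", 0.693), (3, "is", 0.9)],
    3: [],
}
FINAL = {3}

def best_path(graph, start, final):
    """Dijkstra search for the minimum-cost path; returns (cost, output labels)."""
    heap = [(0.0, start, ())]
    done = set()
    while heap:
        cost, state, words = heapq.heappop(heap)
        if state in final:
            return cost, list(words)
        if state in done:
            continue
        done.add(state)
        for nxt, label, weight in graph[state]:
            heapq.heappush(heap, (cost + weight, nxt, words + (label,)))
    return None

cost, words = best_path(GRAPH, 0, FINAL)
print(words, round(cost, 3))
```

Returning at the first final state popped from the heap is only correct because all costs here are non-negative.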

SLIDE 12

Constructing the Decoding Graph

Structure of X (derived from O):

[Figure: X is a chain of links. Each link has parallel weighted arcs labelled f0, f1, ..., f500, ..., f1000, e.g. f0:10.578, f1:14.221, f500:8.123, f1000:5.678 in the first link]

SLIDE 13

Constructing the Decoding Graph

Each f_k maps to a distinct triphone HMM state j
Weights of arcs in the i-th chain link correspond to observation probabilities b_j(o_i)
X is a very large FST which is never explicitly constructed!
H ⚬ C ⚬ L ⚬ G is typically traversed dynamically (search algorithms will be covered later in the semester)
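
One way to avoid building X explicitly is to generate its links one at a time. This is a minimal sketch with hypothetical function names, and it assumes that arc costs are negative log observation probabilities (the slides only say the weights correspond to b_j(o_i)):

```python
import math

def frame_arcs(obs_probs):
    """Arcs of one chain link: (input label f_j, cost -log b_j(o_i)) per HMM state j."""
    return [(f"f{j}", -math.log(p)) for j, p in enumerate(obs_probs)]

def acceptor_x(frames):
    """Yield the links of X one at a time instead of storing the whole chain."""
    for i, obs_probs in enumerate(frames):
        yield i, frame_arcs(obs_probs)

# Hypothetical observation probabilities b_j(o_i) for 2 chain links and 3 HMM states.
frames = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
links = list(acceptor_x(frames))
for i, arcs in links:
    print(i, [(label, round(cost, 3)) for label, cost in arcs])
```

A decoder would consume each link as it is produced and discard it afterwards, so memory does not grow with the number of links times the number of HMM states.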

SLIDE 14

Impact of WFST Optimizations

40K NAB Evaluation Set ’95 (83% word accuracy)

network                  states        transitions
G                        1,339,664     3,926,010
L ⚬ G                    8,606,729     11,406,721
det(L ⚬ G)               7,082,404     9,836,629
C ⚬ det(L ⚬ G)           7,273,035     10,201,269
det(H ⚬ C ⚬ L ⚬ G)       18,317,359    21,237,992

network                  x real-time
C ⚬ L ⚬ G                12.5
C ⚬ det(L ⚬ G)           1.2
det(H ⚬ C ⚬ L ⚬ G)       1.0
push(min(F))             0.7

Tables from http://www.openfst.org/twiki/pub/FST/FstHltTutorial/tutorial_part3.pdf
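
The size and speed figures are easier to compare as ratios. A small worked calculation, using only numbers copied from the tables:

```python
# Figures copied from the tables on this slide.
states = {"L ⚬ G": 8_606_729, "det(L ⚬ G)": 7_082_404}
rtf = {"C ⚬ L ⚬ G": 12.5, "push(min(F))": 0.7}

# Fraction of states removed by determinizing L ⚬ G.
state_reduction = 1 - states["det(L ⚬ G)"] / states["L ⚬ G"]
# Ratio of real-time factors between the unoptimized and fully optimized graphs.
speedup = rtf["C ⚬ L ⚬ G"] / rtf["push(min(F))"]

print(f"det(L ⚬ G) has {state_reduction:.1%} fewer states than L ⚬ G")
print(f"push(min(F)) decodes {speedup:.1f}x faster than C ⚬ L ⚬ G")
```

Determinizing L ⚬ G removes a bit under a fifth of the states, while the full optimization chain moves decoding from 12.5 times real time to faster than real time.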

SLIDE 15

Toolkits to work with finite-state machines

AT&T FSM Library (no longer supported)

http://www3.cs.stonybrook.edu/~algorith/implement/fsm/implement.shtml

RWTH FSA Toolkit

https://www-i6.informatik.rwth-aachen.de/~kanthak/fsa.html

Carmel

https://www.isi.edu/licensed-sw/carmel/

MIT FST Toolkit

http://people.csail.mit.edu/ilh/fst/

OpenFST Toolkit (actively supported)

http://www.openfst.org/twiki/bin/view/FST/WebHome

SLIDE 16

Brief Introduction to the OpenFST Toolkit

SLIDE 17

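Quick Intro to OpenFst (www.openfst.org)

[Figure: FST A with states 0, 1 and 2 (final), and arcs 0 → 1 labelled an:a, 1 → 2 labelled ε:n, 0 → 2 labelled a:a]

A.txt
    0 1 an a
    1 2 <eps> n
    0 2 a a
    2

Input alphabet (in.txt)
    <eps> 0
    an 1
    a 2

Output alphabet (out.txt)
    <eps> 0
    a 1
    n 2

The “0” label is reserved for epsilon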

SLIDE 18

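Quick Intro to OpenFst (www.openfst.org)

[Figure: weighted FST A with arcs 0 → 1 labelled an:a/0.5, 1 → 2 labelled ε:n/1.0, 0 → 2 labelled a:a/0.5, and final state 2 with weight 0.1]

A.txt
    0 1 an a 0.5
    1 2 <eps> n 1.0
    0 2 a a 0.5
    2 0.1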
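
The text format can be read with a few lines of code. The sketch below is not part of OpenFst; `parse_fst` and `path_cost` are hypothetical helpers. The contents of `A_TXT` are a reconstruction of the weighted A.txt on this slide, with state 0 as the start state being an assumption, and path costs are computed in the tropical semiring (weights add).

```python
A_TXT = """\
0 1 an a 0.5
1 2 <eps> n 1.0
0 2 a a 0.5
2 0.1
"""

def parse_fst(text):
    """Parse OpenFst-style text into (start, arcs, finals)."""
    arcs, finals, start = [], {}, None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if start is None:
            start = int(fields[0])        # source state of the first line is the start state
        if len(fields) <= 2:              # final state with optional final weight
            finals[int(fields[0])] = float(fields[1]) if len(fields) == 2 else 0.0
        else:                             # arc with optional weight (defaults to 0)
            src, dst, ilabel, olabel = fields[:4]
            weight = float(fields[4]) if len(fields) == 5 else 0.0
            arcs.append((int(src), int(dst), ilabel, olabel, weight))
    return start, arcs, finals

def path_cost(arcs, finals, path):
    """Tropical-semiring cost: sum of arc weights plus the final weight."""
    return sum(arcs[i][4] for i in path) + finals[arcs[path[-1]][1]]

start, arcs, finals = parse_fst(A_TXT)
cost_an = path_cost(arcs, finals, [0, 1])  # input "an", output "a n"
cost_a = path_cost(arcs, finals, [2])      # input "a", output "a"
print(round(cost_an, 3), round(cost_a, 3))
```

The path for "an" uses two arcs (0.5 + 1.0) and the path for "a" uses one (0.5); both end in state 2 and pick up its final weight of 0.1.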

SLIDE 19

Compiling & Printing FSTs

The text FSTs need to be “compiled” into binary objects before further use with OpenFst utilities

Command used to compile:

fstcompile --isymbols=in.txt --osymbols=out.txt A.txt A.fst

Get back the text FST using a print command with the binary file:

fstprint --isymbols=in.txt --osymbols=out.txt A.fst A.txt

SLIDE 20

Composing FSTs

The text FSTs need to be “compiled” into binary objects before further use with OpenFst utilities

Command used to compose:

fstcompose A.fst B.fst AB.fst

OpenFst requirement: at least one of the input FSTs should be appropriately sorted before composition (A by output label, or B by input label)

fstarcsort --sort_type=olabel A.fst |\  fstcompose - B.fst AB.fst
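
What composition computes can be sketched for the simplest case: epsilon-free transducers over the tropical semiring. This is an illustration of the idea, not the OpenFst implementation, and the `(start, arcs, finals)` tuple layout is a hypothetical representation. Epsilon handling, which A above would need, is deliberately left out.

```python
from collections import deque

def compose(a, b):
    """Compose epsilon-free transducers a and b over the tropical semiring.

    Each FST is (start, arcs, finals): arcs maps state -> list of
    (ilabel, olabel, weight, next_state); finals maps state -> final weight.
    """
    start = (a[0], b[0])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        qa, qb = state
        arcs[state] = []
        if qa in a[2] and qb in b[2]:
            finals[state] = a[2][qa] + b[2][qb]
        for i1, o1, w1, n1 in a[1].get(qa, []):
            for i2, o2, w2, n2 in b[1].get(qb, []):
                if o1 == i2:              # output of the first must match input of the second
                    nxt = (n1, n2)
                    arcs[state].append((i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return start, arcs, finals

# Hypothetical toy transducers: A maps x to y, B maps y to z.
A = (0, {0: [("x", "y", 0.5, 1)]}, {1: 0.0})
B = (0, {0: [("y", "z", 1.0, 1)]}, {1: 0.25})
AB = compose(A, B)
print(AB)
```

The nested loop compares every arc pair at a state pair. Sorting arcs by label lets matching arcs be located by search instead, which is the practical reason for the sorting requirement.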

SLIDE 21

Drawing FSTs

Small FSTs can be visualized easily using the draw tool:

fstdraw --isymbols=in.txt --osymbols=out.txt A.fst |\ dot -Tpdf > A.pdf

[Figure: rendered drawing of A, showing arcs an:a, <eps>:n and a:a]

SLIDE 22

FSTs can get very large!

SLIDE 23

Basics of Speech Production

SLIDE 24

Speech Production

Schematic representation of the vocal organs

Schematic from L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, 1993
Figure from http://www.phon.ucl.ac.uk/courses/spsci/iss/week6.php

SLIDE 25

Sound units

Phones are acoustically distinct units of speech
Phonemes are abstract linguistic units that impart different meanings in a given language

Minimal pair: pan vs. ban
Allophones are different acoustic realisations of the same phoneme
Phonetics is the study of speech sounds and how they’re produced
Phonology is the study of patterns of sounds in different languages

SLIDE 26

Vowels

Sounds produced with no obstruction to the flow of air through the vocal tract

Image from https://en.wikipedia.org/wiki/File:IPA_vowel_chart_2005.png

VOWEL QUADRILATERAL

SLIDE 27

Formants of vowels

Formants are resonance frequencies of the vocal tract (denoted by F1, F2, etc.)
F0 denotes the fundamental frequency of the periodic source (vibrating vocal folds)
Formant locations specify certain vowel characteristics

Image from: https://www.phon.ucl.ac.uk/courses/spsci/iss/week5.php

SLIDE 28

Spectrogram

A spectrogram is a sequence of spectra stacked together in time, with amplitude of the frequency components expressed as a heat map

Spectrograms of certain vowels: http://www.phon.ucl.ac.uk/courses/spsci/iss/week5.php

Praat (http://www.fon.hum.uva.nl/praat/) is a good toolkit to analyse speech signals (plot spectrograms, generate formants/pitch curves, etc.)
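
The "spectra stacked together in time" description can be made concrete with a short sketch. It uses a naive DFT and a rectangular window on a synthetic tone, with sample rate, frame length and hop chosen for illustration; tools such as Praat use an FFT and smoother windows.

```python
import cmath
import math

def spectrogram(signal, frame_len, hop):
    """Magnitude spectra of successive frames (rectangular window, naive DFT)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = []
        for k in range(frame_len // 2 + 1):        # non-negative frequencies only
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames

# Synthetic 1000 Hz tone sampled at 8000 Hz (values chosen for illustration).
RATE, FREQ, FRAME = 8000, 1000, 64
tone = [math.sin(2 * math.pi * FREQ * n / RATE) for n in range(256)]

spec = spectrogram(tone, FRAME, hop=32)
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spec]
peak_hz = [b * RATE / FRAME for b in peak_bins]
print(len(spec), "frames, peak frequency per frame:", peak_hz)
```

Each inner list is one column of the spectrogram. For a steady tone every frame peaks at the same frequency bin, whereas for a vowel the peaks would trace the formants over time.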

SLIDE 29

Consonants (voicing/place/manner)

“Consonants are made by restricting or blocking the airflow in some way, and may be voiced or unvoiced.” (J&M, Ch. 7)

SLIDE 30

Voiced/Unvoiced Sounds

Sounds made with vocal cords vibrating: voiced
E.g. /g/, /d/, etc.
All English vowel sounds are voiced
Sounds made without vocal cord vibration: voiceless
E.g. /k/, /t/, etc.

SLIDE 31

Consonants (voicing/place/manner)

“Consonants are made by restricting or blocking the airflow in some way, and may be voiced or unvoiced.” (J&M, Ch. 7)

Consonants can be labeled depending on
where the constriction is made
how the constriction is made

SLIDE 32

Place of articulation

Bilabial (both lips): [b], [p], [m], etc.

Labiodental (with lower lip and upper teeth): [f], [v], etc.

Interdental (tip of tongue between teeth): [θ] (thought), [ð] (this)

SLIDE 33

Manner of articulation

Plosive/Stop (airflow completely blocked followed by a release): [p], [g], [t], etc.

Fricative (constricted airflow): [f], [s], [th], etc.

Affricate (stop + fricative): [ch], [jh], etc.

Nasal (lowering velum): [n], [m], etc.

See realtime MRI productions of vowels and consonants here: http://sail.usc.edu/span/rtmri_ipa/je_2015.html
