WFSTs in ASR & Basics of Speech Production, Lecture 6, CS 753



SLIDE 1

Instructor: Preethi Jyothi

WFSTs in ASR & Basics of Speech Production

Lecture 6

CS 753

SLIDE 2

Determinization/Minimization: Recap

  • A (W)FST is deterministic if:
      • It has a unique start state
      • No two transitions from a state share the same input label
      • It has no epsilon input labels
  • Minimization finds an equivalent deterministic FST with the least number of states (and transitions)
  • For a deterministic weighted automaton, weight pushing + (unweighted) automata minimization leads to a minimal weighted automaton
  • Determinization and minimization are guaranteed to yield a deterministic/minimized WFSA only under some technical conditions characterising the automata (e.g. twins property) and the weight semiring (allowing for weight pushing)
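The three determinism conditions above can be checked mechanically. The following is a minimal sketch, assuming an FST is given as a list of start states plus a flat list of arcs; the `Arc` record and the function name are hypothetical and are not part of any toolkit.

```python
from collections import namedtuple

# Hypothetical arc record for this sketch (not an OpenFst type).
Arc = namedtuple("Arc", "src ilabel olabel dst")
EPS = "<eps>"

def is_deterministic(start_states, arcs):
    """Return True if the three conditions from the recap all hold."""
    # Condition 1: exactly one start state.
    if len(set(start_states)) != 1:
        return False
    seen = set()
    for arc in arcs:
        # Condition 3: no epsilon input labels.
        if arc.ilabel == EPS:
            return False
        # Condition 2: no two arcs leaving a state share an input label.
        key = (arc.src, arc.ilabel)
        if key in seen:
            return False
        seen.add(key)
    return True

# Two words starting with "b" leave state 0 on the same input label.
dictionary = [Arc(0, "b", "bad", 1), Arc(0, "b", "bead", 4)]
# Once the "b" prefix is shared, state 0 has a single b-arc.
shared = [Arc(0, "b", EPS, 1), Arc(1, "a", "bad", 2), Arc(1, "e", "bead", 3)]

print(is_deterministic([0], dictionary))  # False
print(is_deterministic([0], shared))      # True
```

Note that epsilon is only forbidden on the input side; the `shared` example still carries an epsilon output label.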

SLIDE 3

Example: Dictionary WFST

[Figure: letter-to-word dictionary WFST with one path per word: bad, cab, bead, cede, decade. The first arc of each path outputs the word (b:bad, c:cab, b:bead, c:cede, d:decade); every later arc outputs eps (e.g. a:eps, d:eps).]

SLIDE 4

Determinized Dictionary WFST

[Figure: the dictionary WFST after determinization. Words sharing a first letter share an arc (b:eps, c:eps), and each word is output on the first arc that distinguishes it (a:bad, e:bead, a:cab, e:cede, d:decade); the remaining arcs output eps.]
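For a plain word list, the determinized dictionary has the shape of a prefix tree in which each word is emitted on the first arc that only that word uses. The sketch below builds such a tree directly; it is an illustration of the resulting structure, not the general determinization algorithm, and it assumes distinct words where no word is a prefix of another. The function name and arc tuple layout are hypothetical.

```python
EPS = "<eps>"

def build_prefix_tree(words):
    """Return arcs (src, ilabel, olabel, dst) of a letter-to-word prefix tree."""
    # Count how many words share each prefix.
    prefix_count = {}
    for w in words:
        for i in range(1, len(w) + 1):
            prefix_count[w[:i]] = prefix_count.get(w[:i], 0) + 1

    state_of = {"": 0}  # prefix -> state id; the empty prefix is the start state
    arcs = []
    for w in words:
        emitted = False
        for i in range(1, len(w) + 1):
            prefix = w[:i]
            is_new = prefix not in state_of
            if is_new:
                state_of[prefix] = len(state_of)
            # Output the word on the first arc that belongs to this word alone.
            olabel = EPS
            if not emitted and prefix_count[prefix] == 1:
                olabel, emitted = w, True
            if is_new:
                arcs.append((state_of[w[:i - 1]], w[i - 1], olabel, state_of[prefix]))
    return arcs

arcs = build_prefix_tree(["bad", "cab", "bead", "cede", "decade"])
for src, ilabel, olabel, dst in arcs:
    print(src, dst, f"{ilabel}:{olabel}")
```

With the five words from the example, the arcs leaving the start state are b:eps, c:eps and d:decade, matching the labels on the slide.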

SLIDE 5

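Minimized Dictionary WFST

[Figure: the determinized dictionary WFST after minimization. The arcs b:eps, c:eps, d:decade, a:bad, e:bead, a:cab and e:cede are unchanged, while the eps-output arcs that spell the word endings are merged, leaving fewer states and arcs.]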

SLIDE 6

WFST-based ASR System

Acoustic Indices → [Acoustic Models] → Triphones → [Context Transducer] → Monophones → [Pronunciation Model] → Words → [Language Model] → Word Sequence

SLIDE 7

WFST-based ASR System

H: Acoustic Models (acoustic indices to triphones)

  • One 3-state HMM for each triphone: a/a_b, b/a_b, . . . , x/y_z
  • FST Union + Closure of these HMMs gives the resulting FST, H

[Figure: fragment of H with arcs labelled f0:a+a+b, f1:ε, f2:ε, f3:ε, f4:ε, f5:ε, f4:ε, f6:ε]

SLIDE 8

WFST-based ASR System

C: Context Transducer (triphones to monophones)

C⁻¹:

[Figure: C⁻¹ over a two-phone alphabet {x, y}. States: (ε,*), (x,ε), (x,x), (x,y), (y,ε), (y,x), (y,y). Arcs: x:x/ε_ε, x:x/ε_x, x:x/ε_y, y:y/ε_ε, y:y/ε_x, y:y/ε_y, x:x/x_ε, x:x/x_x, x:x/x_y, y:y/x_ε, y:y/x_x, y:y/x_y, x:x/y_ε, x:x/y_x, x:x/y_y, y:y/y_ε, y:y/y_x, y:y/y_y]

Arc labels: “monophone : phone / left-context_right-context”

Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002

SLIDE 9

WFST-based ASR System

L: Pronunciation Model (monophones to words)

[Figure (b): pronunciation transducer for “data” and “dew”. data: d:data/1, then ey:ε/0.5 or ae:ε/0.5, then t:ε/0.3 or dx:ε/0.7, then ax:ε/1. dew: d:dew/1, then uw:ε/1.]

Figure reproduced from “Weighted Finite State Transducers in Speech Recognition”, Mohri et al., 2002
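To see what the pronunciation model encodes, one can enumerate every accepting path and multiply the arc weights along it. This is a small sketch that assumes the weights in the figure are probabilities; the state numbering, with state 0 as the start state, is also an assumption of the sketch.

```python
EPS = "ε"

# Arcs read off the figure: (src, phone, word, prob, dst).
ARCS = [
    (0, "d", "data", 1.0, 1),
    (1, "ey", EPS, 0.5, 2),
    (1, "ae", EPS, 0.5, 2),
    (2, "t", EPS, 0.3, 3),
    (2, "dx", EPS, 0.7, 3),
    (3, "ax", EPS, 1.0, 4),
    (0, "d", "dew", 1.0, 5),
    (5, "uw", EPS, 1.0, 6),
]
FINAL = {4, 6}

def pronunciations(state=0, phones=(), words=(), prob=1.0):
    """Yield (word, phone string, probability) for every accepting path."""
    if state in FINAL:
        yield " ".join(words), " ".join(phones), prob
    for src, phone, word, p, dst in ARCS:
        if src == state:
            new_words = words if word == EPS else words + (word,)
            yield from pronunciations(dst, phones + (phone,), new_words, prob * p)

for word, phones, prob in pronunciations():
    print(f"{word:5s} {phones:12s} {prob:.2f}")
```

Under this reading, "data" has four pronunciations whose probabilities sum to 1, and "dew" has one.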

SLIDE 10

WFST-based ASR System

G: Language Model (words to word sequence)

[Figure: language model acceptor G with arcs the, birds/0.404, animals/1.789, boy/1.789, are/0.693, were/0.693, is, walking]
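The weights on G look like negative natural-log probabilities, which is the usual convention when decoding with arg min. The sketch below converts them back to probabilities under that assumption. That birds, animals and boy leave the same state is a guess from the figure labels, not something the slide states.

```python
import math

# Arc weights as printed on the slide, assumed to be -ln(probability).
after_the = {"birds": 0.404, "animals": 1.789, "boy": 1.789}
after_plural = {"are": 0.693, "were": 0.693}

def to_probs(costs):
    """Map each negative-log cost back to a probability."""
    return {word: math.exp(-cost) for word, cost in costs.items()}

for name, costs in [("after 'the'", after_the), ("after a plural noun", after_plural)]:
    probs = to_probs(costs)
    listing = ", ".join(f"{w}={p:.3f}" for w, p in probs.items())
    print(f"{name}: {listing} (sum {sum(probs.values()):.3f})")
```

Under this assumption each group sums to roughly 1, consistent with the weights being a rounded probability distribution over the next word.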

SLIDE 11

Constructing the Decoding Graph

H (Acoustic Models), C (Context Transducer), L (Pronunciation Model), G (Language Model)

Decoding graph, D = H ⚬ C ⚬ L ⚬ G

  • Construct the decoding search graph using H ⚬ C ⚬ L ⚬ G, which maps acoustic states to word sequences
  • Carefully construct D using optimization algorithms: D = min(det(H ⚬ det(C ⚬ det(L ⚬ G))))
  • Decode a test utterance O by aligning the acceptor X (corresponding to O) with H ⚬ C ⚬ L ⚬ G:

$$W^{*} = \operatorname*{arg\,min}_{W = \mathrm{out}[\pi]} \; X \circ H \circ C \circ L \circ G$$

where π is a path in the composed FST and out[π] is the output label sequence of π; that is, W* is the word sequence output by the lowest-weight path

“Weighted Finite State Transducers in Speech Recognition”, Mohri et al., Computer Speech & Language, 2002
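The decoding rule can be illustrated on a toy scale: compose an acceptor X with a small transducer standing in for H ⚬ C ⚬ L ⚬ G, then read the output labels off the cheapest accepting path. This sketch assumes the tropical semiring (costs add along a path, the minimum wins) and no epsilons on the matched side. The machines `X` and `T`, their costs, and both function names are made up for illustration; real decoders do not materialize the composition like this.

```python
import heapq

EPS = "<eps>"

def compose(a, b):
    """Compose machines given as (start, finals, {state: [(ilabel, olabel, cost, next)]})."""
    start = (a[0], b[0])
    arcs, finals, stack, seen = {}, set(), [start], {start}
    while stack:
        qa, qb = state = stack.pop()
        if qa in a[1] and qb in b[1]:
            finals.add(state)
        for i1, o1, w1, n1 in a[2].get(qa, []):
            for i2, o2, w2, n2 in b[2].get(qb, []):
                if o1 == i2:  # match A's output label with B's input label
                    nxt = (n1, n2)
                    arcs.setdefault(state, []).append((i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return start, finals, arcs

def best_path(fst):
    """Dijkstra over non-negative costs: (cost, output labels) of the cheapest accepting path."""
    start, finals, arcs = fst
    heap, done, counter = [(0.0, 0, start, ())], set(), 0
    while heap:
        cost, _, state, out = heapq.heappop(heap)
        if state in done:
            continue
        done.add(state)
        if state in finals:
            return cost, [o for o in out if o != EPS]
        for _, olabel, w, nxt in arcs.get(state, []):
            counter += 1  # tie-breaker so the heap never compares states
            heapq.heappush(heap, (cost + w, counter, nxt, out + (olabel,)))
    return None

# X: two-frame acceptor; each link scores the acoustic indices f0 and f1.
X = (0, {2}, {0: [("f0", "f0", 1.0, 1), ("f1", "f1", 2.0, 1)],
              1: [("f0", "f0", 1.5, 2), ("f1", "f1", 0.3, 2)]})
# T: toy stand-in for the decoding graph, mapping "f0 f0" to yes and "f1 f1" to no.
T = (0, {2}, {0: [("f0", "yes", 0.5, 1), ("f1", "no", 0.2, 3)],
              1: [("f0", EPS, 0.0, 2)],
              3: [("f1", EPS, 0.0, 2)]})

cost, words = best_path(compose(X, T))
print(words, f"{cost:.1f}")
```

Here the path for "yes" costs 3.0 and the path for "no" costs 2.5, so the arg min picks "no".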

SLIDE 12

Constructing the Decoding Graph

Structure of X (derived from O):

[Figure: X is a chain of links, one per frame of O; each link has parallel arcs f0, f1, …, f500, …, f1000, each with its own weight]

  Link 1:     f0:10.578   f1:14.221   …   f500:8.123    …   f1000:5.678
  Link 2:     f0:9.21     f1:5.645    …   f500:11.233   …   f1000:15.638
  Link 3:     f0:19.12    f1:13.45    …   f500:20.21    …   f1000:11.11
  …
  Last link:  f0:18.52    f1:12.33    …   f500:10.21    …   f1000:15.99

SLIDE 13

Constructing the Decoding Graph

  • Each f_k maps to a distinct triphone HMM state j
  • Weights of arcs in the i-th chain link correspond to observation probabilities b_j(o_i)
  • X is a very large FST which is never explicitly constructed!
  • H ⚬ C ⚬ L ⚬ G is typically traversed dynamically (search algorithms will be covered later in the semester)

[Figure: the acceptor X, a chain whose i-th link carries parallel arcs f0, f1, …, f1000 weighted for frame o_i]
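The chain structure of X is simple enough to write down for a toy input. The sketch below assumes the arc weights are negative log observation probabilities, which the slide does not state explicitly, and uses a hypothetical `build_acceptor` helper. As the slide notes, a real system never builds X explicitly.

```python
import math

def build_acceptor(obs_probs):
    """Build the chain acceptor X as arcs (src, label, cost, dst).

    obs_probs[i][k] is the probability of frame i under the HMM state
    that the label f_k stands for; link i connects state i to state i + 1.
    """
    arcs = []
    for i, frame in enumerate(obs_probs):
        for k, p in enumerate(frame):
            arcs.append((i, f"f{k}", -math.log(p), i + 1))
    return arcs

# Two frames, three HMM states (made-up probabilities).
toy = [[0.5, 0.25, 0.25],
       [0.1, 0.8, 0.1]]
for src, label, cost, dst in build_acceptor(toy):
    print(f"{src} -> {dst}  {label}:{cost:.3f}")
```

The acceptor has one more state than there are frames, and every link has one arc per HMM state, which is why X becomes very large for real utterances and state inventories.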

SLIDE 14

Impact of WFST Optimizations

40K NAB Evaluation Set ’95 (83% word accuracy)

network                  states        transitions
G                        1,339,664     3,926,010
L ⚬ G                    8,606,729     11,406,721
det(L ⚬ G)               7,082,404     9,836,629
C ⚬ det(L ⚬ G)           7,273,035     10,201,269
det(H ⚬ C ⚬ L ⚬ G)       18,317,359    21,237,992

network                  x real-time
C ⚬ L ⚬ G                12.5
C ⚬ det(L ⚬ G)           1.2
det(H ⚬ C ⚬ L ⚬ G)       1.0
push(min(F))             0.7

Tables from http://www.openfst.org/twiki/pub/FST/FstHltTutorial/tutorial_part3.pdf

SLIDE 15

Toolkits to work with finite-state machines

  • AT&T FSM Library (no longer supported): http://www3.cs.stonybrook.edu/~algorith/implement/fsm/implement.shtml
  • RWTH FSA Toolkit: https://www-i6.informatik.rwth-aachen.de/~kanthak/fsa.html
  • Carmel: https://www.isi.edu/licensed-sw/carmel/
  • MIT FST Toolkit: http://people.csail.mit.edu/ilh/fst/
  • OpenFst Toolkit (actively supported): http://www.openfst.org/twiki/bin/view/FST/WebHome

SLIDE 16

Brief Introduction to the OpenFst Toolkit

SLIDE 17

Quick Intro to OpenFst (www.openfst.org)

[Figure: FST with states 0, 1, 2 and arcs an:a (0 → 1), ε:n (1 → 2), a:a (0 → 2)]

A.txt:
    0 1 an a
    1 2 <eps> n
    0 2 a a
    2

Input alphabet (in.txt):
    <eps> 0
    an 1
    a 2

Output alphabet (out.txt):
    <eps> 0
    a 1
    n 2

The “0” label is reserved for epsilon.

SLIDE 18

Quick Intro to OpenFst (www.openfst.org)

[Figure: weighted FST with arcs an:a/0.5 (0 → 1), ε:n/1.0 (1 → 2), a:a/0.5 (0 → 2) and final state 2 with weight 0.1]

A.txt:
    0 1 an a 0.5
    1 2 <eps> n 1.0
    0 2 a a 0.5
    2 0.1
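The text format is easy to read programmatically: an arc line is "source destination input-label output-label [weight]", a final-state line is "state [weight]", and the source state of the first line is the start state. The parser below is a minimal sketch of that layout, not part of OpenFst; it assumes symbolic labels and treats a missing weight as 0.0, the tropical-semiring default.

```python
A_TXT = """\
0 1 an a 0.5
1 2 <eps> n 1.0
0 2 a a 0.5
2 0.1
"""

def parse_fst_text(text):
    """Parse OpenFst-style text into (start, arcs, finals)."""
    start, arcs, finals = None, [], {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) >= 4:
            # Arc line: src dst ilabel olabel [weight]
            src, dst, ilabel, olabel = fields[:4]
            weight = float(fields[4]) if len(fields) > 4 else 0.0
            arcs.append((int(src), int(dst), ilabel, olabel, weight))
            first_state = int(src)
        else:
            # Final-state line: state [weight]
            first_state = int(fields[0])
            finals[first_state] = float(fields[1]) if len(fields) > 1 else 0.0
        if start is None:
            start = first_state  # the first line's state is the start state
    return start, arcs, finals

start, arcs, finals = parse_fst_text(A_TXT)
print("start:", start)
print("arcs:", arcs)
print("finals:", finals)
```

A structure like this is convenient for inspecting small machines before compiling them with the command-line tools.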

SLIDE 19

Compiling & Printing FSTs

The text FSTs need to be “compiled” into binary objects before further use with OpenFst utilities

  • Command used to compile:

fstcompile --isymbols=in.txt --osymbols=out.txt A.txt A.fst

  • Get back the text FST using a print command with the binary file:

fstprint --isymbols=in.txt --osymbols=out.txt A.fst A.txt

SLIDE 20

Composing FSTs


  • Command used to compose:

fstcompose A.fst B.fst AB.fst

  • OpenFst requirement: One or both of the input FSTs should be appropriately sorted before composition (A on its output labels, or B on its input labels)

fstarcsort --sort_type=olabel A.fst |\
 fstcompose - B.fst AB.fst

SLIDE 21

Drawing FSTs

Small FSTs can be visualized easily using the draw tool:

fstdraw --isymbols=in.txt --osymbols=out.txt A.fst | dot -Tpdf > A.pdf

[Figure: rendering of A.fst, with arcs an:a, <eps>:n and a:a]

SLIDE 22

FSTs can get very large!

SLIDE 23

Basics of Speech Production

SLIDE 24

Speech Production

[Figure: schematic representation of the vocal organs]

Schematic from L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, 1993
Figure from http://www.phon.ucl.ac.uk/courses/spsci/iss/week6.php

SLIDE 25

Sound units

  • Phones are acoustically distinct units of speech
  • Phonemes are abstract linguistic units that impart different meanings in a given language
      • Minimal pair: pan vs. ban
  • Allophones are different acoustic realisations of the same phoneme
  • Phonetics is the study of speech sounds and how they’re produced
  • Phonology is the study of patterns of sounds in different languages
SLIDE 26

Vowels

  • Sounds produced with no obstruction to the flow of air through the vocal tract

[Figure: vowel quadrilateral. Image from https://en.wikipedia.org/wiki/File:IPA_vowel_chart_2005.png]

SLIDE 27

Formants of vowels

  • Formants are resonance frequencies of the vocal tract (denoted by F1, F2, etc.)
  • F0 denotes the fundamental frequency of the periodic source (vibrating vocal folds)
  • Formant locations specify certain vowel characteristics

Image from: https://www.phon.ucl.ac.uk/courses/spsci/iss/week5.php

SLIDE 28

Spectrogram

  • A spectrogram is a sequence of spectra stacked together in time, with the amplitude of the frequency components expressed as a heat map
  • Spectrograms of certain vowels: http://www.phon.ucl.ac.uk/courses/spsci/iss/week5.php
  • Praat (http://www.fon.hum.uva.nl/praat/) is a good toolkit to analyse speech signals (plot spectrograms, generate formants/pitch curves, etc.)
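The "spectra stacked in time" description translates directly into code: cut the signal into overlapping frames, window each frame, and take the magnitude of its discrete Fourier transform. The sketch below uses only the standard library and a naive DFT, so it is suitable for tiny signals only; the frame length, hop size and test tone are arbitrary choices for illustration, and tools such as Praat do this far more efficiently.

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Return a list of magnitude spectra, one per frame (time x frequency)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Hann window to reduce spectral leakage at the frame edges.
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, x in enumerate(frame)]
        spectrum = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            s = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(windowed))
            spectrum.append(abs(s))
        frames.append(spectrum)
    return frames

# A pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = spectrogram(tone)

# Each bin k corresponds to the frequency k * sample_rate / frame_len.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames; strongest frequency:", peak_bin * sample_rate / 64, "Hz")
```

Plotting `spec` as an image, with time on one axis and frequency bin on the other, gives the heat map described above; for a vowel, the bright horizontal bands would be the formants.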

SLIDE 29

Consonants (voicing/place/manner)

  • “Consonants are made by restricting or blocking the airflow in some way, and may be voiced or unvoiced.” (J&M, Ch. 7)

SLIDE 30

Voiced/Unvoiced Sounds

  • Sounds made with vocal cords vibrating: voiced
      • E.g. /g/, /d/, etc.
      • All English vowel sounds are voiced
  • Sounds made without vocal cord vibration: voiceless
      • E.g. /k/, /t/, etc.
SLIDE 31

Consonants (voicing/place/manner)

  • “Consonants are made by restricting or blocking the airflow in some way, and may be voiced or unvoiced.” (J&M, Ch. 7)
  • Consonants can be labeled depending on
      • where the constriction is made
      • how the constriction is made
SLIDE 32

Place of articulation

  • Bilabial (both lips): [b], [p], [m], etc.
  • Labiodental (with lower lip and upper teeth): [f], [v], etc.
  • Interdental (tip of tongue between teeth): [θ] (thought), [ð] (this)

SLIDE 33

Manner of articulation

  • Plosive/Stop (airflow completely blocked followed by a release): [p], [g], [t], etc.
  • Fricative (constricted airflow): [f], [s], [th], etc.
  • Affricate (stop + fricative): [ch], [jh], etc.
  • Nasal (lowering velum): [n], [m], etc.
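Voicing, place and manner together act as a small feature vector for each consonant. The table below is a minimal sketch covering only the bilabial, labiodental and interdental consonants listed on the slides; the voicing values are standard phonetics rather than taken from the slides, and the helper names are hypothetical.

```python
# consonant -> (voicing, place, manner)
FEATURES = {
    "p": ("voiceless", "bilabial", "plosive"),
    "b": ("voiced", "bilabial", "plosive"),
    "m": ("voiced", "bilabial", "nasal"),
    "f": ("voiceless", "labiodental", "fricative"),
    "v": ("voiced", "labiodental", "fricative"),
    "θ": ("voiceless", "interdental", "fricative"),
    "ð": ("voiced", "interdental", "fricative"),
}

def differ_only_in_voicing(a, b):
    """True if two consonants share place and manner but not voicing."""
    va, pa, ma = FEATURES[a]
    vb, pb, mb = FEATURES[b]
    return va != vb and pa == pb and ma == mb

def with_features(voicing=None, place=None, manner=None):
    """List the consonants matching every feature that is given."""
    wanted = (voicing, place, manner)
    return [c for c, feats in FEATURES.items()
            if all(w is None or w == f for w, f in zip(wanted, feats))]

print(differ_only_in_voicing("p", "b"))  # True
print(with_features(place="bilabial"))   # ['p', 'b', 'm']
print(with_features(voicing="voiced", manner="fricative"))  # ['v', 'ð']
```

The first check ties back to the minimal pair pan vs. ban: the two words differ in a single sound, and that sound differs only in voicing.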

See realtime MRI productions of vowels and consonants here: http://sail.usc.edu/span/rtmri_ipa/je_2015.html