Automatic Speech Recognition (CS753)
Lecture 20: Pronunciation Modeling
Instructor: Preethi Jyothi
Oct 16, 2017
Pronunciation Dictionary/Lexicon
- Pronunciation model/dictionary/lexicon: Lists one or more pronunciations for a word
- Typically derived from language experts: a sequence of phones written down for each word
- Dictionary construction involves:
- 1. Selecting which words to include in the dictionary
- 2. Listing the pronunciation of each word, checking for multiple pronunciations (a toy sketch follows)
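To make the two steps concrete, here is a minimal sketch of such a lexicon in Python; the words and ARPAbet-style phone strings are illustrative:

```python
# A toy pronunciation lexicon: each word maps to one or more phone
# sequences (ARPAbet-style phones; all entries are illustrative).
lexicon = {
    "the":   [["dh", "ah"], ["dh", "iy"]],  # a word with two pronunciations
    "sense": [["s", "eh", "n", "s"]],
    "solve": [["s", "ao", "l", "v"]],
}

def pronunciations(word):
    """Return all listed pronunciations; [] marks an OOV word."""
    return lexicon.get(word.lower(), [])

print(pronunciations("the"))  # [['dh', 'ah'], ['dh', 'iy']]
```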
Grapheme-based models
Graphemes vs. Phonemes
- Instead of a pronunciation dictionary, one could represent a pronunciation as a sequence of graphemes (or letters). That is, model at the grapheme level.
- Useful technique for low-resourced/under-resourced languages
- Main advantages:
- 1. Avoid the need for phone-based pronunciations
- 2. Avoid the need for a phone alphabet
- 3. Works pretty well for languages with a direct link between graphemes (letters) and phonemes (sounds); see the sketch below
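As a minimal sketch of the idea (assuming a language written in a simple alphabetic script), a graphemic "lexicon" can be derived mechanically from the spelling:

```python
# Minimal sketch of a graphemic lexicon: the "pronunciation" of a word
# is just its letter sequence, so no phone set or expert dictionary is
# needed. Real systems (e.g. the unicode-based graphemic systems cited
# on the next slide) also normalize and decompose characters; this toy
# version only lowercases.
def graphemic_pronunciation(word):
    return list(word.lower())

for w in ["sense", "solve"]:
    print(w, "->", graphemic_pronunciation(w))
# sense -> ['s', 'e', 'n', 's', 'e']
# solve -> ['s', 'o', 'l', 'v', 'e']
```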
Grapheme-based ASR
Image from: Gales et al., Unicode-based graphemic systems for limited resource languages, ICASSP 2015
WER (%) by decoding method (Vit: Viterbi; CN: confusion network; CNC: confusion-network combination of the two systems, hence one value per language):

Language           ID    System      Vit    CN     CNC
Kurmanji Kurdish   205   Phonetic    67.6   65.8   64.1
                         Graphemic   67.0   65.3
Tok Pisin          207   Phonetic    41.8   40.6   39.4
                         Graphemic   42.1   41.1
Cebuano            301   Phonetic    55.5   54.0   52.6
                         Graphemic   55.5   54.2
Kazakh             302   Phonetic    54.9   53.5   51.5
                         Graphemic   54.0   52.7
Telugu             303   Phonetic    70.6   69.1   67.5
                         Graphemic   70.9   69.5
Lithuanian         304   Phonetic    51.5   50.2   48.3
                         Graphemic   50.9   49.5
Grapheme to phoneme (G2P) conversion
- Produce a pronunciation (phoneme sequence) given a written word (grapheme sequence)
- Learn G2P mappings from a pronunciation dictionary
- Useful for:
- ASR systems in languages with no pre-built lexicons
- Speech synthesis systems
- Deriving pronunciations for out-of-vocabulary (OOV) words
G2P conversion (I)
- One popular paradigm: joint sequence models [BN12]
- Grapheme and phoneme sequences are first aligned using an EM-based algorithm
- This results in a sequence of graphones (joint grapheme-phoneme tokens)
- N-gram models are trained on these graphone sequences (see the sketch below)
- A WFST-based implementation of such a joint graphone model is available [Phonetisaurus]
[BN12] Bisani & Ney, "Joint sequence models for grapheme-to-phoneme conversion", Speech Communication, 2012. [Phonetisaurus] J. Novak, Phonetisaurus Toolkit.
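As an illustration of the graphone idea, here is a minimal Python sketch. The grapheme-phoneme alignments are hand-specified toy examples (in practice they are learned with the EM-based procedure above), and the n-gram model is an unsmoothed bigram:

```python
from collections import Counter

# A graphone is a joint grapheme/phoneme token, e.g. ("se", "s").
# The alignments below are illustrative, not from a real aligner.
aligned = [
    [("s", "s"), ("e", "eh"), ("n", "n"), ("se", "s")],   # "sense"
    [("s", "s"), ("o", "ao"), ("l", "l"), ("ve", "v")],   # "solve"
]

# Train a bigram model over graphone tokens (with BOS/EOS markers).
bigrams, unigrams = Counter(), Counter()
for seq in aligned:
    toks = [("<s>", "<s>")] + seq + [("</s>", "</s>")]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

def bigram_prob(prev, cur):
    """Maximum-likelihood bigram probability (no smoothing)."""
    return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob(("s", "s"), ("e", "eh")))  # 0.5 in this toy corpus
```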
G2P conversion (II)
- Neural network based methods are the new state of the art for G2P:
- Bidirectional LSTM-based networks using a CTC output layer [Rao15]; comparable to N-gram models
- Incorporating alignment information [Yao15]; beats N-gram models
- No alignment: encoder-decoder with attention [Toshniwal16]; beats the above systems
LSTM + CTC for G2P conversion [Rao15]
Model                                 Word Error Rate (%)
Galescu and Allen [4]                 28.5
Chen [7]                              24.7
Bisani and Ney [2]                    24.5
Novak et al. [6]                      24.4
Wu et al. [12]                        23.4
5-gram FST                            27.2
8-gram FST                            26.5
Unidirectional LSTM with Full-delay   30.1
DBLSTM-CTC 128 Units                  27.9
DBLSTM-CTC 512 Units                  25.8
DBLSTM-CTC 512 + 5-gram FST           21.3
[Rao15] Grapheme-to-phoneme conversion using LSTM RNNs, ICASSP 2015
Seq2seq models (with alignment information [Yao15])
Method                                 PER (%)   WER (%)
encoder-decoder LSTM                   7.53      29.21
encoder-decoder LSTM (2 layers)        7.63      28.61
uni-directional LSTM                   8.22      32.64
uni-directional LSTM (window size 6)   6.58      28.56
bi-directional LSTM                    5.98      25.72
bi-directional LSTM (2 layers)         5.84      25.02
bi-directional LSTM (3 layers)         5.45      23.55
Data      Method                 PER (%)   WER (%)
CMUDict   past results [20]      5.88      24.53
          bi-directional LSTM    5.45      23.55
NetTalk   past results [20]      8.26      33.67
          bi-directional LSTM    7.38      30.77
Pronlex   past results [20,21]   6.78      27.33
          bi-directional LSTM    6.51      26.69
[Yao15] Sequence-to-sequence neural net models for G2P conversion, Interspeech 2015
Encoder-decoder + attention for G2P [Toshniwal16]
[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.
[Figure: encoder-decoder with attention. Encoder states h1 … hTg are computed over input graphemes x1 … xTg; attention weights αt give a context vector ct, which is combined with the decoder state dt to predict the output phoneme yt.]
Data      Method                                           PER (%)
CMUDict   BiDir LSTM + Alignment [6]                       5.45
          DBLSTM-CTC [5]                                   —
          DBLSTM-CTC + 5-gram model [5]                    —
          Encoder-decoder + global attn                    5.04 ± 0.03
          Encoder-decoder + local-m attn                   5.11 ± 0.03
          Encoder-decoder + local-p attn                   5.39 ± 0.04
          Ensemble of 5 [Encoder-decoder + global attn]    4.69
Pronlex   BiDir LSTM + Alignment [6]                       6.51
          Encoder-decoder + global attn                    6.24 ± 0.1
          Encoder-decoder + local-m attn                   5.99 ± 0.11
          Encoder-decoder + local-p attn                   6.49 ± 0.06
NetTalk   BiDir LSTM + Alignment [6]                       7.38
          Encoder-decoder + global attn                    7.14 ± 0.72
          Encoder-decoder + local-m attn                   7.13 ± 0.11
          Encoder-decoder + local-p attn                   8.41 ± 0.19
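To make the attention mechanism concrete, here is a minimal numpy sketch of one decoder step of global dot-product attention, in the figure's notation (encoder states h, decoder state dt, weights αt, context ct). The vectors are random stand-ins, and [Toshniwal16] additionally explores local attention variants:

```python
import numpy as np

# One decoder step of global (dot-product) attention, with random
# vectors standing in for real encoder/decoder states.
rng = np.random.default_rng(0)
Tg, dim = 5, 8                       # grapheme length, hidden size
H = rng.standard_normal((Tg, dim))   # encoder states, one row per h_j
d_t = rng.standard_normal(dim)       # current decoder state

scores = H @ d_t                     # e_tj = h_j . d_t
alpha_t = np.exp(scores - scores.max())
alpha_t /= alpha_t.sum()             # softmax over input positions
c_t = alpha_t @ H                    # context c_t: weighted sum of h_j

print(alpha_t.round(3), c_t.shape)   # weights over Tg inputs, (8,)
```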
Sub-phonetic feature-based models
Pronunciation Model
- Each word is a sequence of phones
- Tends to be highly language dependent
Phone-Based vs. Articulatory Features
- Articulatory features: parallel streams of articulator movements
- Based on the theory of articulatory phonology [1]

[Figure: phone-based representation of "sense": PHONE stream s eh n s]

[1] C. P. Browman and L. Goldstein, Phonology '86
Pronunciation Model

[Figure: articulatory feature streams for "sense" (PHONE stream s eh n s), over variables LIP-LOC, LIP-OPEN, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, GLOTTIS, VELUM:
LIP: open/labial
TON.TIP: critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar
TON.BODY: mid/uvular, mid/palatal, mid/uvular
GLOTTIS: open, critical, open
VELUM: closed, open, closed]
Articulatory Features
- Parallel streams of articulator movements
- Based on the theory of articulatory phonology [1]
- Example: pronunciations for the word "sense"
- Simple asynchrony across feature streams can appear as many phone alterations (a toy sketch follows the figure below)

[Adapted from Livescu '05]
[Figure: CANONICAL vs. OBSERVED feature streams for "sense".
CANONICAL (PHONE: s eh n s): LIP open/labial; TT critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar; TB mid/uvular, mid/palatal, mid/uvular; GLOTTIS open, critical, open; VELUM closed, open, closed.
OBSERVED: asynchrony between the velum and tongue-tip streams nasalizes the vowel (eh → eh_n) and inserts an epenthetic [t].]
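A minimal Python sketch of how stream asynchrony produces such variants; the frame-level values and the frame-to-phone mapping are made up for illustration:

```python
# Two articulatory streams for "sense" as frame-by-frame values
# (8 frames, 2 per canonical phone; values abbreviated).
TT  = ["crit"] * 2 + ["mid"] * 2 + ["clo"] * 2 + ["crit"] * 2      # tongue tip
VEL = ["clo"] * 2 + ["clo", "open", "open", "open"] + ["clo"] * 2  # velum

def phones(tt, vel):
    """Map each (tongue tip, velum) frame pair to a phone label."""
    mapping = {("crit", "clo"): "s", ("mid", "clo"): "eh",
               ("mid", "open"): "eh_n", ("clo", "open"): "n",
               ("clo", "clo"): "t"}   # oral closure with velum shut ~ [t]
    return [mapping[pair] for pair in zip(tt, vel)]

print(phones(TT, VEL))        # ['s','s','eh','eh_n','n','n','s','s']
VEL_early = VEL[1:] + ["clo"] # velum gestures shifted one frame earlier
print(phones(TT, VEL_early))  # ['s','s','eh_n','eh_n','n','t','s','s']: epenthetic [t]
```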
Dynamic Bayesian Networks (DBNs)
- Provides a natural framework to efficiently encode multiple streams of articulatory features
- Simple DBN with three random variables in each time frame (a filtering sketch follows the figure)
[Figure: random variables At, Bt, Ct unrolled over frames t-1, t, t+1.]
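As a minimal sketch of inference in such a DBN, the following assumes binary variables, hand-picked conditional probability tables, and dependencies A_t on A_{t-1}, B_t on (B_{t-1}, A_t), and C_t on (C_{t-1}, B_t); it does exact forward filtering over the joint state:

```python
import numpy as np
from itertools import product

# Made-up CPTs for illustration.
p_A = np.array([[0.9, 0.1], [0.2, 0.8]])                              # P(A_t | A_{t-1})
def p_B(b_prev, a): return [0.8, 0.2] if b_prev == a else [0.3, 0.7]  # P(B_t | B_{t-1}, A_t)
def p_C(c_prev, b): return [0.7, 0.3] if c_prev == b else [0.4, 0.6]  # P(C_t | C_{t-1}, B_t)

states = list(product([0, 1], repeat=3))   # joint (a, b, c) states
belief = np.full(len(states), 1 / 8)       # uniform prior over frame 0

for _ in range(3):                         # unroll three frames
    new = np.zeros_like(belief)
    for i, (a0, b0, c0) in enumerate(states):
        for j, (a1, b1, c1) in enumerate(states):
            new[j] += belief[i] * p_A[a0, a1] * p_B(b0, a1)[b1] * p_C(c0, b1)[c1]
    belief = new

print(belief.round(3), belief.sum())       # filtered joint, sums to 1
```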
DBN model of pronunciation

[Figure: DBN for the word "solve" (phones s ao l v, word positions 1 2 3 …). Variables include Word, Posn, Phn, Trans, Prev-Phone; per-articulator phone variables L-Phn, T-Phn, G-Phn with asynchrony lags L-Lag, T-Lag, G-Lag; feature variables Lip-Op, TT-Op, Glot; and observed surface feature values surLip-Op, surTT-Op, surGlot.]

[P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech '12]
Factorized DBN model [1]: Cascade of Finite State Machines

[Figure: the DBN variables are partitioned into five sets, Set1-Set5, each compiled into a finite-state machine.]
[Figure: the resulting cascade of machines F1-F5. Along the cascade, the symbol alphabets move from Word, through (Phn, Trans, Posn), the lags (L-Lag, T-Lag, G-Lag), (Prev-Phn, Phn) with the per-articulator phones (L-Phn, T-Phn, G-Phn), and the feature values (Lip-Op, TT-Op, Glot), to the observed surface values (surLip-Op, surTT-Op, surGlot).]
[1] P. Jyothi, E. Fosler-Lussier, K. Livescu, Interspeech '12
Weighted Finite State Machine
- Decoding: For input X, find the path with minimum cost:

    a* = argmin_{path a} w_α(X, a)

  where w_α(X, a) is the weight of path a on input X, and α are the learned parameters.
- Linear model: w_α(X, a) = α · φ(X, a), where φ is a feature function.

[Figure: a weighted FST with arcs labeled input:output/weight, e.g. x1:y1/1.5, x2:y2/1.3, x3:y3/2.0, x4:y4/0.6]
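A minimal sketch of such decoding as a shortest-path search over a toy machine; the arc weights stand in for precomputed values of α · φ on each arc:

```python
import heapq

# Arcs: state -> list of (next_state, input, output, weight).
# The tiny machine below is illustrative.
arcs = {
    0: [(1, "x1", "y1", 1.5), (1, "x2", "y2", 1.3)],
    1: [(2, "x3", "y3", 2.0), (2, "x4", "y4", 0.6)],
}
start, final = 0, 2

def decode(arcs, start, final):
    """Return (cost, output labels) of the minimum-cost path."""
    heap = [(0.0, start, [])]
    best = {}
    while heap:
        cost, state, out = heapq.heappop(heap)
        if state == final:
            return cost, out            # first pop of final = min cost
        if best.get(state, float("inf")) <= cost:
            continue                    # already reached more cheaply
        best[state] = cost
        for nxt, _inp, outp, w in arcs.get(state, []):
            heapq.heappush(heap, (cost + w, nxt, out + [outp]))
    return float("inf"), []

print(decode(arcs, start, final))  # (1.9, ['y2', 'y4'])
```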
Discriminative Training
- Online discriminative training algorithm to learn α
- Similar to the structured perceptron [Collins '02]: each training sample gives a "decoded path" and a "correct path"; update α to bias towards the correct path (a toy sketch follows)
- Use a large-margin training algorithm adapted to work with a cascade of finite state machines [1]
[1] P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech '13
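A minimal sketch of the perceptron-style update; the feature function here simply counts labels on a path, whereas real feature functions over (X, a) would be richer:

```python
from collections import Counter

def phi(path):
    """Toy feature vector: counts of labels on the path."""
    return Counter(path)

def perceptron_update(alpha, correct, decoded, lr=1.0):
    """Move alpha toward the correct path's features, away from the decoded path's."""
    if decoded == correct:
        return alpha                    # no mistake, no update
    updated = Counter(alpha)
    for f, v in phi(correct).items():
        updated[f] += lr * v            # reward correct-path features
    for f, v in phi(decoded).items():
        updated[f] -= lr * v            # penalize decoded-path features
    return updated

alpha = Counter()
alpha = perceptron_update(alpha, ["s", "eh", "n", "s"], ["s", "eh", "n", "t", "s"])
print(alpha)  # shared features cancel; the spurious 't' gets weight -1
```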