SLIDE 1

Automatic Speech Recognition (CS753)

Lecture 20: Pronunciation Modeling

Instructor: Preethi Jyothi, Oct 16, 2017

SLIDE 2

Pronunciation Dictionary/Lexicon

  • Pronunciation model/dictionary/lexicon: lists one or more pronunciations for a word (see the sketch below)

  • Typically derived from language experts: a sequence of phones is written down for each word

  • Dictionary construction involves:
    1. Selecting which words to include in the dictionary
    2. Determining the pronunciation of each word (also checking for multiple pronunciations)
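
A minimal sketch of how such a lexicon might be stored, using CMUdict-style ARPAbet phones (stress markers omitted); the particular entries and the helper function are illustrative:

```python
# A toy pronunciation lexicon: each word maps to one or more
# pronunciations, each a sequence of ARPAbet phones (stress omitted).
LEXICON = {
    "sense":  [["S", "EH", "N", "S"]],
    "either": [["IY", "DH", "ER"],             # "ee-ther"
               ["AY", "DH", "ER"]],            # "eye-ther"
    "tomato": [["T", "AH", "M", "EY", "T", "OW"],
               ["T", "AH", "M", "AA", "T", "OW"]],
}

def pronunciations(word):
    """Return all listed pronunciations for a word ([] if out-of-vocabulary)."""
    return LEXICON.get(word.lower(), [])

print(pronunciations("either"))  # both pronunciation variants are returned
```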

SLIDE 3

Grapheme-based models

SLIDE 4

Graphemes vs. Phonemes

  • Instead of a pronunciation dictionary, one could represent a pronunciation as a sequence of graphemes (or letters). That is, model at the grapheme level (see the sketch below).

  • Useful technique for low-resourced/under-resourced languages

  • Main advantages:
    1. Avoids the need for phone-based pronunciations
    2. Avoids the need for a phone alphabet
    3. Works pretty well for languages with a direct link between graphemes (letters) and phonemes (sounds)
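
To make the idea concrete, in a grapheme-based system a word's "pronunciation" is just its letter sequence, so the lexicon can be generated mechanically; a trivial illustrative sketch:

```python
# In a grapheme-based system, a word's "pronunciation" is simply its
# letter sequence, so no expert-built dictionary is needed.
def graphemic_pronunciation(word):
    return list(word.lower())

print(graphemic_pronunciation("sense"))  # ['s', 'e', 'n', 's']
# This works best when spelling tracks sound closely; for English,
# compare "though" and "tough", whose shared letters sound different.
```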

SLIDE 5

Grapheme-based ASR

Image from: Gales et al., Unicode-based graphemic systems for limited resource languages, ICASSP 2015

WER (%) under Viterbi decoding (Vit), confusion network decoding (CN), and confusion network combination (CNC):

Language (ID)           System      Vit    CN     CNC
Kurmanji Kurdish (205)  Phonetic    67.6   65.8   64.1
                        Graphemic   67.0   65.3
Tok Pisin (207)         Phonetic    41.8   40.6   39.4
                        Graphemic   42.1   41.1
Cebuano (301)           Phonetic    55.5   54.0   52.6
                        Graphemic   55.5   54.2
Kazakh (302)            Phonetic    54.9   53.5   51.5
                        Graphemic   54.0   52.7
Telugu (303)            Phonetic    70.6   69.1   67.5
                        Graphemic   70.9   69.5
Lithuanian (304)        Phonetic    51.5   50.2   48.3
                        Graphemic   50.9   49.5

SLIDE 6

Graphemes vs. Phonemes

  • Instead of a pronunciation dictionary, one could represent a pronunciation as a sequence of graphemes (or letters)

  • Useful technique for low-resourced/under-resourced languages

  • Main advantages:
    1. Avoids the need for phone-based pronunciations
    2. Avoids the need for a phone alphabet
    3. Works pretty well for languages with a direct link between graphemes (letters) and phonemes (sounds)

SLIDE 7

Grapheme to phoneme (G2P) conversion

SLIDE 8

Grapheme to phoneme (G2P) conversion

  • Produce a pronunciation (phoneme sequence) given a written word (grapheme sequence)

  • Learn G2P mappings from a pronunciation dictionary

  • Useful for:
    • ASR systems in languages with no pre-built lexicons
    • Speech synthesis systems
    • Deriving pronunciations for out-of-vocabulary (OOV) words
SLIDE 9

G2P conversion (I)

  • One popular paradigm: joint sequence models [BN12]

  • Grapheme and phoneme sequences are first aligned using an EM-based algorithm

  • This results in a sequence of graphones (joint grapheme-phoneme tokens); see the example below

  • N-gram models are trained on these graphone sequences

  • WFST-based implementation of such a joint graphone model [Phonetisaurus]

[BN12] Bisani & Ney, “Joint sequence models for grapheme-to-phoneme conversion”, Speech Communication 2012
[Phonetisaurus] J. Novak, Phonetisaurus Toolkit
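
An illustrative graphone segmentation for the word "phoenix" (hand-made for illustration; a real system learns the segmentation with EM):

```python
# Graphones are joint grapheme-phoneme tokens. This alignment is
# hand-made for illustration, not the output of EM training.
graphones = [("ph", "F"), ("oe", "IY"), ("n", "N"), ("i", "IH"), ("x", "K S")]

def phoneme_side(graphone_seq):
    """Read off the pronunciation from a graphone sequence."""
    return " ".join(p for _, p in graphone_seq)

print(phoneme_side(graphones))  # F IY N IH K S
# An n-gram model is trained over such graphone tokens just as over words;
# decoding a new spelling under that model yields its likely pronunciation.
```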

SLIDE 10

G2P conversion (II)

  • Neural network-based methods are the new state of the art for G2P:

  • Bidirectional LSTM-based networks using a CTC output layer [Rao15]. Comparable to N-gram models. (A sketch follows this list.)

  • Incorporate alignment information [Yao15]. Beats N-gram models.

  • No alignment; encoder-decoder with attention. Beats the above systems [Toshniwal16].
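
A minimal PyTorch-style sketch of the [Rao15] idea (bidirectional LSTM with a CTC output layer); all layer sizes and names here are illustrative, not the paper's configuration:

```python
import torch.nn as nn

# Sketch of a bidirectional-LSTM G2P model with a CTC output layer, in the
# spirit of [Rao15]. Inputs are grapheme ids; outputs are per-position
# log-probabilities over phonemes plus the CTC blank symbol.
class G2PBiLSTMCTC(nn.Module):
    def __init__(self, n_graphemes=30, n_phonemes=40, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_graphemes, emb)
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_phonemes + 1)  # +1 for the CTC blank

    def forward(self, grapheme_ids):                 # (batch, T_g) int tensor
        h, _ = self.lstm(self.embed(grapheme_ids))   # (batch, T_g, 2*hidden)
        return self.out(h).log_softmax(-1)           # log-probs, as CTC expects

# Training pairs these outputs with nn.CTCLoss(blank=n_phonemes), which
# marginalizes over all alignments of the phoneme string to input positions.
```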

SLIDE 11

LSTM + CTC for G2P conversion [Rao15]

Model                                 WER (%)
Galescu and Allen [4]                 28.5
Chen [7]                              24.7
Bisani and Ney [2]                    24.5
Novak et al. [6]                      24.4
Wu et al. [12]                        23.4
5-gram FST                            27.2
8-gram FST                            26.5
Unidirectional LSTM with full delay   30.1
DBLSTM-CTC, 128 units                 27.9
DBLSTM-CTC, 512 units                 25.8
DBLSTM-CTC 512 + 5-gram FST           21.3

[Rao15] Grapheme-to-phoneme conversion using LSTM RNNs, ICASSP 2015

SLIDE 12

G2P conversion (II)

  • Neural network-based methods are the new state of the art for G2P:

  • Bidirectional LSTM-based networks using a CTC output layer [Rao15]. Comparable to N-gram models.

  • Incorporate alignment information [Yao15]. Beats N-gram models.

  • No alignment; encoder-decoder with attention. Beats the above systems [Toshniwal16].

SLIDE 13

Seq2seq models (with alignment information [Yao15])

Method                                 PER (%)   WER (%)
encoder-decoder LSTM                   7.53      29.21
encoder-decoder LSTM (2 layers)        7.63      28.61
uni-directional LSTM                   8.22      32.64
uni-directional LSTM (window size 6)   6.58      28.56
bi-directional LSTM                    5.98      25.72
bi-directional LSTM (2 layers)         5.84      25.02
bi-directional LSTM (3 layers)         5.45      23.55

Data      Method                 PER (%)   WER (%)
CMUDict   past results [20]      5.88      24.53
          bi-directional LSTM    5.45      23.55
NetTalk   past results [20]      8.26      33.67
          bi-directional LSTM    7.38      30.77
Pronlex   past results [20,21]   6.78      27.33
          bi-directional LSTM    6.51      26.69

[Yao15] Sequence-to-sequence neural net models for G2P conversion, Interspeech 2015

SLIDE 14

G2P conversion (II)

  • Neural network-based methods are the new state of the art for G2P:

  • Bidirectional LSTM-based networks using a CTC output layer [Rao15]. Comparable to N-gram models.

  • Incorporate alignment information [Yao15]. Beats N-gram models.

  • No alignment; encoder-decoder with attention. Beats the above systems [Toshniwal16].

[Rao15] Grapheme-to-phoneme conversion using LSTM RNNs, ICASSP 2015
[Yao15] Sequence-to-sequence neural net models for G2P conversion, Interspeech 2015
[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016

SLIDE 15

Encoder-decoder + attention for G2P [Toshniwal16]

[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.

[Figure: encoder-decoder with attention. The encoder maps inputs x1, x2, x3, …, xTg to hidden states h1, h2, h3, …, hTg; an attention layer computes weights αt and a context vector ct; the decoder state dt emits the output yt.]
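
Reading the figure's symbols as standard global attention, the per-step computation is as follows (a sketch; score(·,·) is the usual learned scoring function, and W, b are illustrative output-layer parameters, not notation from [Toshniwal16]):

```latex
\alpha_t(i) = \frac{\exp\!\big(\mathrm{score}(d_t, h_i)\big)}{\sum_{j=1}^{T_g} \exp\!\big(\mathrm{score}(d_t, h_j)\big)},
\qquad
c_t = \sum_{i=1}^{T_g} \alpha_t(i)\, h_i,
\qquad
P(y_t \mid y_{<t}, x) = \mathrm{softmax}\big(W [d_t; c_t] + b\big)
```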

SLIDE 16

Encoder-decoder + attention for G2P [Toshniwal16]

[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.

[Figure: the same encoder-decoder + attention architecture as on slide 15.]

Data      Method                                          PER (%)
CMUDict   BiDir LSTM + Alignment [6]                      5.45
          DBLSTM-CTC [5]                                  –
          DBLSTM-CTC + 5-gram model [5]                   –
          Encoder-decoder + global attn                   5.04 ± 0.03
          Encoder-decoder + local-m attn                  5.11 ± 0.03
          Encoder-decoder + local-p attn                  5.39 ± 0.04
          Ensemble of 5 [Encoder-decoder + global attn]   4.69
Pronlex   BiDir LSTM + Alignment [6]                      6.51
          Encoder-decoder + global attn                   6.24 ± 0.1
          Encoder-decoder + local-m attn                  5.99 ± 0.11
          Encoder-decoder + local-p attn                  6.49 ± 0.06
NetTalk   BiDir LSTM + Alignment [6]                      7.38
          Encoder-decoder + global attn                   7.14 ± 0.72
          Encoder-decoder + local-m attn                  7.13 ± 0.11
          Encoder-decoder + local-p attn                  8.41 ± 0.19

SLIDE 17

Sub-phonetic feature-based models

SLIDE 18

Pronunciation Model

Phone-based:
  • Each word is a sequence of phones
  • Tends to be highly language dependent

Articulatory features:
  • Parallel streams of articulator movements
  • Based on the theory of articulatory phonology¹

[Figure: PHONE stream for the word “sense”: s eh n s]

¹ [C. P. Browman and L. Goldstein, Phonology ’86]

SLIDE 19

Pronunciation Model

[Figure: parallel articulatory feature streams for the word “sense” (PHONE stream: s eh n s):

  LIP        open/labial
  TON.TIP    critical/alveolar · mid/alveolar · closed/alveolar · critical/alveolar
  TON.BODY   mid/uvular · mid/palatal · mid/uvular
  GLOTTIS    open · critical · open
  VELUM      closed · open · closed

Feature stream names: LIP-LOC, LIP-OPEN, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, VELUM, GLOTTIS]

Articulatory features: parallel streams of articulator movements, based on the theory of articulatory phonology¹

SLIDE 20

Example: Pronunciations for the word “sense”

Simple asynchrony across feature streams can appear as many phone alterations.

[Adapted from Livescu ’05]

[Figure: CANONICAL feature streams (PHONE: s eh n s):

  LIP    open/labial
  TB     mid/uvular · mid/palatal · mid/uvular
  TT     critical/alveolar · mid/alveolar · closed/alveolar · critical/alveolar
  GLOT   open · critical · open
  VEL    closed · open · closed

OBSERVED feature streams (e.g., PHONE: s eh_n n t s): the same stream values, slightly desynchronized, surface as a nasalized vowel (eh_n) and an epenthetic t.]

SLIDE 21

Dynamic Bayesian Networks (DBNs)

  • Provides a natural framework to efficiently encode multiple streams of articulatory features

  • Simple DBN with three random variables in each time frame:

[Figure: variables A_t, B_t, C_t in each frame, unrolled over frames t−1, t, t+1]
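
As a sketch of the factorization such a DBN encodes (the slide's exact edge structure is not reproduced here; Pa(V_t) denotes the parents of V_t, restricted to the current and previous frames):

```latex
P(A_{1:T}, B_{1:T}, C_{1:T}) = \prod_{t=1}^{T} \prod_{V \in \{A,B,C\}} P\big(V_t \mid \mathrm{Pa}(V_t)\big),
\qquad
\mathrm{Pa}(V_t) \subseteq \{A_{t-1}, B_{t-1}, C_{t-1}, A_t, B_t, C_t\} \setminus \{V_t\}
```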

SLIDE 22

DBN model of pronunciation

[Figure: DBN model of pronunciation, unrolled over frames, shown for the word “solve” (phones s ao l v, word positions 1 2 3 …). Variables include Word, Posn, Phn, Prev-Phone, and Trans; per-articulator phone variables L-Phn, T-Phn, G-Phn with lag variables L-Lag, T-Lag, G-Lag; articulatory feature variables Lip-Op, TT-Op, Glot; and observed surface feature values sur Lip-Op, sur TT-Op, sur Glot.]

[P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech ’12]

SLIDE 23

Factorized DBN model¹

[Figure: the same DBN variables (Word, Posn, Phn, Prev-Phone, Trans; L-Phn, T-Phn, G-Phn; L-Lag, T-Lag, G-Lag; Lip-Op, TT-Op, Glot and their surface values) partitioned into five sets, Set1 through Set5.]


SLIDE 24

Cascade of Finite State Machines

[Figure: the factorized model as a cascade of five finite-state machines:
  F1: Word → (Phn, Trans, Posn)
  F2: (Phn, Trans, Posn) → (Phn, Trans, L-Lag, T-Lag, G-Lag)
  F3: (Prev-Phn, Phn, L-Lag, T-Lag, G-Lag) → (L-Phn, T-Phn, G-Phn)
  F4: (L-Phn, T-Phn, G-Phn) → (Lip-op, TT-op, Glot)
  F5: (Lip-op, TT-op, Glot) → (surLip-op, surTT-op, surGlot)]

¹ [P. Jyothi, E. Fosler-Lussier, K. Livescu, Interspeech ’12]

SLIDE 25

Weighted Finite State Machine

[Figure: a weighted finite-state machine with arcs labeled input:output/weight, e.g. x1:y1/1.5, x2:y2/1.3, x3:y3/2.0, x4:y4/0.6]

SLIDE 26

Weighted Finite State Machine

Decoding: for input X, find the path with minimum cost:

    a* = argmin_{path a} w_α(X, a)

where w_α(X, a) is the weight of path a on input X and α are learned parameters.

Linear model: w_α(X, a) = α · φ(X, a), where φ is a feature function.

[Figure: the same weighted FSM, with arcs x1:y1/1.5, x2:y2/1.3, x3:y3/2.0, x4:y4/0.6]
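
A toy sketch of min-cost path decoding over such a machine. The arc weights are the figure's, but the four-arc topology is invented for the example; in the lecture's linear model each arc weight would itself come from α · φ for that arc:

```python
import heapq

# Toy weighted FSM: arcs are (src, input, output, weight, dst).
# State 0 is initial; state 3 is final. Topology invented for the example.
ARCS = [(0, "x1", "y1", 1.5, 1), (0, "x2", "y2", 1.3, 2),
        (1, "x3", "y3", 2.0, 3), (2, "x4", "y4", 0.6, 3)]
FINAL = {3}

def decode(arcs, final, start=0):
    """Dijkstra: return (min total weight, output labels) of the best path."""
    frontier = [(0.0, start, [])]
    settled = set()
    while frontier:
        cost, state, outs = heapq.heappop(frontier)
        if state in settled:
            continue
        settled.add(state)
        if state in final:
            return cost, outs
        for src, _, out, w, dst in arcs:
            if src == state:
                heapq.heappush(frontier, (cost + w, dst, outs + [out]))
    return None

print(decode(ARCS, FINAL))  # best path 0 -> 2 -> 3: cost ~1.9, outputs ['y2', 'y4']
```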

SLIDE 27

Discriminative Training

  • Online discriminative training algorithm to learn α

  • Similar to the structured perceptron [Collins ’02]: each training sample gives a “decoded path” and a “correct path”; update α to bias towards the correct path (a sketch follows below)

  • Use a large-margin training algorithm adapted to work with a cascade of finite state machines¹

[Figure: the same weighted FSM, with arcs x1:y1/1.5, x2:y2/1.3, x3:y3/2.0, x4:y4/0.6]

¹ [P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech ’13]
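
A sketch of the basic perceptron-style update described above; the large-margin, cascade-aware algorithm of [Jyothi et al. ’13] is more involved. Here `decode` and `phi` are hypothetical stand-ins for the FSM decoder (argmin over paths of α · φ) and the feature function:

```python
# Structured-perceptron-style update: decode with the current weights, and
# if the decoded path differs from the correct one, shift alpha so the
# correct path becomes cheaper and the decoded path costlier.
# `decode` and `phi` are hypothetical stand-ins, not the lecture's code.
def perceptron_update(alpha, X, correct_path, decode, phi, lr=1.0):
    decoded = decode(X, alpha)
    if decoded != correct_path:
        for f, v in phi(X, correct_path).items():
            alpha[f] = alpha.get(f, 0.0) - lr * v   # lower cost of correct path
        for f, v in phi(X, decoded).items():
            alpha[f] = alpha.get(f, 0.0) + lr * v   # raise cost of decoded path
    return alpha
```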