SLIDE 1

Joint Learning of Phonetic Units and Word Pronunciations for ASR

Chia-ying (Jackie) Lee, Yu Zhang and James Glass

Spoken Language Systems Group, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA

SLIDES 2-4

World Language Map

Data source: http://www.ethnologue.com/

  • Roughly 7,000 living languages all around the world
  • Only 2% are supported by automatic speech recognition (ASR) technology

  Region      # of living languages
  Americas    1,060
  Africa      2,146
  Europe      284
  Asia        2,304
  Pacific     1,311

SLIDES 5-10

2% Language Barrier

  • Conventional ASR training is expensive
  • Requires a lot of expert knowledge

  Phonetic inventory: [b] [p] [k] [ae] [iy] ...
  Lexicon: big: [b I g], cat: [k ae t], ...
    (require linguistic expert knowledge; difficult to collect)
  Annotated speech: "hello world" ...
    (easier to generate by non-experts)

SLIDES 11-14

Towards ASR Training without Experts

  • Infer the lexicon and phonetic units from transcribed speech: the phonetic inventory and lexicon require linguistic expert knowledge and are difficult to collect, while annotated speech is easier to generate by non-experts

SLIDES 15-18

Discover Pronunciation Lexicon

  • Learn word pronunciations from transcribed speech

  Utterance: I need to fly to Texas
  Phones: [ay] [n] [iy] [d] [t] [ux] [f] [l] [ay] [t] [ux] [t] [e] [k] [s] [ax] [s]
  Induced entries: I: [ay], need: [n iy d], to: [t ux], fly: [f l ay], ...

SLIDES 19-21

Without Linguistic Knowledge

  • Can we discover the word pronunciations?

  ང་གzགས་པོ་sང་ད་གམེད། གས་rེ་ཆེ།

SLIDES 22-24

Challenges

  I need to fly to Texas
  [ay] [n] [iy] [d] [t] [ux] [f] [l] [ay] [t] [ux] [t] [e] [k] [s] [ax] [s]

  • Latent phone sequence
  • Latent letter-to-sound (L2S) mapping rules

SLIDES 25-35

Hierarchical Bayesian Model

  • Unknown phone inventory: modeled with an HMM-based mixture model of K HMMs θ1, θ2, θ3, ..., θK, one per phonetic unit (θk: an HMM; e.g. [s], [iy], [z], [k], ...)
  • Unknown L2S rules: modeled as weights over the HMMs, with one weight vector associated with each letter (πc, πs, ...)
  • Unknown phone sequence
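
To make the mixture concrete, here is a minimal sketch of an HMM-based mixture model in Python with NumPy (not the authors' implementation): a few toy HMMs with Gaussian emissions stand in for the phonetic units θk, and each letter owns a weight vector π over them. The sizes, the 2-D features, and the left-to-right topology are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, S, D = 8, 3, 2            # assumed sizes: 8 units, 3 HMM states, 2-D features

class PhoneHMM:
    """A tiny left-to-right HMM with Gaussian emissions, standing in for theta_k."""
    def __init__(self):
        self.means = rng.normal(size=(S, D))   # one emission mean per state
        self.stay = 0.6                        # self-transition probability

    def sample_frames(self):
        frames, s = [], 0
        while s < S:                           # walk left to right through the states
            frames.append(rng.normal(self.means[s], 0.1))
            if rng.random() > self.stay:       # leave the current state
                s += 1
        return np.array(frames)

hmms = [PhoneHMM() for _ in range(K)]          # theta_1 ... theta_K: the phone inventory

# One weight vector over the K HMMs per letter (pi_c, pi_s, ...); a symmetric
# Dirichlet draw stands in for the learned L2S weights here.
pi = {ch: rng.dirichlet(np.ones(K)) for ch in "abcdefghijklmnopqrstuvwxyz"}

k = rng.choice(K, p=pi["s"])                   # pick a unit for the letter "s"
x = hmms[k].sample_frames()                    # emit speech frames from its HMM
print(f"letter 's' -> unit {k}, {len(x)} frames")
```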

SLIDES 36-79

Generative Process

  For each letter li of the transcription (e.g. "red sox"):

  • Step 1: Generate the number of phones that each letter maps to (ni): ni ~ 𝜚li, where 𝜚li ~ Dir(η) is a 3-dim categorical distribution (a letter maps to 0, 1, or 2 phones)
  • Step 2: Generate the phone label (ci,p) for every phone that a letter maps to, 1 ≤ p ≤ ni: ci,p ~ πli, where πli ~ Dir(γ) is a weight vector over the HMMs θ1, ..., θK
  • Step 3: Generate speech (xt): each phone label ci,p selects the HMM θci,p, which emits the speech frames xt for that phone

  For "red sox", the slides sample ni = (r:1, e:1, d:1, space:0, s:1, o:1, x:2) and phone labels ci,p = (r:3, e:1, d:17, s:2, o:19, x:56 2).
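
The three steps fit in a few lines. Below is a compact sketch of the generative process under illustrative assumptions (symmetric Dirichlet priors, K = 60 units, and a 1-D Gaussian with a random duration standing in for each HMM θk):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 60                      # assumed number of HMMs / phonetic units
eta, gamma = 1.0, 1.0       # Dirichlet hyperparameters (illustrative values)
letters = list("red sox")

# Per-letter parameters: rho_l ~ Dir(eta) over {0, 1, 2} phones,
# pi_l ~ Dir(gamma) over the K units.
rho = {l: rng.dirichlet(eta * np.ones(3)) for l in set(letters)}
pi = {l: rng.dirichlet(gamma * np.ones(K)) for l in set(letters)}
means = rng.normal(size=K)  # stand-in for the HMMs theta_1..theta_K

n, c, frames = [], [], []
for l in letters:
    n_i = int(rng.choice(3, p=rho[l]))                   # Step 1: phones per letter
    c_i = [int(rng.choice(K, p=pi[l])) for _ in range(n_i)]  # Step 2: phone labels
    for k in c_i:                                        # Step 3: emit frames x_t
        dur = rng.integers(3, 8)                         # placeholder for HMM dwell time
        frames.append(rng.normal(means[k], 0.1, size=dur))
    n.append(n_i); c.append(c_i)

print("n_i per letter:", dict(zip(letters, n)))
print("c_{i,p} per letter:", dict(zip(letters, c)))
```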

SLIDES 80-85

Context-dependent L2S Rules

  • Take context into account for learning L2S mapping rules
  • More specific rules: condition the weights on a letter trigram, e.g. πsox for the letter o in the context s_x, rather than the single-letter πo (likewise 𝜚sox ~ DP(γ, 𝜚) for the phone-count distribution)
  • Back-off mechanism through hierarchy: ci ~ πsox, with πsox ~ Dir(απo), πo ~ Dir(λβ), β ~ Dir(𝛿)
  • View πo as the prior of πsox
  • If sox appears frequently, πsox approaches the empirical distribution of phone labels in that context
  • If sox is rarely observed, πsox backs off to πo
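
The back-off behavior falls out of the Dirichlet posterior mean. A sketch under the simplifying assumption that πo is fixed and we observe N phone labels in the context sox: E[πsox | counts] = (counts + α·πo) / (N + α), which moves from πo toward the empirical distribution as N grows.

```python
import numpy as np

rng = np.random.default_rng(2)
K, alpha = 10, 5.0                      # illustrative size and concentration
pi_o = rng.dirichlet(np.ones(K))        # parent (single-letter) distribution for "o"

def posterior_mean(counts, alpha, pi_o):
    """E[pi_sox | counts] when pi_sox ~ Dir(alpha * pi_o) and the counts are
    multinomial draws from pi_sox: (counts + alpha*pi_o) / (N + alpha)."""
    return (counts + alpha * pi_o) / (counts.sum() + alpha)

few = np.zeros(K); few[3] = 1           # "sox" rarely observed: stays near pi_o
print(np.abs(posterior_mean(few, alpha, pi_o) - pi_o).sum())

many = np.zeros(K); many[3] = 1000      # "sox" frequent: approaches the counts
print(posterior_mean(many, alpha, pi_o)[3])   # close to 1
```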

SLIDE 86

Graphical Model

(Graphical model figure: plates i = 1 ... L over letters, p = 1 ... ni over phones, t = 1 ... di over frames, and K over HMMs; variables li, ni, ci,p, xt; parameters 𝜚l on a G×G×G plate and πl,n,p on G×G and G×{n,p} plates with 1 ≤ p ≤ n, 1 ≤ n ≤ 2; β; θk; θ0; hyperparameters γ, η, 𝛿, λ, α.)

  G : the set of graphemes
  l : sequence of three graphemes / observed graphemes
  d : phone duration
  n : number of phones a grapheme maps to
  L : total number of graphemes
  K : total number of HMMs
  c : phone id
  𝜚l : 3-dim categorical distribution
  πl,n,p, β : K-dim categorical distributions
  θk : an HMM
  θ0 : HMM prior
  𝛿, λ, α : concentration parameters
  x : observed speech

SLIDES 87-93

Inference

  • Two kinds of unknowns: latent model parameters (𝜚l, πl,n,p, β, θk) and regular latent variables (ni, ci,p)
  • Procedure: 20,000 iterations
  • Initialize the model parameters by sampling from the prior, then alternate between sampling the latent variables given the parameters and sampling the parameters given the latent variables

SLIDE 94

Inference

  • Procedure: 10,000 iterations, with block-sampling of the alignment variables ni and ci,p
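
A control-flow skeleton of this alternating (Gibbs) scheme, with placeholder stand-ins for the model-specific steps; only the loop structure mirrors the slides, and the iteration count is kept tiny:

```python
import numpy as np

rng = np.random.default_rng(3)
K, ITERS = 10, 100          # toy size; the talk reports 10,000-20,000 iterations

def sample_params_from_prior():
    # Stand-in for sampling rho_l, pi_{l,n,p}, beta, theta_k from their priors.
    return {"pi": rng.dirichlet(np.ones(K))}

def sample_latents(params, data):
    # Stand-in for sampling n_i and c_{i,p} given the current parameters
    # (block-sampled per utterance in the real model).
    return [int(rng.choice(K, p=params["pi"])) for _ in data]

def sample_params(latents):
    # Stand-in conjugate update: Dirichlet posterior given the label counts.
    counts = np.bincount(latents, minlength=K)
    return {"pi": rng.dirichlet(1.0 + counts)}

data = range(50)                        # placeholder for the training utterances
params = sample_params_from_prior()     # initialize from the prior
for _ in range(ITERS):
    latents = sample_latents(params, data)   # sample latent variables | parameters
    params = sample_params(latents)          # sample parameters | latent variables
```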

SLIDES 95-101

Induce Lexicon and Acoustic Model

  • ni and ci define word pronunciations and phone transcriptions
  • For "red sox", the inferred alignment yields the lexicon entries red: 3 1 17 and sox: 2 19 56 2
  • The learned HMMs θ1, ..., θK serve as the acoustic model
  • Train a speech recognizer with the induced lexicon and acoustic model
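
Reading the lexicon off a sampled alignment is mechanical: walk the letters of each word and concatenate the phone labels assigned to them. A small sketch using the example values from the slides (the function and data layout are illustrative, not the authors' code):

```python
def induce_lexicon(words, c):
    """c: phone labels per letter of the flattened string 'word1 word2 ...'
    (a letter that maps to 0 phones gets an empty list)."""
    lexicon, i = {}, 0
    for word in words:
        pron = []
        for _ in word:
            pron.extend(c[i]); i += 1   # concatenate this letter's phone labels
        i += 1                          # skip the space separator
        lexicon.setdefault(word, set()).add(tuple(pron))
    return lexicon

# c_{i,p} from the slides: r:3, e:1, d:17, space:(none), s:2, o:19, x:56 2
c = [[3], [1], [17], [], [2], [19], [56, 2]]
print(induce_lexicon(["red", "sox"], c))
# {'red': {(3, 1, 17)}, 'sox': {(2, 19, 56, 2)}}
```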

SLIDES 102-104

Experimental Setup

  • Dataset
  • Jupiter [Zue et al., IEEE Trans. on Speech and Audio Processing, 2000]
  • Conversational telephone weather information queries
  • 72 hours of training data and 3.2 hours of test data
  • A subset of 8 hours of the training data is used for training our model
  • Benchmark and baseline
  • A speech recognizer trained with an expert-crafted lexicon (Supervised)
  • A grapheme-based recognizer (Grapheme)
  • A 3-gram language model is used for all experiments
slide-105
SLIDE 105

Results - Monophone Acoustic Model

68

WER (%) Grapheme 32.7 Our model 17.0 Supervised 13.8

  • Word error rate (WER)
SLIDES 106-107

Results - Triphone Acoustic Model

  • Word error rate (WER)
  • Singleton questions are used to build the decision trees

               WER (%)
  Grapheme     15.7
  Our model    13.4
  Supervised   10.0

SLIDES 108-109

Related Work

  • Word pronunciation learning
  • A segment model based approach to speech recognition [Lee et al., ICASSP 1988]
  • Lexicon-building methods for an acoustic sub-word based speech recognizer [Paliwal, ICASSP 1990]
  • Speech recognition based on acoustically derived segment units [Fukuda et al., ICSLP 1996]
  • Joint lexicon, acoustic unit inventory and model design [Bacchiani and Ostendorf, Speech Communication 1999]
  • Grapheme recognizer
  • Grapheme based speech recognition [Killer et al., Eurospeech 2003]
  • A grapheme based speech recognizer for Russian [Stuker and Schultz, SPECOM 2004]

SLIDES 110-112

Conclusion

  • A joint learning framework for discovering the pronunciation lexicon and acoustic model
  • Phonetic units are modeled by an HMM-based mixture model
  • L2S mapping rules are captured by weights over the mixtures
  • L2S rules are tied together through a hierarchical structure
  • Automatic speech recognition experiments
  • Outperforms a grapheme-based speech recognizer
  • Approaches the performance of a recognizer trained with an expert lexicon
  • Future work: apply the learned lexicon and phone units to existing ASR training methods, using our model as an initialization

SLIDE 113

Thank you.

SLIDES 114-119

Sample ni and ci,p

  • ni and ci,p denote an alignment between text and speech
  • Sample a new alignment:
  • Compute the probabilities of all possible alignments
  • Backward message passing with dynamic programming
  • Forward block-sample new ni and ci,p
  • Similar to inference for hidden semi-Markov models
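
A toy instance of the backward-filtering, forward-sampling pattern for a segmental (hidden semi-Markov) model. Everything here is a simplifying assumption: one Gaussian per unit instead of an HMM, a uniform prior over durations and labels, and messages over frame positions rather than over letters, ni, and ci,p; only the two-pass structure matches the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
T, K, DMAX = 20, 3, 4                     # frames, units, max segment duration
mu = np.array([-2.0, 0.0, 2.0])           # one Gaussian per unit (HMM stand-in)
x = rng.normal(mu[rng.choice(K, 7)].repeat(3)[:T], 0.5)   # toy observations

def seg_ll(t, d, k):
    """Log-likelihood of frames x[t:t+d] under unit k (unit-variance Gaussian)."""
    return float(-0.5 * np.sum((x[t:t + d] - mu[k]) ** 2))

# Backward message passing with dynamic programming:
# B[t] = log p(x[t:]), summed over all segmentations of the suffix.
B = np.full(T + 1, -np.inf)
B[T] = 0.0
log_prior = -np.log(DMAX * K)             # uniform prior over (duration, label)
for t in range(T - 1, -1, -1):
    terms = [log_prior + seg_ll(t, d, k) + B[t + d]
             for d in range(1, min(DMAX, T - t) + 1) for k in range(K)]
    B[t] = np.logaddexp.reduce(terms)

# Forward block sampling: draw an entire alignment in one left-to-right sweep.
t, segments = 0, []
while t < T:
    opts = [(d, k) for d in range(1, min(DMAX, T - t) + 1) for k in range(K)]
    logp = np.array([log_prior + seg_ll(t, d, k) + B[t + d] for d, k in opts]) - B[t]
    p = np.exp(logp)
    d, k = opts[rng.choice(len(opts), p=p / p.sum())]   # exact posterior over blocks
    segments.append((t, d, k))
    t += d
print(segments)                            # (start frame, duration, unit) per segment
```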

SLIDES 120-127

Refine Induced Lexicon

  • Pronunciations of Burma (sequences of phone IDs) and their pronunciation probabilities:

  pronunciation (b)          Our model   +1 PMM*   +2 PMM*
  93 56 87 39 19             0.125
  93 56 61 87 73 99          0.125
  11 56 61 87 73 99          0.125       0.400     0.419
  93 20 75 87 17 27 52       0.125       0.125     0.124
  55 93 56 61 87 73 84 19    0.125       0.220     0.210
  93 26 61 87 49             0.125       0.128     0.140
  63 83 86 87 73 53 19       0.125
  93 26 61 87 61             0.125       0.127     0.107
  Average entropy (H)        4.58        3.47      3.03
  WER (%)                    17.0        16.6      15.9

  • Average pronunciation entropy:

  H ≡ −(1/|V|) Σw∈V Σb∈B(w) p(b) log p(b)

  B(w) : all pronunciations of a word w
  p(b) : pronunciation probability
  V : vocabulary of the data

  *Learning lexicon from speech using a pronunciation mixture model [McGraw et al., 2013]
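
A direct transcription of the entropy metric, assuming base-2 logs (the base is not stated on the slide); an 8-way uniform pronunciation distribution like the "Our model" column for Burma gives 3 bits:

```python
import math

def avg_pron_entropy(lexicon):
    """H = -(1/|V|) * sum over words w and pronunciations b in B(w)
    of p(b) * log2 p(b); lexicon maps word -> {pronunciation: p(b)}."""
    total = 0.0
    for prons in lexicon.values():
        total -= sum(p * math.log2(p) for p in prons.values() if p > 0)
    return total / len(lexicon)

# Eight pronunciations with uniform probability 1/8 each, as in the
# "Our model" column for Burma:
burma = {f"pron_{i}": 0.125 for i in range(8)}
print(avg_pron_entropy({"burma": burma}))   # 3.0
```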

SLIDES 128-134

Position-dependent L2S Rules

  • Take phone position into account
  • Index the weights by (letter, number of phones n, position p): for the letter x in "red sox", which maps to two phones (positions (2,1) and (2,2)), sample ci ~ πx,2,1 for the first phone (56 on the slides) and ci ~ πx,2,2 for the second (2), instead of drawing both from a single πx
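
Position-dependent rules simply widen the index of the weight table. A minimal sketch, with the unit count and Dirichlet prior assumed:

```python
import numpy as np

rng = np.random.default_rng(5)
K, letters = 60, "abcdefghijklmnopqrstuvwxyz "

# One K-dim weight vector per (letter, n, p) with 1 <= p <= n <= 2,
# e.g. pi[("x", 2, 1)] and pi[("x", 2, 2)] replace a single pi_x.
pi = {(l, n, p): rng.dirichlet(np.ones(K))
      for l in letters for n in (1, 2) for p in range(1, n + 1)}

# The letter "x" in "sox" maps to two phones: sample the first and second
# phone labels from position-specific distributions.
c1 = rng.choice(K, p=pi[("x", 2, 1)])   # e.g. 56 on the slides
c2 = rng.choice(K, p=pi[("x", 2, 2)])   # e.g. 2 on the slides
print(c1, c2)
```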