SLIDE 1

Joint Learning of Phonetic Units and Word Pronunciations for ASR

Chia-ying (Jackie) Lee, Yu Zhang and James Glass

Spoken Language Systems Group, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA

SLIDES 2-4

World Language Map

Data source: http://www.ethnologue.com/

  • Roughly 7,000 living languages all around the world
  • Only 2% are supported by automatic speech recognition (ASR) technology

  Region      # of living languages
  Americas    1,060
  Africa      2,146
  Europe      284
  Asia        2,304
  Pacific     1,311

SLIDES 5-10

2% Language Barrier

  • Conventional ASR training is expensive
  • Requires a lot of expert knowledge

  Phonetic inventory: [b] [p] [k] [ae] [iy] ...
  Lexicon: big: [b I g], cat: [k ae t], ...
    (require linguistic expert knowledge; difficult to collect)
  Annotated speech: "hello world" ...
    (easier to generate by non-experts)

SLIDES 11-14

Towards ASR Training without Experts

  • Infer the lexicon and phonetic units from transcribed speech: the phonetic inventory and lexicon require linguistic expert knowledge and are difficult to collect, while annotated speech is easier to generate by non-experts

SLIDES 15-18

Discover Pronunciation Lexicon

  • Learn word pronunciations from transcribed speech

  Utterance: I need to fly to Texas
  Phones: [ay] [n] [iy] [d] [t] [ux] [f] [l] [ay] [t] [ux] [t] [e] [k] [s] [ax] [s]
  Induced entries: I: [ay], need: [n iy d], to: [t ux], fly: [f l ay], ...

SLIDES 19-21

Without Linguistic Knowledge

  • Can we discover the word pronunciations?

  ང་གzགས་པོ་sང་ད་གམེད། གས་rེ་ཆེ།

SLIDES 22-24

Challenges

  I need to fly to Texas
  [ay] [n] [iy] [d] [t] [ux] [f] [l] [ay] [t] [ux] [t] [e] [k] [s] [ax] [s]

  • Latent phone sequence
  • Latent letter-to-sound (L2S) mapping rules

SLIDES 25-35

Hierarchical Bayesian Model

  • Unknown phone inventory: modeled with an HMM-based mixture model of K HMMs θ1, θ2, θ3, ..., θK, one per phonetic unit (θk: an HMM; e.g. [s], [iy], [z], [k], ...)
  • Unknown L2S rules: modeled as weights over the HMMs, with one weight vector associated with each letter (πc, πs, ...)
  • Unknown phone sequence
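
To make the mixture concrete, here is a minimal sketch of an HMM-based mixture model in Python with NumPy (not the authors' implementation): a few toy HMMs with Gaussian emissions stand in for the phonetic units θk, and each letter owns a weight vector π over them. The sizes, the 2-D features, and the left-to-right topology are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, S, D = 8, 3, 2            # assumed sizes: 8 units, 3 HMM states, 2-D features

class PhoneHMM:
    """A tiny left-to-right HMM with Gaussian emissions, standing in for theta_k."""
    def __init__(self):
        self.means = rng.normal(size=(S, D))   # one emission mean per state
        self.stay = 0.6                        # self-transition probability

    def sample_frames(self):
        frames, s = [], 0
        while s < S:                           # walk left to right through the states
            frames.append(rng.normal(self.means[s], 0.1))
            if rng.random() > self.stay:       # leave the current state
                s += 1
        return np.array(frames)

hmms = [PhoneHMM() for _ in range(K)]          # theta_1 ... theta_K: the phone inventory

# One weight vector over the K HMMs per letter (pi_c, pi_s, ...); a symmetric
# Dirichlet draw stands in for the learned L2S weights here.
pi = {ch: rng.dirichlet(np.ones(K)) for ch in "abcdefghijklmnopqrstuvwxyz"}

k = rng.choice(K, p=pi["s"])                   # pick a unit for the letter "s"
x = hmms[k].sample_frames()                    # emit speech frames from its HMM
print(f"letter 's' -> unit {k}, {len(x)} frames")
```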

SLIDES 36-79

Generative Process

  For each letter li of the transcription (e.g. "red sox"):

  • Step 1: Generate the number of phones that each letter maps to (ni): ni ~ 𝜚li, where 𝜚li ~ Dir(η) is a 3-dim categorical distribution (a letter maps to 0, 1, or 2 phones)
  • Step 2: Generate the phone label (ci,p) for every phone that a letter maps to, 1 ≤ p ≤ ni: ci,p ~ πli, where πli ~ Dir(γ) is a weight vector over the HMMs θ1, ..., θK
  • Step 3: Generate speech (xt): each phone label ci,p selects the HMM θci,p, which emits the speech frames xt for that phone

  For "red sox", the slides sample ni = (r:1, e:1, d:1, space:0, s:1, o:1, x:2) and phone labels ci,p = (r:3, e:1, d:17, s:2, o:19, x:56 2).
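
The three steps fit in a few lines. Below is a compact sketch of the generative process under illustrative assumptions (symmetric Dirichlet priors, K = 60 units, and a 1-D Gaussian with a random duration standing in for each HMM θk):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 60                      # assumed number of HMMs / phonetic units
eta, gamma = 1.0, 1.0       # Dirichlet hyperparameters (illustrative values)
letters = list("red sox")

# Per-letter parameters: rho_l ~ Dir(eta) over {0, 1, 2} phones,
# pi_l ~ Dir(gamma) over the K units.
rho = {l: rng.dirichlet(eta * np.ones(3)) for l in set(letters)}
pi = {l: rng.dirichlet(gamma * np.ones(K)) for l in set(letters)}
means = rng.normal(size=K)  # stand-in for the HMMs theta_1..theta_K

n, c, frames = [], [], []
for l in letters:
    n_i = int(rng.choice(3, p=rho[l]))                   # Step 1: phones per letter
    c_i = [int(rng.choice(K, p=pi[l])) for _ in range(n_i)]  # Step 2: phone labels
    for k in c_i:                                        # Step 3: emit frames x_t
        dur = rng.integers(3, 8)                         # placeholder for HMM dwell time
        frames.append(rng.normal(means[k], 0.1, size=dur))
    n.append(n_i); c.append(c_i)

print("n_i per letter:", dict(zip(letters, n)))
print("c_{i,p} per letter:", dict(zip(letters, c)))
```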

SLIDES 80-85

Context-dependent L2S Rules

  • Take context into account for learning L2S mapping rules
  • More specific rules: condition the weights on a letter trigram, e.g. πsox for the letter o in the context s_x, rather than the single-letter πo (likewise 𝜚sox ~ DP(γ, 𝜚) for the phone-count distribution)
  • Back-off mechanism through hierarchy: ci ~ πsox, with πsox ~ Dir(απo), πo ~ Dir(λβ), β ~ Dir(𝛿)
  • View πo as the prior of πsox
  • If sox appears frequently, πsox approaches the empirical distribution of phone labels in that context
  • If sox is rarely observed, πsox backs off to πo
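
The back-off behavior falls out of the Dirichlet posterior mean. A sketch under the simplifying assumption that πo is fixed and we observe N phone labels in the context sox: E[πsox | counts] = (counts + α·πo) / (N + α), which moves from πo toward the empirical distribution as N grows.

```python
import numpy as np

rng = np.random.default_rng(2)
K, alpha = 10, 5.0                      # illustrative size and concentration
pi_o = rng.dirichlet(np.ones(K))        # parent (single-letter) distribution for "o"

def posterior_mean(counts, alpha, pi_o):
    """E[pi_sox | counts] when pi_sox ~ Dir(alpha * pi_o) and the counts are
    multinomial draws from pi_sox: (counts + alpha*pi_o) / (N + alpha)."""
    return (counts + alpha * pi_o) / (counts.sum() + alpha)

few = np.zeros(K); few[3] = 1           # "sox" rarely observed: stays near pi_o
print(np.abs(posterior_mean(few, alpha, pi_o) - pi_o).sum())

many = np.zeros(K); many[3] = 1000      # "sox" frequent: approaches the counts
print(posterior_mean(many, alpha, pi_o)[3])   # close to 1
```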

SLIDE 86

Graphical Model

(Graphical model figure: plates i = 1 ... L over letters, p = 1 ... ni over phones, t = 1 ... di over frames, and K over HMMs; variables li, ni, ci,p, xt; parameters 𝜚l on a G×G×G plate and πl,n,p on G×G and G×{n,p} plates with 1 ≤ p ≤ n, 1 ≤ n ≤ 2; β; θk; θ0; hyperparameters γ, η, 𝛿, λ, α.)

  G : the set of graphemes
  l : sequence of three graphemes / observed graphemes
  d : phone duration
  n : number of phones a grapheme maps to
  L : total number of graphemes
  K : total number of HMMs
  c : phone id
  𝜚l : 3-dim categorical distribution
  πl,n,p, β : K-dim categorical distributions
  θk : an HMM
  θ0 : HMM prior
  𝛿, λ, α : concentration parameters
  x : observed speech

SLIDES 87-93

Inference

  • Two kinds of unknowns: latent model parameters (𝜚l, πl,n,p, β, θk) and regular latent variables (ni, ci,p)
  • Procedure: 20,000 iterations
  • Initialize the model parameters by sampling from the prior, then alternate between sampling the latent variables given the parameters and sampling the parameters given the latent variables

SLIDE 94

Inference

  • Procedure: 10,000 iterations, with block-sampling of the alignment variables ni and ci,p
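
A control-flow skeleton of this alternating (Gibbs) scheme, with placeholder stand-ins for the model-specific steps; only the loop structure mirrors the slides, and the iteration count is kept tiny:

```python
import numpy as np

rng = np.random.default_rng(3)
K, ITERS = 10, 100          # toy size; the talk reports 10,000-20,000 iterations

def sample_params_from_prior():
    # Stand-in for sampling rho_l, pi_{l,n,p}, beta, theta_k from their priors.
    return {"pi": rng.dirichlet(np.ones(K))}

def sample_latents(params, data):
    # Stand-in for sampling n_i and c_{i,p} given the current parameters
    # (block-sampled per utterance in the real model).
    return [int(rng.choice(K, p=params["pi"])) for _ in data]

def sample_params(latents):
    # Stand-in conjugate update: Dirichlet posterior given the label counts.
    counts = np.bincount(latents, minlength=K)
    return {"pi": rng.dirichlet(1.0 + counts)}

data = range(50)                        # placeholder for the training utterances
params = sample_params_from_prior()     # initialize from the prior
for _ in range(ITERS):
    latents = sample_latents(params, data)   # sample latent variables | parameters
    params = sample_params(latents)          # sample parameters | latent variables
```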

SLIDES 95-101

Induce Lexicon and Acoustic Model

  • ni and ci define word pronunciations and phone transcriptions
  • For "red sox", the inferred alignment yields the lexicon entries red: 3 1 17 and sox: 2 19 56 2
  • The learned HMMs θ1, ..., θK serve as the acoustic model
  • Train a speech recognizer with the induced lexicon and acoustic model
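
Reading the lexicon off a sampled alignment is mechanical: walk the letters of each word and concatenate the phone labels assigned to them. A small sketch using the example values from the slides (the function and data layout are illustrative, not the authors' code):

```python
def induce_lexicon(words, c):
    """c: phone labels per letter of the flattened string 'word1 word2 ...'
    (a letter that maps to 0 phones gets an empty list)."""
    lexicon, i = {}, 0
    for word in words:
        pron = []
        for _ in word:
            pron.extend(c[i]); i += 1   # concatenate this letter's phone labels
        i += 1                          # skip the space separator
        lexicon.setdefault(word, set()).add(tuple(pron))
    return lexicon

# c_{i,p} from the slides: r:3, e:1, d:17, space:(none), s:2, o:19, x:56 2
c = [[3], [1], [17], [], [2], [19], [56, 2]]
print(induce_lexicon(["red", "sox"], c))
# {'red': {(3, 1, 17)}, 'sox': {(2, 19, 56, 2)}}
```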

SLIDES 102-104

Experimental Setup

  • Dataset
  • Jupiter [Zue et al., IEEE Trans. on Speech and Audio Processing, 2000]
  • Conversational telephone weather information queries
  • 72 hours of training data and 3.2 hours of test data
  • A subset of 8 hours of the training data is used for training our model
  • Benchmark and baseline
  • A speech recognizer trained with an expert-crafted lexicon (Supervised)
  • A grapheme-based recognizer (Grapheme)
  • A 3-gram language model is used for all experiments
slide-105
SLIDE 105

Results - Monophone Acoustic Model

68

WER (%) Grapheme 32.7 Our model 17.0 Supervised 13.8

  • Word error rate (WER)
SLIDES 106-107

Results - Triphone Acoustic Model

  • Word error rate (WER)
  • Singleton questions are used to build the decision trees

               WER (%)
  Grapheme     15.7
  Our model    13.4
  Supervised   10.0

SLIDES 108-109

Related Work

  • Word pronunciation learning
  • A segment model based approach to speech recognition [Lee et al., ICASSP 1988]
  • Lexicon-building methods for an acoustic sub-word based speech recognizer [Paliwal, ICASSP 1990]
  • Speech recognition based on acoustically derived segment units [Fukuda et al., ICSLP 1996]
  • Joint lexicon, acoustic unit inventory and model design [Bacchiani and Ostendorf, Speech Communication 1999]
  • Grapheme recognizer
  • Grapheme based speech recognition [Killer et al., Eurospeech 2003]
  • A grapheme based speech recognizer for Russian [Stuker and Schultz, SPECOM 2004]

SLIDES 110-112

Conclusion

  • A joint learning framework for discovering the pronunciation lexicon and acoustic model
  • Phonetic units are modeled by an HMM-based mixture model
  • L2S mapping rules are captured by weights over the mixtures
  • L2S rules are tied together through a hierarchical structure
  • Automatic speech recognition experiments
  • Outperforms a grapheme-based speech recognizer
  • Approaches the performance of a recognizer trained with an expert lexicon
  • Future work: apply the learned lexicon and phone units to existing ASR training methods, using our model as an initialization

SLIDE 113

Thank you.

SLIDES 114-119

Sample ni and ci,p

  • ni and ci,p denote an alignment between text and speech
  • Sample a new alignment:
  • Compute the probabilities of all possible alignments
  • Backward message passing with dynamic programming
  • Forward block-sample new ni and ci,p
  • Similar to inference for hidden semi-Markov models
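
A toy instance of the backward-filtering, forward-sampling pattern for a segmental (hidden semi-Markov) model. Everything here is a simplifying assumption: one Gaussian per unit instead of an HMM, a uniform prior over durations and labels, and messages over frame positions rather than over letters, ni, and ci,p; only the two-pass structure matches the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
T, K, DMAX = 20, 3, 4                     # frames, units, max segment duration
mu = np.array([-2.0, 0.0, 2.0])           # one Gaussian per unit (HMM stand-in)
x = rng.normal(mu[rng.choice(K, 7)].repeat(3)[:T], 0.5)   # toy observations

def seg_ll(t, d, k):
    """Log-likelihood of frames x[t:t+d] under unit k (unit-variance Gaussian)."""
    return float(-0.5 * np.sum((x[t:t + d] - mu[k]) ** 2))

# Backward message passing with dynamic programming:
# B[t] = log p(x[t:]), summed over all segmentations of the suffix.
B = np.full(T + 1, -np.inf)
B[T] = 0.0
log_prior = -np.log(DMAX * K)             # uniform prior over (duration, label)
for t in range(T - 1, -1, -1):
    terms = [log_prior + seg_ll(t, d, k) + B[t + d]
             for d in range(1, min(DMAX, T - t) + 1) for k in range(K)]
    B[t] = np.logaddexp.reduce(terms)

# Forward block sampling: draw an entire alignment in one left-to-right sweep.
t, segments = 0, []
while t < T:
    opts = [(d, k) for d in range(1, min(DMAX, T - t) + 1) for k in range(K)]
    logp = np.array([log_prior + seg_ll(t, d, k) + B[t + d] for d, k in opts]) - B[t]
    p = np.exp(logp)
    d, k = opts[rng.choice(len(opts), p=p / p.sum())]   # exact posterior over blocks
    segments.append((t, d, k))
    t += d
print(segments)                            # (start frame, duration, unit) per segment
```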

SLIDES 120-127

Refine Induced Lexicon

  • Pronunciations of Burma (sequences of phone IDs) and their pronunciation probabilities:

  pronunciation (b)          Our model   +1 PMM*   +2 PMM*
  93 56 87 39 19             0.125
  93 56 61 87 73 99          0.125
  11 56 61 87 73 99          0.125       0.400     0.419
  93 20 75 87 17 27 52       0.125       0.125     0.124
  55 93 56 61 87 73 84 19    0.125       0.220     0.210
  93 26 61 87 49             0.125       0.128     0.140
  63 83 86 87 73 53 19       0.125
  93 26 61 87 61             0.125       0.127     0.107
  Average entropy (H)        4.58        3.47      3.03
  WER (%)                    17.0        16.6      15.9

  • Average pronunciation entropy:

  H ≡ −(1/|V|) Σw∈V Σb∈B(w) p(b) log p(b)

  B(w) : all pronunciations of a word w
  p(b) : pronunciation probability
  V : vocabulary of the data

  *Learning lexicon from speech using a pronunciation mixture model [McGraw et al., 2013]
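
A direct transcription of the entropy metric, assuming base-2 logs (the base is not stated on the slide); an 8-way uniform pronunciation distribution like the "Our model" column for Burma gives 3 bits:

```python
import math

def avg_pron_entropy(lexicon):
    """H = -(1/|V|) * sum over words w and pronunciations b in B(w)
    of p(b) * log2 p(b); lexicon maps word -> {pronunciation: p(b)}."""
    total = 0.0
    for prons in lexicon.values():
        total -= sum(p * math.log2(p) for p in prons.values() if p > 0)
    return total / len(lexicon)

# Eight pronunciations with uniform probability 1/8 each, as in the
# "Our model" column for Burma:
burma = {f"pron_{i}": 0.125 for i in range(8)}
print(avg_pron_entropy({"burma": burma}))   # 3.0
```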

SLIDES 128-134

Position-dependent L2S Rules

  • Take phone position into account
  • Index the weights by (letter, number of phones n, position p): for the letter x in "red sox", which maps to two phones (positions (2,1) and (2,2)), sample ci ~ πx,2,1 for the first phone (56 on the slides) and ci ~ πx,2,2 for the second (2), instead of drawing both from a single πx
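
Position-dependent rules simply widen the index of the weight table. A minimal sketch, with the unit count and Dirichlet prior assumed:

```python
import numpy as np

rng = np.random.default_rng(5)
K, letters = 60, "abcdefghijklmnopqrstuvwxyz "

# One K-dim weight vector per (letter, n, p) with 1 <= p <= n <= 2,
# e.g. pi[("x", 2, 1)] and pi[("x", 2, 2)] replace a single pi_x.
pi = {(l, n, p): rng.dirichlet(np.ones(K))
      for l in letters for n in (1, 2) for p in range(1, n + 1)}

# The letter "x" in "sox" maps to two phones: sample the first and second
# phone labels from position-specific distributions.
c1 = rng.choice(K, p=pi[("x", 2, 1)])   # e.g. 56 on the slides
c2 = rng.choice(K, p=pi[("x", 2, 2)])   # e.g. 2 on the slides
print(c1, c2)
```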