Joint Learning of Phonetic Units and Word Pronunciations for ASR
Chia-ying (Jackie) Lee, Yu Zhang and James Glass
Spoken Language Systems Group MIT Computer Science and Artificial Intelligence Lab Cambridge, MA
World Language Map
Data source: http://www.ethnologue.com/
Region   | # of living languages
Americas | 1,060
Africa   | 2,146
Europe   | 284
Asia     | 2,304
Pacific  | 1,311
Phonetic inventory: [b] [p] [k] [ae] [iy] ...
Lexicon: big: [b I g]; cat: [k ae t]
Annotated speech: hello world ...
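The three resources listed above (phonetic inventory, lexicon, annotated speech) can be sketched as plain data structures. Everything below is illustrative: the phone labels are ARPAbet-like stand-ins and the feature vectors are dummies, not the paper's actual data.

```python
# Phonetic inventory: the set of phone-like units, e.g. [b], [p], [k], [ae], [iy].
phonetic_inventory = {"b", "p", "k", "g", "t", "ae", "ih", "iy"}

# Lexicon: each word maps to one or more pronunciations (sequences of units).
lexicon = {
    "big": [["b", "ih", "g"]],
    "cat": [["k", "ae", "t"]],
}

# Annotated speech: transcripts paired with acoustic features (dummy values here).
annotated_speech = [("hello world", [0.1, 0.2, 0.3])]

# Sanity check: every unit used in the lexicon must appear in the inventory.
for word, prons in lexicon.items():
    for pron in prons:
        assert all(unit in phonetic_inventory for unit in pron), word
```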
(Model notation from the slides: a variable lᵢ with Dirichlet prior lᵢ ~ Dir(γ); a position index p with 1 ≤ p ≤ nᵢ; a 3-dimensional categorical distribution.)
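The draw lᵢ ~ Dir(γ) yielding a 3-dimensional categorical distribution can be illustrated with the standard Gamma-normalization construction for Dirichlet sampling. This is a generic sketch, not the paper's inference code; the symmetric concentration γ = 0.5 is an arbitrary choice for illustration.

```python
import random

def sample_dirichlet(gamma, dim):
    """Draw a probability vector from a symmetric Dirichlet prior Dir(gamma)
    by normalizing independent Gamma(gamma, 1) draws."""
    draws = [random.gammavariate(gamma, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs):
    """Draw an index from the categorical distribution with the given weights."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

# l_i ~ Dir(gamma): a 3-dim categorical distribution drawn from its prior.
l_i = sample_dirichlet(0.5, 3)
choice = sample_categorical(l_i)  # an index in {0, 1, 2}
```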
(Graphical model in plate notation: plates i = 1 ... L, p = 1 ... nᵢ, t = 1 ... dᵢ; K; π_{l,n,p} over G×{n,p} with 1 ≤ p ≤ n, 1 ≤ n ≤ 2; 𝜚_l over G×G×G; G×G.)
Pronunciation probabilities

Pronunciation (unit sequence) | Our model | +1 PMM* | +2 PMM*
93 56 87 39 19                | 0.125     | 0.400   | 0.419
93 20 75 87 17 27 52          | 0.125     | 0.125   | 0.124
55 93 56 61 87 73 84 19       | 0.125     | 0.220   | 0.210
93 26 61 87 49                | 0.125     | 0.128   | 0.140
63 83 86 87 73 53 19          | 0.125     | 0.127   | 0.107
Average entropy (H)           | 4.58      | 3.47    | 3.03
WER (%)                       | 17.0      | 16.6    | 15.9
B(w): all pronunciations of a word w
p(b): pronunciation probability
V: vocabulary of the data

H ≡ −(1/|V|) Σ_{w∈V} Σ_{b∈B(w)} p(b) log p(b)
*Learning lexicon from speech using a pronunciation mixture model [McGraw et al., 2013]
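The "Average entropy (H)" row in the table can be computed directly from the pronunciation probabilities p(b). A minimal sketch, assuming base-2 logarithms and representing each word's pronunciation set B(w) as a dict of probabilities (the toy lexicon below is invented for illustration):

```python
import math

def average_pronunciation_entropy(lexicon):
    """H = -(1/|V|) * sum over w in V of sum over b in B(w) of p(b) * log2 p(b)."""
    total = 0.0
    for prons in lexicon.values():          # prons maps pronunciation -> p(b)
        total -= sum(p * math.log2(p) for p in prons.values() if p > 0.0)
    return total / len(lexicon)

# A word with 8 equally likely pronunciations (p = 0.125 each) contributes
# log2(8) = 3 bits, matching the uniform 0.125 column in the table.
toy = {"word": {f"pron{i}": 0.125 for i in range(8)}}
print(average_pronunciation_entropy(toy))  # 3.0
```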