Acoustic Modeling for Speech Recognition
References:
- 1. X. Huang et al., Spoken Language Processing, Chapter 8
- 2. The HTK Book (for HTK Version 3.2)
Berlin Chen, 2003
Introduction
For the given acoustic observation $X = x_1, x_2, \ldots, x_n$, the goal of speech recognition is to find the corresponding word sequence $W = w_1, w_2, \ldots, w_m$ that has the maximum posterior probability $P(W \mid X)$:

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(W)\,P(X \mid W)}{P(X)} = \arg\max_{W} P(W)\,P(X \mid W)$$

where $W = w_1, w_2, \ldots, w_i, \ldots, w_m$ and $w_i \in V: \{v_1, v_2, \ldots, v_N\}$

- Language Modeling, $P(W)$: domain, topic, style, etc.
- Acoustic Modeling, $P(X \mid W)$: speaker, pronunciation, environment, context, etc.
Time Domain
Frequency Domain Modeling the cepstral feature vectors
$$b_j(v_k) = \sum_{m=1}^{M} c_{jm}\, p(o_t = v_k \mid m, s_t = j), \qquad \sum_{m=1}^{M} c_{jm} = 1$$
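$$b_j(o_t) = \sum_{m=1}^{M} c_{jm}\, b_{jm}(o_t) = \sum_{m=1}^{M} c_{jm}\, N(o_t; \mu_{jm}, \Sigma_{jm}) = \sum_{m=1}^{M} c_{jm} \left( \frac{1}{(2\pi)^{L/2}\, |\Sigma_{jm}|^{1/2}} \exp\left( -\frac{1}{2} (o_t - \mu_{jm})^{T}\, \Sigma_{jm}^{-1}\, (o_t - \mu_{jm}) \right) \right)$$

$$\sum_{m=1}^{M} c_{jm} = 1$$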
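$$b_j(o) = \sum_{k=1}^{L} b_j(k)\, f(o \mid v_k) = \sum_{k=1}^{L} b_j(k)\, N(o; \mu_k, \Sigma_k)$$

$$b_j(o) = \prod_{m=1}^{M} \sum_{k=1}^{L} b_{jm}(k)\, f(o_m \mid v_{m,k}) = \prod_{m=1}^{M} \sum_{k=1}^{L} b_{jm}(k)\, N(o_m; \mu_{m,k}, \Sigma_{m,k})$$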
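$$\text{Word Error Rate} = 100\% \times \frac{\text{Sub.} + \text{Del.} + \text{Ins. words}}{\text{No. of words in the correct sentence}} = 100\% \times \frac{2}{4} = 50\%$$
(Might be higher than 100%)

$$\text{Word Correction Rate} = 100\% \times \frac{\text{Matched words}}{\text{No. of words in the correct sentence}} = 100\% \times \frac{3}{4} = 75\%$$

$$\text{Word Accuracy Rate} = 100\% \times \frac{\text{Matched} - \text{Ins. words}}{\text{No. of words in the correct sentence}} = 100\% \times \frac{3 - 1}{4} = 50\%$$
(Might be negative)

WER + WAR = 100%
(In the example, the words of the recognized sentence are marked as matched, inserted, or deleted.)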
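The three rates defined above can be computed directly from the error counts. The following is a minimal sketch in C, not taken from the slides or from HTK; the names `WordRates` and `word_rates` are hypothetical, and the number of matched words is derived as the reference length minus substitutions and deletions.

```c
#include <assert.h>

/* The three word-level rates, expressed in percent. */
typedef struct {
    double error_rate;      /* Word Error Rate */
    double correction_rate; /* Word Correction Rate */
    double accuracy_rate;   /* Word Accuracy Rate */
} WordRates;

/* n_ref: number of words in the correct/reference sentence (must be > 0).
   sub, del, ins: substitution, deletion, and insertion counts. */
WordRates word_rates(int n_ref, int sub, int del, int ins)
{
    WordRates r;
    int matched;

    assert(n_ref > 0 && sub >= 0 && del >= 0 && ins >= 0);
    /* Every reference word is either matched, substituted, or deleted. */
    matched = n_ref - sub - del;
    assert(matched >= 0);

    r.error_rate      = 100.0 * (sub + del + ins) / n_ref;
    r.correction_rate = 100.0 * matched / n_ref;
    r.accuracy_rate   = 100.0 * (matched - ins) / n_ref;
    return r;
}
```

Assuming the slide's four-word example contains one deletion and one insertion (which is consistent with its numbers), `word_rates(4, 0, 1, 1)` gives 50%, 75%, and 50%. With more insertions than reference words, for instance `word_rates(2, 0, 0, 3)`, the error rate exceeds 100% and the accuracy rate becomes negative.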
- m: denotes the word length of the correct/reference sentence
- n: denotes the word length of the recognized/test sentence
- G[i][j]: the minimum word error alignment at grid [i,j] (test index i, reference index j)
- Kinds of alignment: insertion, deletion, substitution/hit
Step 1: Initialization
    G[0][0] = 0;
    for i = 1,...,n {            //test
        G[i][0] = G[i-1][0] + 1;
        B[i][0] = 1;             //Insertion (Horizontal Direction)
    }
    for j = 1,...,m {            //reference
        G[0][j] = G[0][j-1] + 1;
        B[0][j] = 2;             //Deletion (Vertical Direction)
    }

Step 2: Iteration
    for i = 1,...,n {            //test
        for j = 1,...,m {        //reference
            G[i][j] = min { G[i-1][j] + 1      (Insertion),
                            G[i][j-1] + 1      (Deletion),
                            G[i-1][j-1] + 1    (if LR[j] != LT[i], Substitution),
                            G[i-1][j-1]        (if LR[j] == LT[i], Match) };
            B[i][j] = 1;         //Insertion (Horizontal Direction)
                   or 2;         //Deletion (Vertical Direction)
                   or 3;         //Substitution (Diagonal Direction)
                   or 4;         //Match (Diagonal Direction)
        } //for j, reference
    } //for i, test

Step 3: Measure and Backtrace
    Word Error Rate = 100% × G[n][m] / m
    Word Accuracy Rate = 100% − Word Error Rate
    Optimal backtrace path: (B[n][m] → ..... → B[0][0])
        if B[i][j] = 1 print "LT[i]";          //Insertion, then go left
        else if B[i][j] = 2 print "LR[j]";     //Deletion, then go down
        else print "LR[j] LT[i]";              //Hit/Match or Substitution, then go diagonally down

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here
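The three steps above can be sketched as a small self-contained C function. This is an illustrative sketch under stated assumptions, not the HTK implementation: the names `align_words`, `AlignCounts`, and `MAX_WORDS` are hypothetical, all penalties are 1 as in the note, and ties are broken in favor of the diagonal move (other tie-breaking rules give the same total but possibly different counts).

```c
#include <assert.h>
#include <string.h>

#define MAX_WORDS 64  /* assumed upper bound on sentence length (illustrative) */

typedef struct { int ins, del, sub, hit; } AlignCounts;

/* Aligns test[0..n-1] against ref[0..m-1] with unit penalties.
   Returns G[n][m], the minimum number of word errors, and fills *c with
   the counts found along one optimal backtrace path.
   Returns -1 if a sentence is longer than MAX_WORDS. */
int align_words(const char **ref, int m, const char **test, int n,
                AlignCounts *c)
{
    static int G[MAX_WORDS + 1][MAX_WORDS + 1]; /* accumulated errors */
    static int B[MAX_WORDS + 1][MAX_WORDS + 1]; /* backtrace directions */
    int i, j;

    assert(ref != NULL && test != NULL && c != NULL);
    if (m < 0 || n < 0 || m > MAX_WORDS || n > MAX_WORDS) return -1;

    /* Step 1: initialization */
    G[0][0] = 0; B[0][0] = 0;
    for (i = 1; i <= n; i++) { G[i][0] = G[i-1][0] + 1; B[i][0] = 1; }
    for (j = 1; j <= m; j++) { G[0][j] = G[0][j-1] + 1; B[0][j] = 2; }

    /* Step 2: iteration */
    for (i = 1; i <= n; i++) {          /* test */
        for (j = 1; j <= m; j++) {      /* reference */
            int match = strcmp(ref[j-1], test[i-1]) == 0;
            int best = G[i-1][j-1] + (match ? 0 : 1);
            int dir = match ? 4 : 3;    /* 4 = match, 3 = substitution */
            if (G[i-1][j] + 1 < best) { best = G[i-1][j] + 1; dir = 1; }
            if (G[i][j-1] + 1 < best) { best = G[i][j-1] + 1; dir = 2; }
            G[i][j] = best;
            B[i][j] = dir;
        }
    }

    /* Step 3: backtrace from (n, m) to (0, 0), counting each operation */
    c->ins = c->del = c->sub = c->hit = 0;
    i = n; j = m;
    while (i > 0 || j > 0) {
        switch (B[i][j]) {
        case 1:  c->ins++; i--;      break;  /* insertion: go left */
        case 2:  c->del++; j--;      break;  /* deletion: go down */
        case 3:  c->sub++; i--; j--; break;  /* substitution: diagonal */
        default: c->hit++; i--; j--; break;  /* match: diagonal */
        }
    }
    return G[n][m];
}
```

The word error rate is then `100.0 * G[n][m] / m`. For the reference A C B C C, the test sentence B A B C yields 3 errors (60%) and B A A C yields 4 errors (80%).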
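[Figure: alignment grid. Horizontal axis: Recognized/test Word Sequence (1, 2, 3, 4, 5, ..., i, ..., n-1, n). Vertical axis: Correct/Reference Word Sequence (1, 2, 3, 4, ..., j, ..., m-1, m). Boundary cells are labeled with the accumulated insertions (Ins.) and deletions (Del.).]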
    /* Origin of the grid: no errors yet */
    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub = grid[0][0].hit = 0;
    grid[0][0].dir = NIL;
    for (i=1;i<=n;i++) {            /* test: only insertions along this axis */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += insPen;
        grid[i][0].ins++;
    }
    for (j=1;j<=m;j++) {            /* reference: only deletions along this axis */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += delPen;
        grid[0][j].del++;
    }
[Figure labels: 1 Del., 2 Del., 3 Del. along the reference axis; grid point (i,j) can be reached from (i-1,j-1), (i-1,j), or (i,j-1).]
    for (i=1;i<=n;i++) {                        /* test */
        gridi = grid[i]; gridi1 = grid[i-1];
        for (j=1;j<=m;j++) {                    /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d<=h && d<=v) {                 /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else ++gridi[j].sub;
            } else if (h<v) {                   /* HOR = ins */
                gridi[j] = gridi1[j];
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            } else {                            /* VERT = del */
                gridi[j] = gridi[j-1];
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        } /* for j */
    } /* for i */
Correct: A C B C C
Test:    B A B C
[Figure: alignment grid. Each cell holds the counts (Ins, Del, Sub, Hit). The boundary cells along the test axis are (1,0,0,0), (2,0,0,0), (3,0,0,0), (4,0,0,0); those along the reference axis are (0,1,0,0), (0,2,0,0), (0,3,0,0), (0,4,0,0), (0,5,0,0). The final cell is (1,2,0,3).]
Alignment 1: Ins B, Hit A, Del C, Hit B, Hit C, Del C
WER = 60%
There is still another optimal alignment!
Correct: A C B C C
Test:    B A A C
[Figure: alignment grid. Each cell holds the counts (Ins, Del, Sub, Hit); the final cell is (1,2,1,2).]
Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C (WER = 80%)
Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C (WER = 80%)
Alignment 3: [...] Hit A, Del C, Sub B, Hit C, Del C (WER = 80%)
Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here
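Because several alignments can reach the same minimum error count, it can be useful to count them. The sketch below is an illustration only and is not part of HTK: the name `count_optimal_alignments` and the bound `MAX_LEN` are hypothetical, each character of a string stands for one word (as in the A/B/C examples), all penalties are 1, and two alignments are considered distinct when they follow different paths through the grid.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN 16  /* assumed upper bound on sentence length (illustrative) */

/* Returns the number of distinct minimum-error alignments between the
   reference and test sentences, or -1 if a sentence is too long. */
long long count_optimal_alignments(const char *ref, const char *test)
{
    static int G[MAX_LEN + 1][MAX_LEN + 1];          /* minimum errors */
    static long long ways[MAX_LEN + 1][MAX_LEN + 1]; /* optimal paths */
    size_t m, n, i, j;

    assert(ref != NULL && test != NULL);
    m = strlen(ref);
    n = strlen(test);
    if (m > MAX_LEN || n > MAX_LEN) return -1;

    /* Boundary cells are reachable in exactly one way. */
    G[0][0] = 0; ways[0][0] = 1;
    for (i = 1; i <= n; i++) { G[i][0] = (int)i; ways[i][0] = 1; }
    for (j = 1; j <= m; j++) { G[0][j] = (int)j; ways[0][j] = 1; }

    for (i = 1; i <= n; i++) {          /* test */
        for (j = 1; j <= m; j++) {      /* reference */
            int ins = G[i-1][j] + 1;
            int del = G[i][j-1] + 1;
            int dia = G[i-1][j-1] + (ref[j-1] == test[i-1] ? 0 : 1);
            int best = ins < del ? ins : del;
            if (dia < best) best = dia;
            G[i][j] = best;
            /* Sum the path counts of every predecessor that ties for best. */
            ways[i][j] = 0;
            if (ins == best) ways[i][j] += ways[i-1][j];
            if (del == best) ways[i][j] += ways[i][j-1];
            if (dia == best) ways[i][j] += ways[i-1][j-1];
        }
    }
    return ways[n][m];
}
```

For example, with reference "AA" and test "A" there are two optimal alignments (delete either reference word), so the function returns 2. A scoring tool such as the HTK excerpt picks exactly one of the tied paths through its tie-breaking rule, which changes the reported Ins/Del/Sub counts but not the WER.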
subword
- Syllables (1,345)
- Base-syllables (408)
- INITIAL’s (21)
- FINAL’s (37)
- Phone-like Units/Phones (33)
- Tones (4+1)
- Robustness Enhancement
- Speaker-independency
- Speaker-adaptation
- Speaker-dependency
- Context-Dependent Acoustic Modeling
- Pronunciation Variation
door)
Pause or intonation information is needed. The effect is more important in fast speech, since many phonemes are not fully realized!
Statistics of the speaking rates
collected in Taiwan
Allophones: the different realizations of a phoneme are called allophones
→ Triphones are examples of allophones
In this example, the tree can be applied to the second state of any /k/ triphone
ㄊㄧㄢ (tian) ㄐㄧㄣ (jin) ㄐㄧㄢ (jian)
from Ming-yi Tsai
from Ming-yi Tsai
io (ㄧㄛ, e.g., for 唷) was ignored here
1 2 3 4
T: tall, t: medium-tall, M: medium, s: medium-short, S: short
$$P(\omega_i \mid t), \qquad \sum_{i} P(\omega_i \mid t) = 1$$
$$H_t(Y) = -\sum_{i} P(\omega_i \mid t) \log P(\omega_i \mid t)$$
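$$\Delta \bar{H}_t(q) = \bar{H}_t(Y) - \left( \bar{H}_l(Y) + \bar{H}_r(Y) \right) = \bar{H}_t(Y) - \bar{H}_t(Y \mid q)$$

$$q^{*} = \arg\max_{q} \Delta \bar{H}_t(q)$$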
i i i
i
i
i x i X i X
i
2 1 i
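$$\bar{H}_l(Y) = H_l(Y) \cdot P(\text{Node}_l) = 0 \cdot 1/4 = 0$$
$$\bar{H}_r(Y) = H_r(Y) \cdot P(\text{Node}_r) = 1.6 \cdot 3/4 = 1.2$$
$$\bar{H}(Y) = \bar{H}_l(Y) + \bar{H}_r(Y) = 1.2$$

$$\bar{H}_l(Y) = H_l(Y) \cdot P(\text{Node}_l) = 1 \cdot 1/2 = 1/2$$
$$\bar{H}_r(Y) = H_r(Y) \cdot P(\text{Node}_r) = 1 \cdot 1/2 = 1/2$$
$$\bar{H}(Y) = \bar{H}_l(Y) + \bar{H}_r(Y) = 1.0$$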
terminal is t t
1
2
1 1 1
, Σ µ N
2 2 2
, Σ µ N
2 1
2 2 2 2 2 2 1 1 1 1 1 1
x x
2 1 2 2 2 1 1 1
t
X
a, b are the sample counts for $X_1$ and $X_2$
See textbook pp. 179-180 and complete the derivation (due 12/9)