ECE 417 Lecture 20: MP5 Walkthrough
10/31/2019
Outline
- Background: things that are done for you
- Observations: mel-frequency cepstral coefficients (MFCC)
- Token to type alignment
- Gaussian surprisal: set_surprisal
- Scaled forward-backward algorithm
What you need to know: the MFCC is a low-dimensional feature vector that keeps the speech-relevant information from the MSTFT (magnitude short-time Fourier transform, 257 dimensions).
What you don't need to know, but here's the information in case you're interested: how it's done.
[Figure: the mel scale $m(f)$ versus pitch $f$, and the resulting bank of triangular filters applied to $|X_t(f)|$. Each filter has a width equal to about 200 mels: narrow filters at low frequency, wider at high frequency.]
Suppose $X$ is a matrix representing the MSTFT, $X[t,k] = |X_t(f_k)|$. We can compute the filterbank features as $F = XH$, where $H$ is the matrix of bandpass filters shown here:
- MSTFT, $X$ (an NFRAMES x 257 matrix)
- Triangle filters, $H$ (a 257 x 24 matrix)
- Filterbank features, $F = XH$ (an NFRAMES x 24 matrix)
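As a concrete illustration, here is a minimal numpy sketch of how such a filter matrix could be built and applied. The function names, the 8 kHz sampling rate, and the standard 2595·log10 mel formula are assumptions for the example, not the MP5 reference code.

```python
import numpy as np

def mel(f):
    """Hertz-to-mel conversion (standard formula; assumed here)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def triangle_filters(nfreq=257, nfilts=24, fs=8000):
    """Build a (nfreq x nfilts) matrix H of triangular bandpass filters
    whose corner frequencies are uniformly spaced on the mel scale, so
    filters are narrow at low frequency and wide at high frequency."""
    f = np.linspace(0, fs / 2, nfreq)               # frequency of each STFT bin
    m = np.linspace(0, mel(fs / 2), nfilts + 2)     # uniformly spaced mel points
    corners = 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # back to Hertz
    H = np.zeros((nfreq, nfilts))
    for l in range(nfilts):
        lo, mid, hi = corners[l], corners[l + 1], corners[l + 2]
        rising = (f - lo) / (mid - lo)              # 0 at lo, 1 at mid
        falling = (hi - f) / (hi - mid)             # 1 at mid, 0 at hi
        H[:, l] = np.maximum(0.0, np.minimum(rising, falling))
    return H

# Filterbank features: F = X @ H  (NFRAMES x 257 times 257 x 24 = NFRAMES x 24)
```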
Each 2D DCT basis function has the form
$$\cos\left(\frac{\pi k_1 \left(n_1 + \frac{1}{2}\right)}{N_1}\right)\cos\left(\frac{\pi k_2 \left(n_2 + \frac{1}{2}\right)}{N_2}\right)$$
With a 36th-order DCT (up to $k_1=5$, $k_2=5$), we can get a bit more detail about the image.
Suppose $F$ is a matrix representing the mel-scale filterbank features, $F = XH$. We can compute the mel-frequency cepstral coefficients (MFCC) as $M = \ln(F)\,T$, where $T$ is the DCT matrix:
- DCT matrix, $T$ (a 24 x 13 matrix)
- Log filterbank features, $\ln F$ (an NFRAMES x 24 matrix)
- MFCC, $M = \ln(F)\,T$ (an NFRAMES x 13 matrix)
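A sketch of the same product in numpy, with the DCT matrix $T$ built explicitly from the cosine basis shown above. This is an unnormalized DCT-II; the MP5 matrix may use a different normalization.

```python
import numpy as np

def dct_matrix(nfilts=24, ncep=13):
    """DCT matrix T (nfilts x ncep): T[n, k] = cos(pi*k*(n + 1/2)/nfilts).
    Unnormalized DCT-II basis; normalization conventions vary."""
    n = np.arange(nfilts)[:, None]   # filterbank channel index (column)
    k = np.arange(ncep)[None, :]     # cepstral coefficient index (row)
    return np.cos(np.pi * k * (n + 0.5) / nfilts)

# MFCC: M = np.log(F) @ dct_matrix()   # (NFRAMES x 24)(24 x 13) = NFRAMES x 13
```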
This defines the types (distinct phones that are present in the training data). This creates an array tok2type: tok→type. This code cuts out the tok2type array for a particular utterance, u, and then computes:
- the mean vectors,
- the variance vectors, and
- the transition probabilities among the tokens of the utterance.
Suppose that $\vec{o} = (o_1, \ldots, o_D)$ is a D-dimensional observation vector, and the observation dimensions are uncorrelated (e.g., MFCC). Then we can write the Gaussian pdf as
$$b_i(\vec{o}) = \frac{1}{\sqrt{(2\pi)^D |\Sigma_i|}}\, e^{-\frac{1}{2}(\vec{o}-\vec{\mu}_i)^\top \Sigma_i^{-1} (\vec{o}-\vec{\mu}_i)} = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{id}^2}}\, e^{-\frac{1}{2}\,\frac{(o_d-\mu_{id})^2}{\sigma_{id}^2}}$$
Complexity of inverting a D x D matrix: $O\{D^3\}$. With a diagonal covariance, there is just one scalar operation for each dimension: complexity = $O\{D\}$.
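For illustration, a direct numpy transcription of the diagonal-covariance product (the names are hypothetical). As the next slides warn, this direct form underflows in high dimensions, which is why MP5 computes the surprisal instead.

```python
import numpy as np

def diagonal_gaussian_pdf(o, mu, var):
    """b_i(o) for a diagonal-covariance Gaussian: a product of D univariate
    Gaussians, costing O(D) instead of the O(D^3) full-covariance inverse.
    o, mu, var: length-D vectors (var holds the sigma_id^2 values).
    WARNING: for large D this product underflows to 0.0 in floating point."""
    return np.prod(np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var))
```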
The information provided by an event $e$ should be some (as yet unknown) monotonically decreasing function, $f()$, of its probability: $i(e) = f(p(e))$. The information provided by two independent events should be the sum of the information provided by each: $i(e_1, e_2) = i(e_1) + i(e_2)$. There is only one function, $f()$, that satisfies both of these criteria: $i(e) = -\log p(e)$, because
$$i(e_1, e_2) = -\log p(e_1)p(e_2) = -\log p(e_1) - \log p(e_2) = i(e_1) + i(e_2)$$
The "information" provided by observation $\vec{o}$ is $i(\vec{o}) = -\log p(\vec{o})$. But the word "information" has been used for so many purposes that we hesitate to stick with it. There is a more technical-sounding word that is used only for this purpose: "surprisal." $i(\vec{o}) = -\log p(\vec{o})$ is the "surprisal" of observation $\vec{o}$, because it measures the degree to which we are surprised to observe $\vec{o}$.
- If $\vec{o}$ is very likely ($p(\vec{o}) \approx 1$), then we are not surprised ($i(\vec{o}) \approx 0$).
- If $\vec{o}$ is very unlikely ($p(\vec{o}) \approx 0$), then we are very surprised ($i(\vec{o}) \to \infty$).
[Figure: a 10-dimensional observation vector, its Gaussian probability density, and its surprisal.]
- Observations: reasonable numbers, easy to work with in floating point.
- Probability densities: unreasonable numbers, very hard to work with in floating point!
- Surprisals: reasonable numbers, easy to work with in floating point.
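A two-line numeric illustration of the middle item: a log-density of −800, routine for high-dimensional Gaussians, is unrepresentable as a density but trivial as a surprisal.

```python
import numpy as np

logp = -800.0            # a typical Gaussian log-density in high dimensions
print(np.exp(logp))      # 0.0: underflow, since the smallest positive
                         # double is ~1e-308 (about exp(-709))
print(-logp)             # 800.0: the surprisal is an ordinary float
```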
WARNING: Don't calculate surprisal using the method on this slide!!! Use the method on the next slide!!!
$$i_i(\vec{o}) = -\ln b_i(\vec{o}) = -\ln \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{id}^2}}\, e^{-\frac{1}{2}\,\frac{(o_d-\mu_{id})^2}{\sigma_{id}^2}} = \frac{1}{2} \sum_{d=1}^{D} \left[ \frac{(o_d-\mu_{id})^2}{\sigma_{id}^2} + \ln\left(2\pi\sigma_{id}^2\right) \right]$$
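A minimal numpy sketch of this formula. MP5's set_surprisal presumably fills an array of these values; the body below is just an illustration of the equation, not the official solution.

```python
import numpy as np

def surprisal(o, mu, var):
    """i_i(o) = -ln b_i(o) for a diagonal-covariance Gaussian, computed
    entirely in the log domain so nothing ever underflows:
      0.5 * sum_d [ (o_d - mu_id)^2 / sigma_id^2 + ln(2*pi*sigma_id^2) ]."""
    return 0.5 * np.sum((o - mu) ** 2 / var + np.log(2 * np.pi * var))
```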
[Figure: observations near the Gaussian mean fall in regions of zero surprise; observations far from the mean fall in regions of high surprise.]
$$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(\vec{o}_t) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, e^{-i_j(\vec{o}_t)}$$
Oh NO! The very small number came back again!
Define the scaled forward probability, $\hat\alpha_t(j)$, such that $\sum_{j=1}^{N} \hat\alpha_t(j) = 1$:
$$\hat\alpha_t(j) = \frac{\sum_{i=1}^{N} \hat\alpha_{t-1}(i)\, a_{ij}\, e^{-i_j(\vec{o}_t)}}{\sum_{k=1}^{N} \sum_{i=1}^{N} \hat\alpha_{t-1}(i)\, a_{ik}\, e^{-i_k(\vec{o}_t)}}$$
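One time step of this recursion might look like the following numpy sketch (the array names are illustrative; subtracting the smallest surprisal before exponentiating is an optional extra safeguard that cancels in the normalization).

```python
import numpy as np

def scaled_forward_step(alphahat_prev, A, surprisal_t):
    """One frame of the scaled forward algorithm.
    alphahat_prev: length-N vector of alphahat_{t-1}(i).
    A:             N x N transition matrix, A[i, j] = a_ij.
    surprisal_t:   length-N vector of i_j(o_t).
    Returns alphahat_t, normalized so it sums to 1."""
    b = np.exp(-(surprisal_t - surprisal_t.min()))  # proportional to e^{-i_j(o_t)}
    alpha = (alphahat_prev @ A) * b                 # sum_i alphahat_{t-1}(i) a_ij ...
    return alpha / alpha.sum()                      # the denominator on this slide
```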
Similarly, define the scaled backward probability, $\hat\beta_t(i)$, such that $\sum_{i=1}^{N} \hat\beta_t(i) = 1$:
$$\hat\beta_t(i) = \frac{\sum_{j=1}^{N} a_{ij}\, e^{-i_j(\vec{o}_{t+1})}\, \hat\beta_{t+1}(j)}{\sum_{k=1}^{N} \sum_{j=1}^{N} a_{kj}\, e^{-i_j(\vec{o}_{t+1})}\, \hat\beta_{t+1}(j)}$$
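And the matching backward step, under the same illustrative conventions:

```python
import numpy as np

def scaled_backward_step(betahat_next, A, surprisal_next):
    """One frame of the scaled backward algorithm.
    betahat_next:   length-N vector of betahat_{t+1}(j).
    A:              N x N transition matrix, A[i, j] = a_ij.
    surprisal_next: length-N vector of i_j(o_{t+1}).
    Returns betahat_t, normalized so it sums to 1."""
    b = np.exp(-(surprisal_next - surprisal_next.min()))
    beta = A @ (b * betahat_next)  # sum_j a_ij e^{-i_j(o_{t+1})} betahat_{t+1}(j)
    return beta / beta.sum()
```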
Notice that the denominator is independent of $i$ or $j$. So the difference between $\alpha_t(j)$ and $\hat\alpha_t(j)$ is a scaling factor (let's call it $g_t$) that doesn't depend on $j$:
$$\hat\alpha_t(j) = \frac{1}{g_t} \sum_{i=1}^{N} \hat\alpha_{t-1}(i)\, a_{ij}\, e^{-i_j(\vec{o}_t)} = \cdots = \frac{\alpha_t(j)}{\prod_{\tau=1}^{t} g_\tau}$$
Likewise, the difference between $\beta_t(i)$ and $\hat\beta_t(i)$ is some other scaling factor (let's call it $h_t$) that doesn't depend on $i$:
$$\hat\beta_t(i) = \frac{1}{h_t} \sum_{j=1}^{N} a_{ij}\, e^{-i_j(\vec{o}_{t+1})}\, \hat\beta_{t+1}(j) = \cdots = \frac{\beta_t(i)}{\prod_{\tau=t+1}^{T} h_\tau}$$
So we can calculate gamma as:
$$\gamma_t(j) = \frac{\alpha_t(j)\,\beta_t(j)}{\sum_{k=1}^{N} \alpha_t(k)\,\beta_t(k)} = \frac{\alpha_t(j)\,\beta_t(j) \,\Big/ \left(\prod_{\tau=1}^{t} g_\tau \prod_{\tau=t+1}^{T} h_\tau\right)}{\sum_{k=1}^{N} \alpha_t(k)\,\beta_t(k) \,\Big/ \left(\prod_{\tau=1}^{t} g_\tau \prod_{\tau=t+1}^{T} h_\tau\right)} = \frac{\hat\alpha_t(j)\,\hat\beta_t(j)}{\sum_{k=1}^{N} \hat\alpha_t(k)\,\hat\beta_t(k)}$$
In other words, the scaling (of the scaled forward-backward algorithm) has no effect at all on the calculation of gamma and xi!!
$$\gamma_t(j) = \frac{\hat\alpha_t(j)\,\hat\beta_t(j)}{\sum_{k=1}^{N} \hat\alpha_t(k)\,\hat\beta_t(k)}, \qquad \xi_t(i,j) = \frac{\hat\alpha_t(i)\, a_{ij}\, e^{-i_j(\vec{o}_{t+1})}\, \hat\beta_{t+1}(j)}{\sum_{k=1}^{N} \sum_{\ell=1}^{N} \hat\alpha_t(k)\, a_{k\ell}\, e^{-i_\ell(\vec{o}_{t+1})}\, \hat\beta_{t+1}(\ell)}$$
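Because the unknown scale factors cancel, the posteriors can be computed straight from the hatted arrays, e.g. (illustrative names again):

```python
import numpy as np

def gamma_xi(alphahat_t, betahat_t, betahat_next, A, surprisal_next):
    """Posteriors for frame t from the *scaled* forward-backward variables;
    the products of g_tau and h_tau cancel in both ratios."""
    gamma_t = alphahat_t * betahat_t
    gamma_t = gamma_t / gamma_t.sum()

    b = np.exp(-(surprisal_next - surprisal_next.min()))
    # xi_t[i, j] proportional to alphahat_t(i) a_ij e^{-i_j(o_{t+1})} betahat_{t+1}(j)
    xi_t = alphahat_t[:, None] * A * (b * betahat_next)[None, :]
    xi_t = xi_t / xi_t.sum()
    return gamma_t, xi_t
```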
$$\vec\mu_m = \frac{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \gamma_t(m)\, \vec{o}_t}{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \gamma_t(m)}$$
$$\vec\sigma_m^2 = \frac{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \gamma_t(m)\, \left(\vec{o}_t - \vec\mu_m\right)^2}{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \gamma_t(m)}$$
$$TPM(m,n) = \frac{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \xi_t(m,n)}{\sum_{u=1}^{U} \sum_{t=1}^{T_u} \gamma_t(m)}$$
Define the following index variables:
- $u = 1, \ldots, U$: index of the training utterance
- $t = 1, \ldots, T_u$: index of the frame within utterance $u$
- $m, n$: indices of the types
And, for convenience:
- $\vec\mu_m$ = mean vector for the m'th type
- $\vec\sigma_m^2$ = variance vector for the m'th type
- $TPM(m,n)$ = transition probability from type $m$ to type $n$
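A compact numpy sketch of these three re-estimation formulas, assuming the per-frame gammas and xis have already been concatenated across utterances (MP5's actual arrays and its token-to-type bookkeeping will differ):

```python
import numpy as np

def reestimate(gammas, xis, obs):
    """Baum-Welch re-estimation at the type level.
    gammas: T x M, gammas[t, m] = gamma_t(m), all utterances concatenated.
    xis:    T x M x M, xis[t, m, n] = xi_t(m, n), likewise concatenated.
    obs:    T x D observation vectors o_t.
    Returns (mu, var, TPM)."""
    denom = gammas.sum(axis=0)                        # sum_u sum_t gamma_t(m)
    mu = (gammas.T @ obs) / denom[:, None]            # M x D weighted means
    dev2 = (obs[:, None, :] - mu[None, :, :]) ** 2    # T x M x D squared deviations
    var = np.einsum('tm,tmd->md', gammas, dev2) / denom[:, None]
    TPM = xis.sum(axis=0)                             # M x M expected counts
    TPM = TPM / TPM.sum(axis=1, keepdims=True)        # normalize each row
    return mu, var, TPM
```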
What to do in MP5:
- set_surprisal computes $i_i(\vec{o})$ directly, without computing $b_i(\vec{o})$.
- As in slides 26 and 27, this allows you to immediately normalize alphahat and betahat so that they each sum to 1.
- You get $\gamma_t(j)$ and $\xi_t(i,j)$ directly from $\hat\alpha_t(i)$ and $\hat\beta_{t+1}(j)$, despite the scaling!
- As in slide 35, the only trick is that you have to be careful about the token-to-type mapping.