Structured Discriminative Models for Speech Recognition
Mark Gales - work with Anton Ragni, Austin Zhang, Rogier van Dalen
September 2012
Cambridge University Engineering Department
Symposium on Machine Learning in Speech and Language Processing
Overview
– generative and discriminative models
– discrete and continuous observation forms
– generative score-spaces and log-linear models
– efficient feature extraction
– large-margin-based training
– AURORA-2 and AURORA-4 experimental results
Hidden Markov Model - a Generative Model
[Figure: (a) standard HMM phone topology with transition probabilities a22, a23, a33, a34, a44, a45 and output distributions b2(), b3(), b4(); (b) HMM Dynamic Bayesian Network]
– observations conditionally independent of other observations given the state
– states conditionally independent of other states given the previous state

  p(O; λ) = Σ_q Π_{t=1}^T P(q_t | q_{t-1}) p(o_t | q_t; λ)
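A minimal forward-algorithm sketch of this factorisation, assuming toy diagonal-Gaussian output densities (the function names and parameter layout are illustrative, not from the talk):

```python
import numpy as np

def log_gaussian(o, mean, var):
    # log N(o; mean, diag(var)) for a single observation vector
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mean) ** 2 / var)

def hmm_log_likelihood(O, log_A, log_pi, means, variances):
    """Forward algorithm: log p(O; lambda) = log sum_q prod_t P(q_t|q_{t-1}) p(o_t|q_t; lambda)."""
    T, J = len(O), len(log_pi)
    log_alpha = np.array([log_pi[j] + log_gaussian(O[0], means[j], variances[j])
                          for j in range(J)])
    for t in range(1, T):
        new = np.empty(J)
        for j in range(J):
            # sum over predecessor states, in the log domain for numerical stability
            new[j] = (np.logaddexp.reduce(log_alpha + log_A[:, j])
                      + log_gaussian(O[t], means[j], variances[j]))
        log_alpha = new
    return np.logaddexp.reduce(log_alpha)
```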
Discriminative Models
– generative model classification uses Bayes' rule, e.g. for HMMs:

  P(w|O; λ) = p(O|w; λ)P(w) / Σ_w̃ p(O|w̃; λ)P(w̃)

– discriminative models directly model the posterior of the word sequence given the observations, here in log-linear form:

  P(w|O; α) = (1/Z) exp( αᵀφ(O, w) ),   Z = Σ_w̃ exp( αᵀφ(O, w̃) )
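A small sketch of evaluating this log-linear posterior, assuming the feature vectors φ(O, w) for a set of candidate hypotheses have already been extracted (the dictionary-based interface is illustrative):

```python
import numpy as np

def log_linear_posterior(alpha, features, target):
    """P(w|O; alpha) = exp(alpha^T phi(O, w)) / sum_w' exp(alpha^T phi(O, w')).

    `features` maps each candidate hypothesis w to its feature vector phi(O, w).
    """
    scores = {w: float(alpha @ phi) for w, phi in features.items()}
    m = max(scores.values())                       # subtract max for numerical stability
    Z = sum(np.exp(s - m) for s in scores.values())
    return np.exp(scores[target] - m) / Z
```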
Example Standard Sequence Models
[Figure: dynamic Bayesian networks for the HMM, MEMM and (H)CRF]

– maximum entropy Markov model [4]:

  P(q|O) = Π_{t=1}^T (1/Z_t) exp( αᵀφ(q_{t-1}, q_t, o_t) )

– (hidden) conditional random field:

  P(q|O) = (1/Z) Π_{t=1}^T exp( αᵀφ(q_{t-1}, q_t, o_t) )
Sequence Discriminative Models
– (1) actually want word-sequence posteriors P(w|O)
  – motivates the use of structured discriminative models
– (2) observations are variable-length sequences
  – motivates the use of sequence kernels to obtain fixed-dimensional features
– addressed by combining the solutions to (1) and (2)
Code-Breaking Style
– perform a simpler classification for each segment
  – complexity determined by the segment (simplest: word)

[Figure: utterance segmented by the HMM into word segments, e.g. "FOUR ONE SEVEN", with each segment classified separately]

– segmentation (word start/end times) generated by HMMs
– binary SVMs combined by voting:

  argmax_{ω ∈ {ONE,...,SIL}}  α(ω)ᵀ φ(O_{a_i}, ω)

– each segment is treated independently
– restricted to one segmentation, generated by the HMMs
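A hedged sketch of the per-segment voting idea, assuming the HMM segmentation and a set of trained pairwise classifiers are available (the `svm_scores` interface and names are hypothetical):

```python
from itertools import combinations

def classify_segment(phi, svm_scores, vocab):
    """One-vs-one voting over binary SVMs for a single HMM-derived segment.

    `svm_scores[(w1, w2)](phi)` is assumed to return the decision value of the
    (w1 vs w2) classifier: positive votes for w1, negative votes for w2.
    `vocab` is an ordered list of word labels, e.g. ["ONE", ..., "SIL"].
    """
    votes = {w: 0 for w in vocab}
    for w1, w2 in combinations(vocab, 2):
        votes[w1 if svm_scores[(w1, w2)](phi) > 0 else w2] += 1
    # ties broken arbitrarily here; a real system can fall back to the HMM decision
    return max(votes, key=votes.get)
```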
Flat Direct Models
[Figure: dynamic Bayesian network for the flat direct model]

  P(w|O) = (1/Z) exp( αᵀφ(O_{1:T}, w) )

– extracted feature-space becomes vast (number of possible sentences)
– associated parameter vector is vast
– (possibly) large number of unseen examples
Structured Discriminative Models
[Figure: dynamic Bayesian network for the structured discriminative model with segmentation a]

– the acoustic segmentation a comprises: segment identities a_τ^i and the associated observations O_{a_τ}

  P(w|O) = (1/Z) exp( αᵀ Σ_{τ=1}^{|a|} φ(O_{a_τ}, a_τ^i) )

– segmentation may be at the word, (context-dependent) phone, etc. level
– what form should the features φ(O_{a_τ}, a_τ^i) have?
  – must be able to handle variable-length O_{a_τ}
Features
– basic features: second-order statistics - (almost) a discriminative HMM
– simplest approach: extend frame features (for each unit w(k)) [6]

  φ(O_{a_τ}, a_τ^i) = [ ... ; Σ_t δ(a_τ^i, w(k)) o_t ; Σ_t δ(a_τ^i, w(k)) o_t ⊗ o_t ; Σ_t δ(a_τ^i, w(k)) o_t ⊗ o_t ⊗ o_t ; ... ]

– these features have the same conditional independence assumptions as the HMM

How to extend the range of features?
– the number of frames varies from segment to segment
– need to map to a fixed dimensionality independent of the number of frames
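A simple sketch of fixed-dimensional statistics for a variable-length segment; the duration normalisation (averages rather than raw sums) is an assumption made here for illustration:

```python
import numpy as np

def segment_moment_features(O_seg, order=2):
    """Map a variable-length segment (T x d array) to fixed-length statistics.

    Stacks the per-frame averages of o_t and o_t (x) o_t, so the feature
    dimensionality is independent of the number of frames in the segment.
    """
    O_seg = np.asarray(O_seg, dtype=float)
    feats = [np.array([1.0]),                      # zeroth-order term
             O_seg.mean(axis=0)]                   # first-order statistics
    if order >= 2:
        outer = np.einsum('ti,tj->ij', O_seg, O_seg) / len(O_seg)
        feats.append(outer.ravel())                # second-order statistics
    return np.concatenate(feats)
```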
Sequence Kernel
– also applied in a range of biological applications, text processing and speech
– these kernels may be partitioned into three broad classes:
  – discrete-observation kernels: appropriate for text data; string kernels are the simplest form
  – distributional kernels: distances between distributions trained on the sequences
  – generative kernels: parametric form uses the parameters of the generative model; derivative form uses the derivatives with respect to the model parameters
String Kernel
– use a kernel to map from a variable-length to a fixed-length representation
– string kernels are an example for text [9]

             c-a   c-t   c-r   a-r   r-t   b-a   b-r
  φ(cat)      1     λ
  φ(cart)     1     λ²    λ     1     1
  φ(bar)                        1           1

  K(cat, cart) = 1 + λ³,   K(cat, bar) = 0,   K(cart, bar) = 1

– how can the process be made efficient (and more general)?
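A small sketch of a gap-weighted bigram feature map and the induced kernel. The weighting convention (λ per skipped character) and the use of all character pairs are assumptions, so the values differ slightly from the illustrative table above:

```python
from collections import defaultdict

def gappy_bigram_features(s, lam=0.5):
    """phi(s): gapped character pairs, each weighted lam**(number of skipped characters)."""
    phi = defaultdict(float)
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            phi[(s[i], s[j])] += lam ** (j - i - 1)
    return phi

def string_kernel(s1, s2, lam=0.5):
    """K(s1, s2) = <phi(s1), phi(s2)>, computed from the sparse feature maps."""
    phi1, phi2 = gappy_bigram_features(s1, lam), gappy_bigram_features(s2, lam)
    return sum(v * phi2.get(k, 0.0) for k, v in phi1.items())
```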
Rational Kernels
– bag-of-words and N-gram counts, gappy N-grams (string kernel), etc. can be computed with weighted finite-state transducers

[Figure: weighted transducer T for counting (gappy) bigrams, with gap penalty λ]

– the kernel is obtained by transducer composition:

  K(O_i, O_j) = w[ O_i ∘ T ∘ T⁻¹ ∘ O_j ]

– lattices can be used rather than the 1-best output (O_i)
Generative Score-Spaces
– log-likelihood: the simplest form maps the sequence to a 1-dimensional score-space

  φ(O; λ) = [ log p(O; λ) ]

– parametric: parameters estimated on O; related to the mean-supervector kernel

  φ(O; λ) = [ λ̂(1)ᵀ ... λ̂(K)ᵀ ]ᵀ

– derivative: using the appropriate metric this is the Fisher kernel [13]

  φ(O; λ) = [ ∇_λ log p(O; λ) ]
Combining Discriminative and Generative Models
[Figure: combining generative and discriminative models: the canonical HMM is adapted/compensated to the test data, used both for recognition (to generate hypotheses) and to compute score-space features φ(O, λ); a discriminative classifier then produces the final hypotheses]

– adapt the generative model - gives a speaker/noise-independent discriminative model
– possible discriminative classifiers:
  – log-linear model / logistic regression
  – binary / multi-class support vector machines
Derivative Score-Spaces
– what about using the sequence-kernel score-spaces? φ(O) = φ(O; λ)
– does this help with the dependencies?

  ∇_{µ^(jm)} log p(O; λ) = Σ_{t=1}^T P(q_t = {θ_j, m} | O; λ) Σ^(jm)⁻¹ (o_t − µ^(jm))

– the state/component posterior is a function of the complete sequence O
  – introduces longer-term dependencies
  – different conditional-independence assumptions than the generative model
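A sketch of computing these derivative features, assuming the component occupation posteriors γ_t(j, m) have already been obtained from a forward-backward pass and the covariances are diagonal:

```python
import numpy as np

def mean_derivative_features(O, posteriors, means, variances):
    """Derivative score-space with respect to the component means:

        d/d mu_jm log p(O; lambda) = sum_t P(q_t = {j, m} | O; lambda) Sigma_jm^{-1} (o_t - mu_jm)

    O          : (T, d) observations
    posteriors : (T, J, M) component occupation probabilities gamma_t(j, m)
    means, variances : (J, M, d) diagonal Gaussian parameters
    """
    T, J, M = posteriors.shape
    grads = np.zeros_like(means)
    for j in range(J):
        for m in range(M):
            diff = (O - means[j, m]) / variances[j, m]   # Sigma^{-1} (o_t - mu)
            grads[j, m] = posteriors[:, j, m] @ diff     # weight by gamma_t(j, m)
    return grads.ravel()                                 # fixed-length feature vector
```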
Score-Space Dependencies
– Class ω1: AAAA, BBBB – Class ω2: AABB, BBAA
[Figure: two-emitting-state HMM with P(A) = P(B) = 0.5 in each state]

[Table: values of the log-likelihood, ∇2A, ∇2A∇ᵀ2A and ∇2A∇ᵀ3A features for the class ω1 sequences (AAAA, BBBB) and the class ω2 sequences (AABB, BBAA)]
– also true of second derivative within a state
Score-Spaces for ASR
– appended log-likelihood (all classes):

  φ^a_0(O; λ) = [ log p(O; λ(1)) ... log p(O; λ(K)) ]ᵀ

– log-likelihood (for class ω_i):

  φ^b_0(O; λ) = [ log p(O; λ(i)) ]

– derivative (means only, for class ω_i):

  φ^b_1µ(O; λ) = [ ∇_{µ(i)} log p(O; λ(i)) ; φ^b_0(O; λ) ]

These per-segment score-spaces are placed into the joint feature-space:

  φ(O, a; λ) = [ Σ_{τ=1}^{|a|} δ(a_τ^i, w(1)) φ(O_{a_τ}; λ) ; ... ; Σ_{τ=1}^{|a|} δ(a_τ^i, w(P)) φ(O_{a_τ}; λ) ]

for α tied to yield "units" {w(1), ..., w(P)}, with underlying score-space φ(O; λ).
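A sketch of assembling this joint feature vector from per-segment score-space features; the data layout (`segment_feats`, `segment_units`) is purely illustrative:

```python
import numpy as np

def joint_features(segment_feats, segment_units, units):
    """phi(O, a; lambda): sum the score-space features of the segments labelled
    with each unit w(p), then stack the per-unit blocks.

    segment_feats[tau] : score-space vector phi(O_{a_tau}; lambda) for segment tau
    segment_units[tau] : the unit (word/phone) hypothesised for segment tau
    units              : ordered list [w(1), ..., w(P)] over which alpha is tied
    """
    dim = len(segment_feats[0])
    blocks = {w: np.zeros(dim) for w in units}
    for phi, w in zip(segment_feats, segment_units):
        blocks[w] += phi                     # delta(a_tau^i, w(p)) selects the block
    return np.concatenate([blocks[w] for w in units])
```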
General Feature Extraction
– consider φ(O_{τ:t}, w_l) for all possible start/end times: T² feature evaluations
– general complexity O(T³), assuming each evaluation is O(T)
Computationally expensive!
Efficient Extraction using Expectation Semiring
[Figure: forward trellis in which each forward probability α_t(j) is propagated together with its accumulated statistics Δ_t(j), combined from the predecessor nodes α_{t−1}(i), Δ_{t−1}(i)]

– extend the statistics propagated/combined in the forward pass
– scalar summation extended to vector summation
– derivative features can be computed for any node in the trellis - O(T²)
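A minimal sketch of the expectation-semiring idea for a single forward pass: each trellis value carries a (probability, statistic) pair, so expected statistics are accumulated alongside the forward probabilities. Working in the plain probability domain is a simplification, and the per-frame statistics `stat` stand in for whatever quantities the derivative features require:

```python
# Expectation-semiring element: a pair (probability mass, accumulated statistic).
#   plus : (p1, r1) (+) (p2, r2) = (p1 + p2, r1 + r2)
#   times: (p1, r1) (x) (p2, r2) = (p1 * p2, p1 * r2 + p2 * r1)
def es_plus(a, b):
    return (a[0] + b[0], a[1] + b[1])

def es_times(a, b):
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def forward_expected_statistic(obs_lik, A, pi, stat):
    """Forward pass propagating (probability, statistic) pairs.

    obs_lik[t][j] : p(o_t | q_t = j), assumed precomputed
    stat[t][j]    : statistic attached to emitting o_t in state j
    Returns p(O) and E[ sum_t stat[t][q_t] | O ].
    """
    T, J = len(obs_lik), len(pi)
    alpha = [es_times((pi[j], 0.0), (obs_lik[0][j], obs_lik[0][j] * stat[0][j]))
             for j in range(J)]
    for t in range(1, T):
        new = []
        for j in range(J):
            acc = (0.0, 0.0)
            for i in range(J):
                acc = es_plus(acc, es_times(alpha[i], (A[i][j], 0.0)))
            new.append(es_times(acc, (obs_lik[t][j], obs_lik[t][j] * stat[t][j])))
        alpha = new
    p, r = 0.0, 0.0
    for a in alpha:
        p, r = es_plus((p, r), a)
    return p, r / p
```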
Handling Speaker/Noise Differences
– not a problem with generative kernels/score-spaces: adapt the generative models using model-based adaptation/compensation
– (Constrained) Maximum Likelihood Linear Regression [15]:

  x_t = A o_t + b;   µ̂^(m) = A µ_x^(m) + b

– Vector Taylor Series compensation [16] (used in this work):

  µ^(m) = C log( exp( C⁻¹( µ_x^(m) + µ_h ) ) + exp( C⁻¹ µ_n ) )
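A sketch of this static mean compensation, assuming the DCT and inverse-DCT matrices and the channel and additive-noise means are given:

```python
import numpy as np

def vts_compensate_mean(mu_x, mu_h, mu_n, C, C_inv):
    """Static mean compensation used by VTS:

        mu_y = C log( exp(C^{-1}(mu_x + mu_h)) + exp(C^{-1} mu_n) )

    C / C_inv map between the cepstral and log-(filterbank-)spectral domains;
    mu_h and mu_n are the channel and additive-noise means.
    """
    clean_plus_channel = np.exp(C_inv @ (mu_x + mu_h))   # log-spectral domain
    noise = np.exp(C_inv @ mu_n)
    return C @ np.log(clean_plus_channel + noise)
```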
Simple MMIE Example
[Figure: MLE solution (diagonal covariance) and MMIE solution decision boundaries]
– use to train the discriminative model parameters α
Discriminative Training Criteria
– Conditional Maximum Likelihood (CML) [21, 22]: maximise

  F_cml(α) = (1/R) Σ_{r=1}^R log P(w_ref^(r) | O^(r); α)

– Minimum Classification Error (MCE) [23]: minimise

  F_mce(α) = (1/R) Σ_{r=1}^R ( 1 + [ P(w_ref^(r) | O^(r); α) / Σ_{w≠w_ref^(r)} P(w | O^(r); α) ]^ϱ )⁻¹

– Minimum Bayes' Risk (MBR) [24, 25]: minimise

  F_mbr(α) = (1/R) Σ_{r=1}^R Σ_w P(w | O^(r); α) L(w, w_ref^(r))
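As a toy illustration, the CML and MBR criteria can be evaluated directly from hypothesis posteriors; the dictionary-based interface below is purely illustrative:

```python
import numpy as np

def cml_objective(posteriors, refs):
    """F_cml = (1/R) sum_r log P(w_ref^(r) | O^(r); alpha).

    posteriors[r] maps each hypothesis w to P(w | O^(r); alpha) (assumed normalised).
    """
    return float(np.mean([np.log(post[ref]) for post, ref in zip(posteriors, refs)]))

def mbr_objective(posteriors, refs, loss):
    """F_mbr = (1/R) sum_r sum_w P(w | O^(r); alpha) * L(w, w_ref^(r))."""
    return float(np.mean([sum(p * loss(w, ref) for w, p in post.items())
                          for post, ref in zip(posteriors, refs)]))
```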
Large Margin Based Criteria
[Figure: cost against the log-posterior-ratio, with the error, correct-but-within-margin and beyond-margin regions]

– improves generalisation
– the margin is the log-posterior-ratio to the closest competing hypothesis:

  min_{w≠w_ref} log( P(w_ref|O; α) / P(w|O; α) )

– large-margin criterion (minimise):

  F_lm(α) = (1/R) Σ_{r=1}^R max_{w≠w_ref^(r)} [ L(w, w_ref^(r)) − log( P(w_ref^(r)|O^(r); α) / P(w|O^(r); α) ) ]₊

using the hinge-loss [f(x)]₊. Many variants possible [26, 27, 28, 29].
Relationship to (Structured) SVM
– add a Gaussian prior over the discriminative model parameters:

  F(α) = −log N(α; µ_α, Σ_α) + F_lm(α)

– restrict the parameters of the prior, N(α; µ_α, Σ_α) = N(α; 0, C I), giving

  F(α) = (1/2)||α||² + (C/R) Σ_{r=1}^R max_{w≠w_ref^(r)} [ L(w, w_ref^(r)) − αᵀ( φ(O^(r), w_ref^(r); λ) − φ(O^(r), w; λ) ) ]₊
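A sketch of evaluating this structured-SVM objective over a small set of competing hypotheses with precomputed joint features; the example layout is hypothetical (a real system enumerates competitors from lattices):

```python
import numpy as np

def ssvm_objective(alpha, examples, C):
    """0.5*||alpha||^2 + (C/R) * sum_r max_{w != w_ref} [ L(w, w_ref)
           - alpha^T ( phi(O^(r), w_ref) - phi(O^(r), w) ) ]_+

    Each example is (features, ref, loss): `features` maps hypotheses w to their
    joint feature vectors phi(O^(r), w), `ref` is w_ref^(r), `loss[w]` is L(w, w_ref^(r)).
    """
    hinge_sum = 0.0
    for features, ref, loss in examples:
        ref_score = float(alpha @ features[ref])
        violations = [loss[w] - (ref_score - float(alpha @ phi))
                      for w, phi in features.items() if w != ref]
        hinge_sum += max(0.0, max(violations, default=0.0))
    return 0.5 * float(alpha @ alpha) + C / len(examples) * hinge_sum
```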
Structured SVM Training
[Figure: training samples: for each utterance O^(r) the reference word sequence w_ref^(r) (e.g. "1 2 3", "4 5 6") and competing word sequences (e.g. "0 0 0", "9 9 9") define the constraints]

– training minimises the convex objective

  (1/2)||α||² + (C/R) Σ_{r=1}^R [ −αᵀφ(O^(r), w_ref^(r))  (linear)  +  max_{w≠w_ref^(r)} { L(w, w_ref^(r)) + αᵀφ(O^(r), w) }  (convex) ]₊
Handling Latent Variables
– for the SSVM it is necessary to use the "best" segmentation
– using the HMM segmentation is the equivalent of phone/word-marked lattices:

  â_hmm = argmax_a { log P(a|O, w; λ) } = argmax_a { log( p(O|a, w; λ) P(a|w; λ) ) }

– BUT the underlying model changes: would like

  â = argmax_a { log p(O|a, w; λ, α) + log P(a|w; λ, α) }

Maps into a Concave-Convex Procedure (CCCP) [34]:

  (1/2)||α||² + C Σ_i [ −max_a { αᵀφ(O^(i), w_ref^(i), a) }  (concave)  +  max_{w≠w_ref, a} { L(w, w_ref^(i)) + αᵀφ(O^(i), w, a) }  (convex) ]₊
Preliminary Evaluation Tasks
AURORA-2 (small-vocabulary digit-string task):
– whole-word models, 16 emitting states with 3 components per state
– clean training data for HMM training; HTK parametrisation
– Set B and Set C are unseen noise conditions, even for multi-style data
– noise estimated in an ML fashion for each utterance

AURORA-4 (medium-vocabulary task based on WSJ0):
– training data from WSJ0 SI-84 used to train clean acoustic models
– state-clustered cross-word triphones (≈3k states, ≈50k components)
– range of noises added at 5-15dB SNR
– noise estimated in an ML fashion for each utterance

– don't compare results across tables!
AURORA-2 - Training Criterion
Model Criterion Test set Avg A B C HMM — 9.8 9.1 9.5 9.5 LLM CML 8.1 7.7 8.3 8.1 (φa
0)
MWE 7.9 7.4 8.2 7.9 LM 7.8 7.3 8.0 7.6
– very few additional parameters added (12 × 12 = 144) for log-linear models (though these parameters are discriminatively trained
AURORA-2 - Support Vector Machines
  Model   Features   A     B     C     Avg
  HMM     –          9.8   9.1   9.5   9.5
  SVM     φ^a        9.1   8.7   9.2   9.0
  MSVM               8.3   8.1   8.6   8.3
  SSVM               7.8   7.3   8.0   7.6

– segmentation for the SVMs and multi-class SVMs (MSVMs) obtained from the HMM
– majority voting for the binary SVMs (HMM decision used for ties)
– this does have an important impact on the performance
AURORA-2 - Derivative Score-Spaces - MWE Criterion
  HMM    SDM      â        A     B     C     Avg
  VTS    –        –        9.8   9.1   9.5   9.5
         φ^b_1µ   â_hmm    7.0   6.6   7.6   7.0
                  â        6.8   6.4   7.3   6.7
  VAT    –        –        8.9   8.3   8.8   8.6
         φ^b_1µ   â_hmm    6.6   6.5   7.0   6.6
                  â        6.2   6.1   6.8   6.3
  DVAT   –        –        6.7   6.6   7.0   6.7
         φ^b_1µ   â_hmm    6.1   6.2   6.7   6.3
                  â        6.1   6.1   6.6   6.2

– the structured discriminative models (φ^b_1µ) give consistent gains over all baseline HMM systems
– the derivative score-space is larger (1873 dimensions for each base score-space)
– adds approximately 50% more parameters to the system
AURORA-4 - Derivative Score-Space - MPE Criterion
  System        A     B      C      D      Avg
  VTS           7.1   15.3   12.1   23.1   17.9
  VAT           8.6   13.8   12.0   20.1   16.0
  DVAT          7.2   12.8   11.5   19.7   15.3
  VAT+φ^b_0     7.7   13.1   11.0   19.5   15.3
  VAT+φ^b_1µ    7.4   12.6   10.7   19.0   14.8

– the single-dimension score-space (φ^b_0) with the VAT system yields DVAT-level performance
– need to look at DVAT+φ^b_1µ (need to try on more data)
Conclusions
– use generative models to derive features for the discriminative model
– robustness and adaptation achieved by adapting the underlying acoustic model
– different conditional independence assumptions to the underlying model
– a systematic way to incorporate different dependencies into the model
– yields a structured SVM (standard optimisation code can be used)
– still an issue scaling to large tasks/score-spaces

Interesting classifier options - without throwing away HMMs
Acknowledgements
– Cambridge Research Lab, Toshiba Research Europe Ltd
– EPSRC - Generative Kernels and Score-Spaces for Classification of Speech
References
[1]
[2] Proceedings of ICASSP, 2011.
[3] “Support vector machines for segmental minimum Bayes risk decoding of continuous speech,” in Proc. ASRU, 2003.
[4] H.-K. Kuo and Y. Gao, “Maximum entropy direct models for speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, 2006.
[5] 2005.
[6] … R. Schlüter, and H. Ney, “Investigations on features for log-linear acoustic models in continuous speech recognition,” in Proc. ASRU, 2009, pp. 52–57.
[7] Processing, vol. 4, pp. 994–1006, 2010.
[8] Magazine, 2012.
[9] Journal of Machine Learning Research, vol. 2, pp. 419–444, 2002.
[10]
[11] M. I. Layton and M. J. F. Gales, “Acoustic modelling using continuous rational kernels,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, August 2007.
[12] N. D. Smith and M. J. F. Gales, “Speech recognition using SVMs,” in Advances in Neural Information Processing Systems, 2001.
[13] Advances in Neural Information Processing Systems 11, S. A. Solla and D. A. Cohn, Eds., pp. 487–493, MIT Press, 1999.
[14]
[15] M. J. F. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech and Language, vol. 12, pp. 75–98, 1998.
[16] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, “HMM adaptation using vector Taylor series for noisy speech recognition,” in
[17]
[18] 152–157.
[19]
[20]
[21] P. S. Gopalakrishnan, D. Kanevsky, A. Nádas, and D. Nahamoo, “An inequality for rational functions with applications to some statistical estimation problems,” IEEE Trans. Information Theory, 1991.
[22] Computer Speech & Language, vol. 16, pp. 25–47, 2002.
[23] B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Transactions on Signal Processing, 1992.
[24] in Proc. ICSLP, 2000.
[25] Statistical Modelling for Speech Recognition, 2006.
[26]
[27]
[28] G. Heigold, T. Deselaers, R. Schlüter, and H. Ney, “Modified MMI/MPE: A direct evaluation of the margin in speech recognition,” in
[29] G. Saon and D. Povey, “Penalty function maximization for large margin HMM training,” in Proc. Interspeech, 2008.
[30] S.-X. Zhang, A. Ragni, and M. J. F. Gales, “Structured log linear models for noise robust speech recognition,” IEEE Signal Processing Letters, vol. 17, pp. 945–948, 2010.
[31] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” Journal of Machine Learning Research, vol. 6, pp. 1453–1484, 2005.
[32] T. Joachims, T. Finley, and C.-N. J. Yu, “Cutting-plane training of structural SVMs,” Machine Learning, vol. 77, no. 1, pp. 27–59, 2009.
[33] S.-X. Zhang and M. J. F. Gales, “Extending noise robust structured support vector machines to larger vocabulary tasks,” in Proc. ASRU, 2011.
[34] C.-N. Yu and T. Joachims, “Learning structural SVMs with latent variables,” in Proc. ICML, 2009.
[35] W. M. Campbell, D. Sturim, D. A. Reynolds, and A. Solomonoff, “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” in Proc. ICASSP, 2006.
Distributional Kernels
– using the available data, estimate a distribution given each sequence:

  λ(i) = argmax_λ { log p(O_i; λ) }

– Kullback-Leibler divergence:

  KL(f_i || f_j) = ∫ f_i(O) log( f_i(O) / f_j(O) ) dO

– Bhattacharyya affinity measure:

  B(f_i, f_j) = ∫ √( f_i(O) f_j(O) ) dO
Joint Feature-Space Example
[Figure: joint feature-space example: the per-segment generative features, log p(O_{a_τ}; λ(k)) for k = 1, ..., K, are placed into the block corresponding to the hypothesised word (here "ONE", "TWO"), illustrated for competing word-sequence hypotheses W]
GMM Mean-Supervector Kernel
– use the symmetric KL-divergence: KL(f_i||f_j) + KL(f_j||f_i)
– use the matched-pair KL-divergence approximation
– GMM distributions only differ in terms of the means
– use the polarisation identity

  K(O_i, O_j; λ) = Σ_{m=1}^M c_m µ^(im)ᵀ Σ^(m)⁻¹ µ^(jm)

– µ^(im) is the (ML or MAP) mean for component m estimated using sequence O_i
– BUT required to explicitly operate in the feature-space
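A sketch of this kernel, assuming the per-sequence component means have already been estimated (e.g. by MAP adaptation) and the component weights and diagonal covariances are shared:

```python
import numpy as np

def supervector_kernel(means_i, means_j, weights, variances):
    """K(O_i, O_j; lambda) = sum_m c_m mu_im^T Sigma_m^{-1} mu_jm

    means_i[m], means_j[m]   : component means estimated on O_i and O_j
    weights[m], variances[m] : shared component weights and diagonal covariances
    """
    return sum(c * float(mi @ (mj / var))
               for c, mi, mj, var in zip(weights, means_i, means_j, variances))
```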
AURORA-2 - Optimising Segmentation
  Model   Training        Segmentation {trn, tst}   A     B     C     Avg
  HMM     –               –                         9.8   9.1   9.5   9.5
  SSVM    n-slack         {â_hmm, â_hmm}            7.8   7.3   8.0   7.6
                          {â_hmm, â}                7.6   7.2   8.0   7.5
  SSVM    n-slack batch   {â_hmm, â_hmm}            7.9   7.4   8.2   7.8
                          {â_hmm, â}                7.8   7.2   8.0   7.6
                          {â, â}                    7.6   7.1   7.8   7.4
  SSVM    1-slack         {â_hmm, â}                7.6   7.3   7.9   7.5

– the n-slack batch and 1-slack schemes perform similarly to the full approach
AURORA-4 - Structured SVM Results
– 1-slack variable training
– prior distribution matched to the score-space φ^a_0, mean set to 1/(LM scale)
– α tied at the monophone level (47 classes)

  Model   Segmentation {trn, tst}   A     B      C      D      Avg
  HMM     –                         7.1   15.3   12.1   23.1   17.9
  SSVM    {â_hmm, â_hmm}            7.5   14.3   11.4   21.9   16.9
          {â_hmm, â}                7.4   14.2   11.3   21.9   16.8

– disappointing gain from segmentation optimisation - though it is only applied in test at the moment
– working on optimal training segmentation as well
AURORA-4 - Derivative Score-Space
  Classes (tied α)   System   Comp   A     B      C      D      Avg
  –                  VTS      –      7.1   15.3   12.1   23.1   17.9
  47                 φ^b_1µ   yes    7.5   14.1   11.3   21.6   16.6
                              no     7.4   14.3   11.7   21.9   16.9
  4020               φ^b_1µ   yes    6.8   13.7   10.6   21.3   16.2
                              no     6.7   13.5   10.2   21.1   16.0

– derivative score-spaces give large gains over the (ML VTS) baseline
Standard HMM Algorithms
[Figure: HMM trellis over states j and time t]

– based on the forward-backward/Viterbi algorithms:

  γ_t(j) = P(q_t^(j) | O_{1:T}; λ) = (1 / p(O_{1:T}; λ)) · p(O_{1:t}, q_t^(j); λ) · p(O_{t+1:T} | q_t^(j); λ)

– time/memory requirement O(T) + O(T)
Structured Discriminative Models
[Figure: dynamic Bayesian network for the structured discriminative model, with segment boundaries within the frame sequence o_1, ..., o_T]

  P(w_{1:L} | O_{1:T}; α) = (1/Z) exp( αᵀ Σ_{τ=1}^{|a|} φ(O_{a_τ}, a_τ^i) )

– alignment unknown: marginalised over in training (or the 1-best alignment taken)
– need to use a sequence kernel or score-space
Forward/Backward Caching
– compute backward probabilities: O(T) possible backward passes
– the intersection of the forward and backward passes yields the required posterior
Segmentation
– sentence: yields the flat direct model - standard problems
– word: easy implementation for small vocabularies, sparsity issues
– phone: may be context-dependent
– state: very flexible, but a large number of segments
– multiple segmentations can be used to derive features
Approximate Training/Inference Schemes
– simplest approach: use the Viterbi (1-best) segmentation from the HMM, â_hmm
– use this fixed segmentation in training and test - highly efficient:

  P(w|O) ≈ (1/Z) exp( αᵀ Σ_{τ=1}^{|â_hmm|} φ(O_{â_hmm,τ}, â_hmm,τ^i) ),   â_hmm = argmax_a { p(O|a; λ) P(a) }

– unclear how accurate/appropriate this is for ASR