Bayesian and Discriminative Speaker Adaptation
Chih-Hsien Huang
Supervisor: Prof. Jen-Tzung Chien, National Cheng Kung University
Outline
INTRODUCTION
LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
CONTRIBUTIONS OF DISSERTATION / KEYPOINTS OF THIS TALK
BAYESIAN DURATION ADAPTATION
DISCRIMINATIVE LINEAR REGRESSION ADAPTATION
EXPERIMENTS
CONCLUSION AND FUTURE WORKS
Speech communication is one of the basic and essential capabilities of human beings. Speech is the only way to exchange information without any tools. Speech control is natural on mobile devices.
Automatic speech recognition is important for broadcast news transcription. High-performance automatic speech recognition and summarization are desirable.
State-of-the-art speech recognizers are based on hidden Markov models (HMMs). Parameter estimation is performed through the EM algorithm. The decoding rule follows the MAP criterion. The goal of a speech recognizer is to minimize the classification error.
Bayes rule and MAP decoding criterion:
P(W|X) = P(X|W) P(W) / P(X)
Ŵ = argmax_W P(X|W) P(W)
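As a toy illustration of the decoding rule (made-up scores for two hypothetical word hypotheses; P(X) is dropped since it does not affect the argmax):

```python
import math

# Hypothetical acoustic likelihoods p(X|W) and language-model priors P(W)
acoustic = {"hello": 1e-4, "yellow": 3e-4}
prior = {"hello": 0.7, "yellow": 0.3}

def map_decode(acoustic, prior):
    # W_hat = argmax_W p(X|W) P(W); computed in the log domain for stability
    return max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(prior[w]))

print(map_decode(acoustic, prior))  # "yellow": the acoustic score outweighs the prior
```

Here the language-model prior favors "hello", but the stronger acoustic evidence for "yellow" wins the argmax.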
Left-to-Right HMM
Parameters of an HMM, λ = {π, A, B}:
Initial probabilities π = {π_i}
Transition probabilities A = {a_ij}
Output probabilities B = {b_i(·)}, modeled as mixtures of Gaussians:
b_j(x) = Σ_{m=1}^{M} c_jm N(x; μ_jm, Σ_jm)
[Figure: three-state left-to-right HMM with self-transitions a_11, a_22, a_33, forward transition a_23, and output densities b_1, b_2, b_3]
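As a minimal sketch (not the thesis implementation), the mixture-of-Gaussians output probability can be evaluated per feature dimension for a diagonal-covariance model:

```python
import math

def gmm_output_prob(x, weights, means, variances):
    """Evaluate b_j(x) = sum_m c_jm * N(x; mu_jm, sigma2_jm) for a scalar
    observation x (one feature dimension of a diagonal-covariance HMM)."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        total += c * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return total

# Single standard-normal component: b(0) = 1/sqrt(2*pi) ~= 0.3989
print(gmm_output_prob(0.0, [1.0], [0.0], [1.0]))
```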
Large Vocabulary Continuous Speech Recognition
[System diagram: speech signal → feature extraction → feature vectors → recognition using hidden Markov models, n-gram language models and a lexicon tree → recognition results]
[Figure: Mandarin lexicon tree sharing initial/final arcs across words, e.g. the path ㄔ ㄥ ㄍ ㄨㄥ ㄉ ㄚ ㄒ ㄩㄝ spells 成功大學 (National Cheng Kung University)]
[Figure: trellis of states 1…J(k) of the kth subsyllable over observations 1…t…T, with transitions within a subsyllable and across subsyllables]
Transitions within a subsyllable:
Q(t, k, j) = b_{k,j}(x_t) · max_{j−1 ≤ j' ≤ j} Q(t−1, k, j')
Transitions across subsyllables:
Q(t, k, 1) = b_{k,1}(x_t) · max{ max_{1 ≤ k' ≤ K} Q(t−1, k', J(k')), Q(t−1, k, 1) }
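The within-subsyllable recursion is a standard left-to-right Viterbi pass. A minimal sketch over precomputed emission scores b_j(x_t) (toy numbers, single unit only, ignoring the cross-subsyllable case):

```python
def viterbi_left_to_right(emission):
    """emission[t][j] = b_j(x_t). Implements
    Q(t, j) = b_j(x_t) * max(Q(t-1, j), Q(t-1, j-1))
    for a strictly left-to-right topology (self-loop or advance by one),
    starting in state 0; returns the best score ending in the last state."""
    T, J = len(emission), len(emission[0])
    Q = [[0.0] * J for _ in range(T)]
    Q[0][0] = emission[0][0]
    for t in range(1, T):
        for j in range(J):
            best_prev = Q[t - 1][j]
            if j > 0:
                best_prev = max(best_prev, Q[t - 1][j - 1])
            Q[t][j] = emission[t][j] * best_prev
    return Q[T - 1][J - 1]

print(viterbi_left_to_right([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]))  # 0.5
```

A real decoder works in the log domain and keeps back pointers; this sketch only shows the recursion's shape.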
[Figure: word-conditioned lexical tree search with V tree copies, one per predecessor word history; language-model look-ahead probabilities P(·|history) for histories such as 從此, 開始, sil, acoustic look-ahead, and hypotheses Q(word history, arc, state) expanded over time t]
Proceed from left to right over time t.
Acoustic level: process states of the lexical trees.
Initialization: Q_v(t−1, s=0) = H(v; t−1), B_v(t−1, s=0) = t−1
Time alignment: Q_v(t, s) = max_{s'} { p(x_t, s | s') · Q_v(t−1, s') }
Propagate back pointers B_v(t, s); prune unlikely hypotheses.
Word pair level: process word ends. For each pair (w, v), store the best boundary and the best predecessor:
H(w; t) = max_v { p(w | v) · Q_v(t, S_w) }
v₀(w; t) = argmax_v { p(w | v) · Q_v(t, S_w) }
v = v₀(w; t), τ = B_v(t, S_w)
Many mismatch sources exist between training and test data in real applications. The most popular technique is to conduct speaker/environment adaptation:
Maximum a posteriori (MAP)
Speaker clustering
Linear regression
[Diagram: a speech database is used to train speaker-independent acoustic models; adaptation data transforms them into speaker-adapted models for testing, compensating the MISMATCH between training and testing conditions]
Bayesian Duration Adaptation
Parametric duration modeling: Gaussian, Poisson and gamma distributions
Joint sequential learning of the acoustic model and the duration model
QB estimates of Gaussian and Poisson duration models were formulated.
The reproducible prior/posterior property was exploited.
Aggregate a Posteriori Linear Regression
Robustness: the prior information of the regression matrix is considered; the relation between AAPLR and MAPLR was illustrated.
Discriminative adaptation: the AAP criterion can be represented in the form of minimum error rate.
Rapid adaptation: AAPLR has a closed-form solution, making it superior to traditional discriminative adaptation (MCELR).
Speaking rate is one of the mismatch sources between training and testing. In the standard HMM, state duration is represented by the transition probability.
Non-parametric approaches: Ferguson explicitly modeled the duration, but with too many parameters.
Parametric approaches: Russell and Moore applied the Poisson distribution; Levinson applied the gamma distribution.
The HMM parameter set is extended with state duration: initial state probability, transition probability, observation density and duration density.
Maximum likelihood criterion.
[Figure: empirical state-duration distribution (relative frequency vs. duration length) fitted by geometric, Gaussian, Poisson and gamma distributions]
Duration models and their prior distributions:
Gaussian distribution with Gaussian prior
Poisson distribution with gamma prior
Gamma distribution with Gaussian prior
Estimation criteria: ML estimation, MAP estimation, QB estimation
Auxiliary Q-function
Gaussian Duration Parameters
Poisson Duration Parameters
Gamma Duration Parameters
MAP batch learning
Risk function
Risk function for the parameter η.
For the parameter ν, no closed-form solution exists; Newton's algorithm can be applied.
Gaussian duration with Gaussian prior: QB estimate.
Poisson duration with gamma prior: E-step, followed by updates of the gamma hyperparameters and the Poisson parameters.
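The gamma prior is conjugate to the Poisson duration model, which is what makes the reproducible prior/posterior updates tractable. A minimal sketch of that conjugate update (hypothetical hyperparameter names; the dissertation's exact bookkeeping may differ):

```python
def gamma_poisson_update(alpha, beta, durations):
    """Conjugate update for a Poisson duration model with a gamma prior:
    prior Gamma(alpha, beta) on the Poisson rate, observed state
    durations d_1..d_n  ->  posterior Gamma(alpha + sum(d), beta + n).
    The posterior is again a gamma, so it can serve as the prior for the
    next adaptation epoch (the reproducible prior/posterior property).
    Returns the updated hyperparameters and the posterior mean rate."""
    n = len(durations)
    alpha_post = alpha + sum(durations)
    beta_post = beta + n
    return alpha_post, beta_post, alpha_post / beta_post

print(gamma_poisson_update(2.0, 1.0, [3, 4, 5]))  # (14.0, 4.0, 3.5)
```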
Distribution estimation and discriminative training are two categories of HMM parameter estimation approaches.
Distribution estimation: maximum likelihood criterion; maximum a posteriori criterion.
Discriminative training: minimum classification error (MCE) criterion; maximum mutual information (MMI) criterion.
MCE
Misclassification measure:
d_m(X; λ) = −g_m(X; λ_m) + log{ (1/(M−1)) Σ_{j∈Ω, j≠m} exp[ g_j(X; λ_j) ] }
Loss function:
ℓ(X; λ) = ℓ(d_m(X, λ)) = 1 / (1 + exp(−d_m(X, λ)))
GPD update:
λ^(k+1) = λ^(k) − ε_k U ∇ℓ(X; λ)|_{λ = λ^(k)}

MMI
Mutual information:
I(W_m, X) = log [ p(X, W_m) / (p(X) P(W_m)) ] = log p(X | W_m) − log Σ_{j=1}^{M} p(X | W_j) P(W_j),
which can be rewritten in terms of the MCE misclassification measure d_m(X, λ) and log(M−1).
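As an illustrative numeric sketch of the MCE loss (made-up discriminant scores, sigmoid slope fixed to 1):

```python
import math

def mce_loss(scores, target):
    """Sigmoid MCE loss l(d) = 1/(1 + exp(-d)) with misclassification
    measure d = -g_target + log((1/(M-1)) * sum_{j != target} exp(g_j)).
    scores: dict mapping class -> discriminant score g_j(X; lambda_j)."""
    M = len(scores)
    competing = sum(math.exp(g) for c, g in scores.items() if c != target)
    d = -scores[target] + math.log(competing / (M - 1))
    return 1.0 / (1.0 + math.exp(-d))

# Equal scores -> d = 0 -> loss 0.5; a dominant target drives the loss toward 0
print(mce_loss({"a": 1.0, "b": 1.0}, "a"))  # 0.5
```

In a GPD update, the gradient of this smooth loss with respect to the model parameters replaces the non-differentiable 0/1 error count.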
[Figure: model transformation (e.g. MLLR, MAPLR) maps initial models to adapted models through regression classes (class 1, class 2, …)]
Linear transformation (MLLR): μ̂_m = W_r ξ_m, where ξ_m is the extended mean vector of mixture component m.
Criterion: W_ML = argmax_W p(X | W, Λ)
Row-wise solution:
w_ri = [ Σ_t Σ_{m∈Ω_r} ς_t(m) σ_{mi}^{−2} ξ_m ξ_m^T ]^{−1} [ Σ_t Σ_{m∈Ω_r} ς_t(m) σ_{mi}^{−2} x_{ti} ξ_m ]
where ς_t(m) is the occupation probability of mixture component m at time t and σ_{mi}^2 is its ith variance component.
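A minimal numpy sketch of this row-wise solution (toy statistics with hypothetical values; not the thesis code):

```python
import numpy as np

def mllr_row(xi, occ, inv_var, obs):
    """Row-wise MLLR estimate for dimension i:
    w_ri = G^-1 k, with
      G = sum_t sum_m occ[t,m] * inv_var[m] * outer(xi_m, xi_m)
      k = sum_t sum_m occ[t,m] * inv_var[m] * obs[t] * xi_m
    xi: (M, d+1) extended mean vectors [1, mu_m]; occ: (T, M) occupation
    probabilities; inv_var: (M,) 1/sigma^2_mi; obs: (T,) ith observation."""
    d1 = xi.shape[1]
    G = np.zeros((d1, d1))
    k = np.zeros(d1)
    for t in range(obs.shape[0]):
        for m in range(xi.shape[0]):
            w = occ[t, m] * inv_var[m]
            G += w * np.outer(xi[m], xi[m])
            k += w * obs[t] * xi[m]
    return np.linalg.solve(G, k)

# Toy case: two components whose observations already match their means
# recover the identity row [bias, scale] = [0, 1].
xi = np.array([[1.0, 0.0], [1.0, 1.0]])
occ = np.array([[1.0, 0.0], [0.0, 1.0]])
w = mllr_row(xi, occ, np.array([1.0, 1.0]), np.array([0.0, 1.0]))
print(w)  # ~[0., 1.]
```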
Prior of the regression matrices:
g(W_r) ∝ Π_{i=1}^{d} |Σ_ri|^{−1/2} exp{ −(w_ri − m_ri)^T Σ_ri^{−1} (w_ri − m_ri) / 2 }
Criterion: W_MAP = argmax_W p(W, X) = argmax_W p(X | W, Λ) g(W)
Row-wise solution:
w_ri = [ Σ_ri^{−1} + Σ_t Σ_{m∈Ω_r} ς_t(m) σ_{mi}^{−2} ξ_m ξ_m^T ]^{−1} [ Σ_ri^{−1} m_ri + Σ_t Σ_{m∈Ω_r} ς_t(m) σ_{mi}^{−2} x_{ti} ξ_m ]
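Relative to MLLR, the MAPLR row solution only adds the prior terms; with no adaptation data it falls back to the prior mean. A toy numpy sketch (hypothetical values, assumed Gaussian row prior with mean m_ri and covariance Sigma_ri):

```python
import numpy as np

def maplr_row(xi, occ, inv_var, obs, m_prior, Sigma_prior):
    """MAPLR row estimate: MLLR statistics plus prior terms,
    w_ri = (Sigma_ri^-1 + G)^-1 (Sigma_ri^-1 m_ri + k)."""
    d1 = xi.shape[1]
    G = np.zeros((d1, d1))
    k = np.zeros(d1)
    for t in range(obs.shape[0]):
        for m in range(xi.shape[0]):
            w = occ[t, m] * inv_var[m]
            G += w * np.outer(xi[m], xi[m])
            k += w * obs[t] * xi[m]
    P = np.linalg.inv(Sigma_prior)
    return np.linalg.solve(P + G, P @ m_prior + k)

# With no adaptation data (T = 0) the estimate equals the prior mean m_ri.
xi = np.array([[1.0, 0.0], [1.0, 1.0]])
w = maplr_row(xi, np.zeros((0, 2)), np.array([1.0, 1.0]),
              np.zeros(0), np.array([0.5, 2.0]), np.eye(2))
print(w)  # ~[0.5, 2.]
```

This shrinkage toward the prior is what makes MAPLR robust with sparse adaptation data.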
MCELR
Discriminant function:
g_r(X; W_r) = Σ_t Σ_{m∈Ω_r} ς_t(m) log p(x_t | λ_m, W_r)
Solution: an iterative GPD update of w_ri^(k), weighted by ℓ(d(X; W))(1 − ℓ(d(X; W))) and contrasting the target-model statistics (ς_t(m), ξ_m, σ_{mi}^2) against the competing-model statistics (ς_t(j), ξ_j, σ_{ji}^2); no closed form.

CMLLR
Solution: a closed-form row update built from the difference of target and competing occupation statistics, (ς_t(m) − ς_t(j)), plus a smoothing term with a constant D proportional to the accumulated occupation counts to ensure stability.
Let p(X_{m,n} | λ_m) = Π_{t=1}^{T_n} p(x_t | λ_m).
Then the aggregate a posteriori (AAP) criterion is
J_AAP(Λ) = Π_{m=1}^{M} Π_{n=1}^{N_m} P_AAP(λ_m | X_{m,n}) = Π_{m=1}^{M} Π_{n=1}^{N_m} [ p(X_{m,n} | λ_m) P(λ_m) / Σ_{j=1}^{M} p(X_{m,n} | λ_j) P(λ_j) ]
Loss function:
J_AAP(Λ) = Π_{m=1}^{M} Π_{n=1}^{N_m} ℓ(d_AAP(X_{m,n}, λ))
where
d_AAP(X_{m,n}, λ) = log p(X_{m,n} | λ_m) P(λ_m) − log Σ_{j∈Ω, j≠m} p(X_{m,n} | λ_j) P(λ_j)
AAPLR:
W_AAP = argmax_W p_AAP(X | W_r, Λ) = argmax_W J_AAP(W_r)
J_AAP(W_r) = g(W_r) Π_{m=1}^{M} Π_{n=1}^{N_m} [ p(X_{m,n} | W_r, λ_m) P(λ_m) / Σ_j p(X_{m,n} | W_r, λ_j) P(λ_j) ]
MAPLR:
J_MAP(W_r) = log { g(W_r) Π_{m=1}^{M} Π_{n=1}^{N_m} p(X_{m,n} | W_r, λ_m) }
Compared with MAPLR, the AAPLR objective additionally normalizes by the competing models through the denominator.
In the form of MCE:
MCE criterion:
J_AAP(W) = Π_{m=1}^{M} Π_{n=1}^{N_m} ℓ(d_AAP(X_{m,n}, W_r))
Misclassification measure:
d_AAP(X_{m,n}, W_r) = g(X_{m,n}; λ_r, W_r) − log{ (1/(M−1)) Σ_{j∈Ω, j≠r} exp[ g(X_{m,n}; λ_j, W_r) ] }
where g(X_{m,n}; λ_m, W_r) = log{ p(X_{m,n} | λ_m, W_r) g(W_r) }
Adopting a diagonal covariance matrix:
p(x_t | λ_m, W_r) = (2π)^{−d/2} (Π_{i=1}^{d} σ_{mi}^2)^{−1/2} exp{ −½ Σ_{i=1}^{d} (x_{ti} − w_ri ξ_m)^2 / σ_{mi}^2 }
Closed-form AAPLR solution: the row vector w_ri^AAP takes the same inverse-matrix-times-vector form as MAPLR, with the prior terms Σ_ri^{−1} and Σ_ri^{−1} m_ri, target-model statistics weighted by L_r(X_{m,n}), and competing-model statistics weighted by Ψ_r(X_{m,n}) Φ_j(X_{m,n}), where
Ψ_r(X_{m,n}) = 1 / Σ_{j∈Ω_r} exp[ g(X_{m,n}; λ_j, W_r) ]
Φ_j(X_{m,n}) = exp[ g(X_{m,n}; λ_j, W_r) ]
L_r(X_{m,n}) = ℓ(d_AAP(X_{m,n})) (1 − ℓ(d_AAP(X_{m,n})))
Method | Joint Probability | Competing Hypothesis | Bayesian Learning | Closed-Form Solution
MLLR | Product | No | No | Yes
MAPLR | Product | No | Yes | Yes
MCELR [Wu and Huo, 2002] | Product | Yes | No | No
MCELR [He and Chou, 2003] | Product | Yes | No | Yes
CMLLR | Product | Yes | No | Yes
MPELR | Sum | Yes | No | Yes
AAPLR | Sum | Yes | Yes | Yes
[Figure: histograms of log likelihood (frequency of occurrences) for the target and competing models under MLLR and AAPLR adaptation]
Training databases
Connected digits: 1000 utterances, 50 male and 50 female speakers
TCC300 microphone speech database: about 16 hours, 100 speakers for training

Testing databases
Car noisy speech database: speech recorded in a car at 50 km/h, 10 speakers
TCC300 database: 20 speakers
Broadcast news database: radio stations, Public Television Service News; MATBN database, anchor speech
All utterances were sampled at 16 kHz with 16-bit resolution.
Feature representation: 12 MFCCs, 1 log energy, and their derivatives.
Channel effect removal: cepstral mean subtraction (CMS) per utterance.
Speaking rate (syl/sec):
Database | Male | Female
TCC300 | 3.55 | 4.86
Broadcast news | 5.50 | 5.47

SER (%): without duration model 38.2; with duration models: Gaussian 36.9, Poisson 36.4, gamma 35.6
KL divergence measure between the empirical distribution d̂_e(τ) and the estimated parametric distribution d̂(τ):
KL(d̂_e ‖ d̂) = ∫ d̂_e(τ) log[ d̂_e(τ) / d̂(τ) ] dτ

Parametric distribution | Divergence
Gaussian | 0.243
Poisson | 0.185
Gamma | 0.134
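A discrete sketch of this divergence measure for a duration histogram against a Poisson fit (toy counts, hypothetical rate values):

```python
import math

def kl_empirical_vs_poisson(counts, lam):
    """Discrete analogue of sum_tau d_e(tau) * log(d_e(tau) / d(tau))
    between an empirical duration histogram and a Poisson fit.
    counts: dict mapping duration -> frequency; lam: Poisson rate."""
    total = sum(counts.values())
    kl = 0.0
    for tau, c in counts.items():
        p_emp = c / total
        p_fit = math.exp(-lam) * lam ** tau / math.factorial(tau)
        kl += p_emp * math.log(p_emp / p_fit)
    return kl

counts = {1: 10, 2: 20, 3: 20, 4: 10}
# A Poisson rate near the empirical mean fits better (smaller divergence)
print(kl_empirical_vs_poisson(counts, 2.5) < kl_empirical_vs_poisson(counts, 6.0))  # True
```

A smaller divergence means the parametric family matches the observed durations more closely, which is how the table above ranks the Gaussian, Poisson and gamma models.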
[Figures: syllable error rate (%) under MAP adaptation with N=30 and N=5 adaptation utterances vs. no adaptation, comparing no duration model, Gaussian duration+Gaussian prior, Gaussian duration+gamma prior, Poisson duration+gamma prior, and gamma duration+Gaussian prior]
[Figures: empirical duration distribution (relative frequency vs. duration length) compared with the Poisson fit at the 3rd and 5th adaptation epochs, and with the Gaussian and Poisson fits at the 5th epoch]
[Figure: syllable error rate (%) vs. number of adaptation utterances (5-30) for the baseline, Gaussian duration+Gaussian prior, and Poisson duration+gamma prior]
Duration model | Recognition Time (ms/syl) | Adaptation Time (ms/syl)
Gaussian | 2.6 | 2.7
Poisson | 2.5 | 2.7
Gamma | 3.2 | 4.2
Two-pass adaptation strategy
Task adaptation: 200 utterances (30 min)
Speaker adaptation: 60 utterances (14 min), one male and one female reporter
Testing set: 40 utterances (9 min)
[Figure: supervised adaptation, syllable error rate (%) vs. number of adaptation utterances (5-60) for MLLR, MAPLR, MCELR, CMLLR and AAPLR, all with R=4 regression classes]
[Figure: adaptation time (sec) vs. number of adaptation utterances for MLLR, MAPLR, MCELR, CMLLR and AAPLR]
A joint Bayesian learning framework of HMM and duration parameters was proposed.
Gaussian, Poisson and gamma densities for duration modeling were evaluated.
QB estimates for Gaussian and Poisson duration models were formulated.
The reproducible prior/posterior property was applied to establish the updating mechanism for the prior statistics.
The aggregate a posteriori linear regression (AAPLR) algorithm was proposed for speaker adaptation.
Broadcast news transcription was carried out to evaluate the performance improvement.
The AAP criterion was introduced to achieve model discriminability and to derive rapid parameter estimation.
A closed-form solution to AAPLR was derived to achieve desirable adaptation performance.
Duration modeling
Alternative distributions, e.g. alpha-stable distributions, will be investigated.
Application to higher acoustic levels, e.g. sub-syllable, syllable and word.
Discriminative linear regression adaptation
Convergence problem
Sum of probabilities vs. product of probabilities
References
Chou, W., Juang, B.-H., 2003. Pattern Recognition in Speech and Language Processing. CRC Press.
DeGroot, M. H., 1970. Optimal Statistical Decisions. McGraw-Hill.
Dempster, A. P., Laird, N. M., Rubin, D. B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (B), vol. 39, pp. 1-38.
Duda, R. O., Hart, P. E., Stork, D. G., 2001. Pattern Classification. John Wiley & Sons.
Chien, J.-T., Huang, C.-H., 2003. Bayesian learning of speech duration models. IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 558-567.
Chien, J.-T., Huang, C.-H., 2006. Aggregate a posteriori linear regression. IEEE Transactions on Audio, Speech and Language Processing.