SLIDE 1

Discriminative Linear Transforms for Feature Normalization and Speaker Adaptation in HMM Estimation

Stavros Tsakalidis, Vlasios Doumpiotis, William Byrne Center for Language and Speech Processing Department of Electrical and Computer Engineering The Johns Hopkins University Baltimore MD, USA

SLIDE 2

Discriminative Linear Transforms

Goal: Develop discriminative versions of existing Maximum Likelihood training procedures.

Focus: Techniques that incorporate ML estimation of linear transforms during training:

  • MLLT: Transform acoustic data to ease diagonal covariance Gaussian modeling assumption
  • SAT: Apply speaker dependent transforms to speaker independent models

Prior Work: Both MLLT and SAT were developed as ML techniques, but have also been used with MMI

  • The AT&T LVCSR-2001 system used:

– feature-based transforms obtained by ML estimation techniques
– these transforms were then fixed throughout the subsequent iterations of MMI model estimation

  • McDonough et al. (ICASSP’02) combined SAT with MMI by

– estimating the SD transforms under ML, and
– subsequently using MMI for the estimation of the SI HMM Gaussian parameters

Estimation Criterion: To develop discriminative versions of these techniques, we use Conditional Maximum Likelihood (CML) estimation procedures

  • CMLLR developed by Asela Gunawardana
  • Used for unsupervised discriminative adaptation in the JHU LVCSR-2001 evaluation system

ICSLP - Sept 2002 Center for Language and Speech Processing 2

SLIDE 3

CML Auxiliary function

  • CML criterion uses a general auxiliary function similar to EM
$$
Q(\theta;\hat\theta) = \sum_{s}\sum_{t}\left[\gamma^{\mathrm{num}}_{s,t}(\hat\theta) - \gamma^{\mathrm{den}}_{s,t}(\hat\theta)\right]\log p(o_t \mid s;\theta) + \sum_{s} D_s \int p(o \mid s;\hat\theta)\,\log p(o \mid s;\theta)\,do
$$

where $\theta$ is the parameter we wish to estimate under the CML criterion, $\gamma^{\mathrm{num}}_{s,t}$ and $\gamma^{\mathrm{den}}_{s,t}$ are the state occupancies under the numerator (reference) and denominator (recognition) models, and $D_s$ are the smoothing constants

  • Parameter values are tied over sets of states, defined by the regression classes
$$C_r = \{\, s : c(s) = r \,\}$$

where $c(s)$ maps each state to its regression class

  • We apply this criterion to two estimation problems:
      1. Covariance modeling
      2. Speaker adaptive training
  • State dependent distributions are reparametrized to incorporate the linear transforms
  • CML versions of MLLT and SAT are readily obtained
  • Goal is to maximize the conditional likelihood $P(W \mid O; \theta)$ by alternately updating the transforms and HMM parameters

  • As a result both transforms and HMM Gaussian parameters are estimated discriminatively
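The discriminative weighting that drives these updates can be illustrated with the occupancy statistics the auxiliary function accumulates. A minimal pure-Python sketch for a single state and scalar features; the function and variable names are illustrative, not from the paper:

```python
# Sketch: accumulate the (numerator - denominator) occupancy-weighted
# statistics that appear in the CML auxiliary function for one state.
# Names and data layout are illustrative, not from the paper.

def accumulate_cml_stats(frames, gamma_num, gamma_den):
    """frames: scalar observations; gamma_num / gamma_den: per-frame
    posteriors under the numerator (reference) and denominator
    (recognition) models."""
    occ = x_sum = xx_sum = 0.0
    for x, gn, gd in zip(frames, gamma_num, gamma_den):
        w = gn - gd              # discriminative weight; may be negative
        occ += w                 # zeroth-order statistic
        x_sum += w * x           # first-order statistic
        xx_sum += w * x * x      # second-order statistic
    return occ, x_sum, xx_sum

frames    = [1.0, 2.0, 3.0]
gamma_num = [0.9, 0.8, 0.7]
gamma_den = [0.5, 0.6, 0.9]
occ, m1, m2 = accumulate_cml_stats(frames, gamma_num, gamma_den)
```

Frames where the denominator (recognition) posterior exceeds the numerator posterior contribute negative weight, which is what pushes the parameters away from competing hypotheses.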

SLIDE 4

Discriminative Likelihood Linear Transforms

Goal: Transform the feature vector to capture the correlation between the vector components.

Apply the affine transform matrix $W = [A \;\; b]$ to the extended observation vector $\zeta_t = [o_t^\top \;\; 1]^\top$, so that $W \zeta_t = A o_t + b$.

Under the preceding model, the reparametrized emission density of state $s$ in regression class $c(s) = r$ is

$$
p(o_t \mid s; \theta) = \frac{|A_r|}{(2\pi)^{d/2}\,|\Sigma_s|^{1/2}} \exp\left( -\tfrac{1}{2}\,(W_r \zeta_t - \mu_s)^\top \Sigma_s^{-1} (W_r \zeta_t - \mu_s) \right)
$$

Objective: We estimate the transforms and HMM parameters under the CML criterion. The transforms obtained under this criterion are termed Discriminative Likelihood Linear Transforms (DLLT). The estimation is performed as a two-stage iterative procedure:

a) First maximize the CML criterion with respect to the transforms while keeping the Gaussian parameters fixed

b) Subsequently compute the Gaussian parameters using the updated values of the affine transforms
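The reparametrized density can be sketched concretely. The following assumes a 2-dimensional feature with diagonal covariance; the function name and hand-rolled matrix arithmetic are illustrative, not from the paper:

```python
import math

# Sketch (assumed 2-D diagonal-covariance case): evaluate the reparametrized
# emission density |A| * N(A x + b ; mu, Sigma), i.e. the feature-side
# transform of (D)LLT. Names are illustrative.

def log_emission(x, A, b, mu, var):
    # y = A x + b, written out by hand to stay dependency-free
    y = [A[0][0]*x[0] + A[0][1]*x[1] + b[0],
         A[1][0]*x[0] + A[1][1]*x[1] + b[1]]
    # Jacobian term log|det A| from transforming the features
    log_det_A = math.log(abs(A[0][0]*A[1][1] - A[0][1]*A[1][0]))
    ll = log_det_A
    for yi, mi, vi in zip(y, mu, var):   # diagonal-covariance Gaussian
        ll += -0.5*math.log(2*math.pi*vi) - 0.5*(yi - mi)**2 / vi
    return ll

# With A = I and b = 0 this reduces to the untransformed Gaussian density.
identity = [[1.0, 0.0], [0.0, 1.0]]
ll0 = log_emission([0.0, 0.0], identity, [0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
```

Note the $\log|A|$ Jacobian term: it is what distinguishes a feature-side transform from a model-side (mean) transform, where no such term appears.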

SLIDE 5

Effective DLLT Estimation

As in MLLT (Gales '97), the $i$-th row of the transformation matrix is found by

$$[A_r \;\; b_r]_i = (\alpha\, c_{r,i} + k_{r,i})\, G_{r,i}^{-1}$$

where $c_{r,i}$ is the extended cofactor row of $A_r$, and

$$
G_{r,i} = \sum_{s : c(s)=r} \frac{1}{\sigma^2_{s,i}} \left\{ \sum_{t} \left[ \gamma^{\mathrm{num}}_{s,t}(\theta) - \gamma^{\mathrm{den}}_{s,t}(\theta) \right] \zeta_t \zeta_t^\top + D_s E_s \right\}
$$

$$
k_{r,i} = \sum_{s : c(s)=r} \frac{\mu_{s,i}}{\sigma^2_{s,i}} \left\{ \sum_{t} \left[ \gamma^{\mathrm{num}}_{s,t}(\theta) - \gamma^{\mathrm{den}}_{s,t}(\theta) \right] \zeta_t^\top + D_s \left[ \big(A_r^{-1}(\mu_s - b_r)\big)^\top \;\; 1 \right] \right\}
$$

$$
E_s = \begin{bmatrix} A_r^{-1}\big(\Sigma_s + (\mu_s - b_r)(\mu_s - b_r)^\top\big) A_r^{-\top} & A_r^{-1}(\mu_s - b_r) \\ \big(A_r^{-1}(\mu_s - b_r)\big)^\top & 1 \end{bmatrix}
$$

Problem: The diagonal terms of $\Sigma_s + (\mu_s - b_r)(\mu_s - b_r)^\top$ dominate when $\Sigma_s$ is diagonal.

  • The large values of $D_s$ as used in MMI further exaggerate this effect.
  • The resulting DLLT transform is effectively the identity.

Solution: Replace $\Sigma_s$ in $E_s$ by the estimate of its full covariance matrix.
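The diagonal-dominance problem is easy to see numerically. A toy sketch with assumed values (the helper name is illustrative): with a diagonal covariance, each diagonal entry of the smoothing outer-product term gets an extra $\sigma^2$ while the off-diagonals come only from the mean-offset outer product.

```python
# Toy illustration: build Sigma + (mu - b)(mu - b)^T for diagonal Sigma
# and observe that only the diagonal receives the sigma^2 contribution,
# which is what pulls the DLLT transform toward identity.

def smoothing_outer(sigma2, mu, b):
    d = [m - bi for m, bi in zip(mu, b)]
    n = len(d)
    M = [[d[i]*d[j] for j in range(n)] for i in range(n)]  # outer product
    for i in range(n):
        M[i][i] += sigma2[i]   # diagonal Sigma contributes here only
    return M

M = smoothing_outer([4.0, 9.0], [1.0, 2.0], [0.0, 0.0])
# M = [[5.0, 2.0], [2.0, 13.0]]  -> diagonal dominates the off-diagonals
```

A full-covariance estimate of $\Sigma_s$ would add mass to the off-diagonal entries as well, which is the motivation for the solution stated above.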

SLIDE 6

Discriminative Speaker Adaptive Training

Goal: Reduce the inter-speaker variability within the training set.

Apply speaker dependent transforms to the speaker independent means. Under the preceding model, the reparametrized emission density for state $s$ and speaker $k$ is

$$
p(o_t \mid s; \theta, k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_s|^{1/2}} \exp\left( -\tfrac{1}{2}\,\big(o_t - W^{(k)} \xi_s\big)^\top \Sigma_s^{-1} \big(o_t - W^{(k)} \xi_s\big) \right)
$$

with $\xi_s = [\mu_s^\top \;\; 1]^\top$, so that $W^{(k)} \xi_s = A^{(k)} \mu_s + b^{(k)}$

Objective: Compute the speaker dependent transforms and the speaker independent parameters of the state dependent distributions under the CML criterion. We call this procedure Discriminative Speaker Adaptive Training (DSAT). The estimation is performed as a two-stage iterative procedure:

a) First maximize the CML criterion with respect to the speaker dependent affine transforms while keeping the speaker independent means fixed to their current values

b) Subsequently compute the speaker independent means and variances using the updated values of the speaker dependent affine transforms
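In contrast with the feature-side transform of DLLT, the SAT/DSAT transform acts on the speaker independent mean, so no Jacobian term appears. A sketch for an assumed 2-D diagonal-covariance case; names are illustrative:

```python
import math

# Sketch: SAT/DSAT-style evaluation where the speaker transform (A, b)
# is applied to the speaker independent mean mu_si, not the features.
# No |det A| Jacobian term, unlike a feature-side transform.

def log_emission_sat(x, A, b, mu_si, var):
    # speaker dependent mean: mu_sd = A mu_si + b, written out by hand
    mu_sd = [A[0][0]*mu_si[0] + A[0][1]*mu_si[1] + b[0],
             A[1][0]*mu_si[0] + A[1][1]*mu_si[1] + b[1]]
    ll = 0.0
    for xi, mi, vi in zip(x, mu_sd, var):  # diagonal-covariance Gaussian
        ll += -0.5*math.log(2*math.pi*vi) - 0.5*(xi - mi)**2 / vi
    return ll

# Identity transform recovers the speaker independent density.
identity = [[1.0, 0.0], [0.0, 1.0]]
ll_si = log_emission_sat([0.0, 0.0], identity, [0.0, 0.0],
                         [0.0, 0.0], [1.0, 1.0])
```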

SLIDE 7

System Description

  • Acoustic Models

– Standard HTK flat-start training procedure
– Tied state, cross-word, context-dependent triphones
– 4000 unique triphone states
– 6 mixtures per speech state
– Tagged acoustic clustering to incorporate interjection and word-boundary info

  • Training/Test Set

– The collection defined the minitrain & minitest sets for the 2001 JHU LVCSR system
– Training: 16.4 hours from Switchboard-1 and 0.5 hour from Callhome English data
– Test: 866 utterances from the 2000 Hub-5 Switchboard-1 evaluation set (Swbd1) and 913 utterances from the 1998 Hub-5 Switchboard-2 evaluation set (Swbd2)

SLIDE 8

MMI training & Regression Class Selection

  • Discriminative training requires alternate word sequences that are representative of the recognition errors made by the decoder:

– Obtain triphone lattices generated on the training data, using the AT&T FSM decoder
– Use the Viterbi procedure over triphone segments, rather than accumulating statistics via the Forward-Backward procedure at the word level
– These triphone segments are fixed throughout MMI training

  • Assignment of Gaussians into classes:

– Use a variation of the HTK regression class tree implementation
– All states of all context-dependent phones associated with the same monophone are assigned to the same initial class
– Apply the HTK splitting algorithm to each of the initial classes
– Constraint: all the mixture components associated with the same state belong to the same regression class
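The class-assignment constraint above can be sketched in a few lines. The triphone naming scheme and data layout below are illustrative assumptions, not HTK's actual implementation:

```python
# Sketch of the regression-class constraint: every mixture component of a
# state inherits the state's class, and initial classes group
# context-dependent states by base monophone. Data layout is illustrative.

def assign_regression_classes(states):
    """states: dict mapping a triphone-state name like 'a-b+c.s2'
    (left-ctx '-' base '+' right-ctx '.' state) to its mixture count."""
    classes = {}
    for name, n_mix in states.items():
        mono = name.split('-')[-1].split('+')[0]   # extract base monophone
        # all mixture components of this state share one class
        classes.update({(name, m): mono for m in range(n_mix)})
    return classes

cls = assign_regression_classes({'a-b+c.s2': 2, 'x-b+y.s3': 1,
                                 'a-k+c.s2': 2})
```

All states of triphones sharing a base monophone land in the same initial class, which the splitting algorithm would then refine.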

SLIDE 9

Goals of the Experiments

  • Compare ML trained transforms to CML trained transforms:

– Gaussian parameters are fixed throughout transform updates
– Test whether CML transforms improve over ML transforms
– Validate CML as a modeling procedure

  • Compare ML training techniques (MLLT, SAT) to their fully discriminative counterparts:

– investigate fully discriminative training compared to ML training

  • Identify a proper initialization point for our discriminative techniques:

– Proper seeding of DLLT and DSAT turns out to be crucial

SLIDE 10

DLLT Experiments - WER(%)

Throughout the experiments we use a fixed set of regression classes.

Table A: Estimation of transforms under ML (MLLT) and CML (DLLT); no mean and variance update.

Transform Reestimation Only    Swbd1   Swbd2
ML                             41.1    51.1
ML+MLLT-1it                    39.1    50.3
ML+MLLT-2it                    39.4    50.4
ML+DLLT-1it                    38.5    49.7
ML+DLLT-2it                    38.3    49.9

Table B: CML update of transforms and Gaussian parameters when seeded from the ML baseline.

DLLT        Swbd1   Swbd2
ML          41.1    51.1
DLLT-1it    38.2    49.2
DLLT-2it    37.3    48.9
DLLT-3it    37.8    48.8

Table C: CML update of transforms and Gaussian parameters when seeded from a well trained MLLT system.

MLLT and DLLT     Swbd1   Swbd2
ML                41.1    51.1
MLLT-1it          38.4    49.6
MLLT-2it          38.2    49.5
MLLT-3it          38.2    49.3
MLLT-6it          37.8    49.0
MLLT+DLLT-1it     37.4    48.6
MLLT+DLLT-2it     36.8    48.6

Observations:

  • DLLT in isolation is better than MLLT (A)
  • DLLT works best when initialized by MLLT (B vs. C)

SLIDE 11

DSAT Experiments - WER(%)

  • Throughout the experiments we used a fixed set of 2 regression classes (speech & silence)
  • Decoding results include unsupervised MLLR adaptation

ML-SAT: ML updating of transforms and Gaussian means and variances
DSAT: CML updating of transforms and Gaussian means, seeded by ML-SAT

Iteration    ML-SAT Swbd1   ML-SAT Swbd2   DSAT Swbd1   DSAT Swbd2
Baseline     35.9           47.0           -            -
1            35.7           45.6           34.1         44.7
2            35.2           45.4           33.8         44.6
3            35.0           45.2           33.6         44.5
4            34.7           45.1           33.4         44.3
5            34.5           44.9           33.4         44.2
6            34.3           45.0           -            -
7            34.0           45.0           -            -
8            34.1           44.9           -            -

Conclusion: Discriminative estimation improves over ML estimation of speaker dependent transforms and speaker independent mean parameters

SLIDE 12

Conclusions

  • Integrated discriminative linear transforms into MMI estimation for LVCSR
  • Developed estimation procedures that find discriminative transforms in conjunction with MMI for:

– speaker adaptive training – feature normalization

  • We have found that discriminative versions of speaker adaptive training and feature normalization outperform ML training
  • Each technique gives approximately 0.8% absolute WER improvement on the Switchboard corpus over the ML estimation procedures

  • DLLT and DSAT were used in the JHU LVCSR-2002 Evaluation System
  • Future work:

– DSAT and DLLT may yield complementary improvements in performance when used together, if in fact they are capturing different acoustic phenomena

  • For more information see:

http://www.clsp.jhu.edu/research/rteval
