SLIDE 1

Discriminative Linear Transforms for Feature Normalization and Speaker Adaptation in HMM Estimation

Stavros Tsakalidis, Vlasios Doumpiotis, William Byrne Center for Language and Speech Processing Department of Electrical and Computer Engineering The Johns Hopkins University Baltimore MD, USA

SLIDE 2

Discriminative Linear Transforms

Goal: Develop discriminative versions of existing Maximum Likelihood training procedures.

Focus: Techniques that incorporate ML estimation of linear transforms during training:

  • MLLT: Transform acoustic data to ease diagonal covariance Gaussian modeling assumption
  • SAT: Apply speaker dependent transforms to speaker independent models

Prior Work: Both MLLT and SAT were developed as ML techniques, but have also been used with MMI

  • The AT&T LVCSR-2001 system used:

– feature-based transforms obtained by ML estimation techniques
– these transforms were then fixed throughout the subsequent iterations of MMI model estimation

  • McDonough et al. (ICASSP’02) combined SAT with MMI by

– estimating the SD transforms under ML, and
– subsequently using MMI for the estimation of the SI HMM Gaussian parameters

Estimation Criterion: To develop discriminative versions of these techniques, we use Conditional Maximum Likelihood (CML) estimation procedures

  • CMLLR developed by Asela Gunawardana
  • Used for unsupervised discriminative adaptation in the JHU LVCSR-2001 evaluation system

ICSLP - Sept 2002 Center for Language and Speech Processing 2

SLIDE 3

CML Auxiliary function

  • CML criterion uses a general auxiliary function similar to EM
$$
Q(\theta;\hat\theta) = \sum_{s}\sum_{t}\left[\gamma^{\mathrm{num}}_{s,t}(\hat\theta) - \gamma^{\mathrm{den}}_{s,t}(\hat\theta)\right]\log p(o_t \mid s;\theta) + \sum_{s} D_s \int p(o \mid s;\hat\theta)\,\log p(o \mid s;\theta)\,do
$$

where $\theta$ is the parameter we wish to estimate under the CML criterion, $\gamma^{\mathrm{num}}_{s,t}$ and $\gamma^{\mathrm{den}}_{s,t}$ are the state occupancies under the numerator (reference) and denominator (recognition) models, and $D_s$ are the smoothing constants

  • Parameter values are tied over sets of states, defined by the regression classes
$$C_r = \{\, s : c(s) = r \,\}$$

where $c(s)$ maps each state to its regression class

  • We apply this criterion to two estimation problems:
      1. Covariance modeling
      2. Speaker adaptive training
  • State dependent distributions are reparametrized to incorporate the linear transforms
  • CML versions of MLLT and SAT are readily obtained
  • Goal is to maximize the conditional likelihood $P(W \mid O; \theta)$ by alternately updating the transforms and HMM parameters

  • As a result both transforms and HMM Gaussian parameters are estimated discriminatively
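The discriminative weighting that drives these updates can be illustrated with the occupancy statistics the auxiliary function accumulates. A minimal pure-Python sketch for a single state and scalar features; the function and variable names are illustrative, not from the paper:

```python
# Sketch: accumulate the (numerator - denominator) occupancy-weighted
# statistics that appear in the CML auxiliary function for one state.
# Names and data layout are illustrative, not from the paper.

def accumulate_cml_stats(frames, gamma_num, gamma_den):
    """frames: scalar observations; gamma_num / gamma_den: per-frame
    posteriors under the numerator (reference) and denominator
    (recognition) models."""
    occ = x_sum = xx_sum = 0.0
    for x, gn, gd in zip(frames, gamma_num, gamma_den):
        w = gn - gd              # discriminative weight; may be negative
        occ += w                 # zeroth-order statistic
        x_sum += w * x           # first-order statistic
        xx_sum += w * x * x      # second-order statistic
    return occ, x_sum, xx_sum

frames    = [1.0, 2.0, 3.0]
gamma_num = [0.9, 0.8, 0.7]
gamma_den = [0.5, 0.6, 0.9]
occ, m1, m2 = accumulate_cml_stats(frames, gamma_num, gamma_den)
```

Frames where the denominator (recognition) posterior exceeds the numerator posterior contribute negative weight, which is what pushes the parameters away from competing hypotheses.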

SLIDE 4

Discriminative Likelihood Linear Transforms

Goal: Transform the feature vector to capture the correlation between the vector components.

Apply the affine transform matrix $W = [A \;\; b]$ to the extended observation vector $\zeta_t = [o_t^\top \;\; 1]^\top$, so that $W \zeta_t = A o_t + b$.

Under the preceding model, the reparametrized emission density of state $s$ in regression class $c(s) = r$ is

$$
p(o_t \mid s; \theta) = \frac{|A_r|}{(2\pi)^{d/2}\,|\Sigma_s|^{1/2}} \exp\left( -\tfrac{1}{2}\,(W_r \zeta_t - \mu_s)^\top \Sigma_s^{-1} (W_r \zeta_t - \mu_s) \right)
$$

Objective: We estimate the transforms and HMM parameters under the CML criterion. The transforms obtained under this criterion are termed Discriminative Likelihood Linear Transforms (DLLT). The estimation is performed as a two-stage iterative procedure:

a) First maximize the CML criterion with respect to the transforms while keeping the Gaussian parameters fixed

b) Subsequently compute the Gaussian parameters using the updated values of the affine transforms
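The reparametrized density can be sketched concretely. The following assumes a 2-dimensional feature with diagonal covariance; the function name and hand-rolled matrix arithmetic are illustrative, not from the paper:

```python
import math

# Sketch (assumed 2-D diagonal-covariance case): evaluate the reparametrized
# emission density |A| * N(A x + b ; mu, Sigma), i.e. the feature-side
# transform of (D)LLT. Names are illustrative.

def log_emission(x, A, b, mu, var):
    # y = A x + b, written out by hand to stay dependency-free
    y = [A[0][0]*x[0] + A[0][1]*x[1] + b[0],
         A[1][0]*x[0] + A[1][1]*x[1] + b[1]]
    # Jacobian term log|det A| from transforming the features
    log_det_A = math.log(abs(A[0][0]*A[1][1] - A[0][1]*A[1][0]))
    ll = log_det_A
    for yi, mi, vi in zip(y, mu, var):   # diagonal-covariance Gaussian
        ll += -0.5*math.log(2*math.pi*vi) - 0.5*(yi - mi)**2 / vi
    return ll

# With A = I and b = 0 this reduces to the untransformed Gaussian density.
identity = [[1.0, 0.0], [0.0, 1.0]]
ll0 = log_emission([0.0, 0.0], identity, [0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
```

Note the $\log|A|$ Jacobian term: it is what distinguishes a feature-side transform from a model-side (mean) transform, where no such term appears.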

SLIDE 5

Effective DLLT Estimation

As in MLLT (Gales '97), the $i$-th row of the transformation matrix is found by

$$[A_r \;\; b_r]_i = (\alpha\, c_{r,i} + k_{r,i})\, G_{r,i}^{-1}$$

where $c_{r,i}$ is the extended cofactor row of $A_r$, and

$$
G_{r,i} = \sum_{s : c(s)=r} \frac{1}{\sigma^2_{s,i}} \left\{ \sum_{t} \left[ \gamma^{\mathrm{num}}_{s,t}(\theta) - \gamma^{\mathrm{den}}_{s,t}(\theta) \right] \zeta_t \zeta_t^\top + D_s E_s \right\}
$$

$$
k_{r,i} = \sum_{s : c(s)=r} \frac{\mu_{s,i}}{\sigma^2_{s,i}} \left\{ \sum_{t} \left[ \gamma^{\mathrm{num}}_{s,t}(\theta) - \gamma^{\mathrm{den}}_{s,t}(\theta) \right] \zeta_t^\top + D_s \left[ \big(A_r^{-1}(\mu_s - b_r)\big)^\top \;\; 1 \right] \right\}
$$

$$
E_s = \begin{bmatrix} A_r^{-1}\big(\Sigma_s + (\mu_s - b_r)(\mu_s - b_r)^\top\big) A_r^{-\top} & A_r^{-1}(\mu_s - b_r) \\ \big(A_r^{-1}(\mu_s - b_r)\big)^\top & 1 \end{bmatrix}
$$

Problem: The diagonal terms of $\Sigma_s + (\mu_s - b_r)(\mu_s - b_r)^\top$ dominate when $\Sigma_s$ is diagonal.

  • The large values of $D_s$ as used in MMI further exaggerate this effect.
  • The resulting DLLT transform is effectively the identity.

Solution: Replace $\Sigma_s$ in $E_s$ by the estimate of its full covariance matrix.
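The diagonal-dominance problem is easy to see numerically. A toy sketch with assumed values (the helper name is illustrative): with a diagonal covariance, each diagonal entry of the smoothing outer-product term gets an extra $\sigma^2$ while the off-diagonals come only from the mean-offset outer product.

```python
# Toy illustration: build Sigma + (mu - b)(mu - b)^T for diagonal Sigma
# and observe that only the diagonal receives the sigma^2 contribution,
# which is what pulls the DLLT transform toward identity.

def smoothing_outer(sigma2, mu, b):
    d = [m - bi for m, bi in zip(mu, b)]
    n = len(d)
    M = [[d[i]*d[j] for j in range(n)] for i in range(n)]  # outer product
    for i in range(n):
        M[i][i] += sigma2[i]   # diagonal Sigma contributes here only
    return M

M = smoothing_outer([4.0, 9.0], [1.0, 2.0], [0.0, 0.0])
# M = [[5.0, 2.0], [2.0, 13.0]]  -> diagonal dominates the off-diagonals
```

A full-covariance estimate of $\Sigma_s$ would add mass to the off-diagonal entries as well, which is the motivation for the solution stated above.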

SLIDE 6

Discriminative Speaker Adaptive Training

Goal: Reduce the inter-speaker variability within the training set.

Apply speaker dependent transforms to the speaker independent means. Under the preceding model, the reparametrized emission density for state $s$ and speaker $k$ is

$$
p(o_t \mid s; \theta, k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_s|^{1/2}} \exp\left( -\tfrac{1}{2}\,\big(o_t - W^{(k)} \xi_s\big)^\top \Sigma_s^{-1} \big(o_t - W^{(k)} \xi_s\big) \right)
$$

with $\xi_s = [\mu_s^\top \;\; 1]^\top$, so that $W^{(k)} \xi_s = A^{(k)} \mu_s + b^{(k)}$

Objective: Compute the speaker dependent transforms and the speaker independent parameters of the state dependent distributions under the CML criterion. We call this procedure Discriminative Speaker Adaptive Training (DSAT). The estimation is performed as a two-stage iterative procedure:

a) First maximize the CML criterion with respect to the speaker dependent affine transforms while keeping the speaker independent means fixed to their current values

b) Subsequently compute the speaker independent means and variances using the updated values of the speaker dependent affine transforms
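In contrast with the feature-side transform of DLLT, the SAT/DSAT transform acts on the speaker independent mean, so no Jacobian term appears. A sketch for an assumed 2-D diagonal-covariance case; names are illustrative:

```python
import math

# Sketch: SAT/DSAT-style evaluation where the speaker transform (A, b)
# is applied to the speaker independent mean mu_si, not the features.
# No |det A| Jacobian term, unlike a feature-side transform.

def log_emission_sat(x, A, b, mu_si, var):
    # speaker dependent mean: mu_sd = A mu_si + b, written out by hand
    mu_sd = [A[0][0]*mu_si[0] + A[0][1]*mu_si[1] + b[0],
             A[1][0]*mu_si[0] + A[1][1]*mu_si[1] + b[1]]
    ll = 0.0
    for xi, mi, vi in zip(x, mu_sd, var):  # diagonal-covariance Gaussian
        ll += -0.5*math.log(2*math.pi*vi) - 0.5*(xi - mi)**2 / vi
    return ll

# Identity transform recovers the speaker independent density.
identity = [[1.0, 0.0], [0.0, 1.0]]
ll_si = log_emission_sat([0.0, 0.0], identity, [0.0, 0.0],
                         [0.0, 0.0], [1.0, 1.0])
```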

SLIDE 7

System Description

  • Acoustic Models

– Standard HTK flat-start training procedure
– Tied state, cross-word, context-dependent triphones
– 4000 unique triphone states
– 6 mixtures per speech state
– Tagged acoustic clustering to incorporate interjection and word-boundary info

  • Training/Test Set

– The collection defined the minitrain & minitest sets for the 2001 JHU LVCSR system
– Training: 16.4 hours from Switchboard-1 and 0.5 hour from Callhome English data
– Test: 866 utterances from the 2000 Hub-5 Switchboard-1 evaluation set (Swbd1) and 913 utterances from the 1998 Hub-5 Switchboard-2 evaluation set (Swbd2)

SLIDE 8

MMI training & Regression Class Selection

  • Discriminative training requires alternate word sequences that are representative of the recognition errors made by the decoder:

– Obtain triphone lattices generated on the training data, using the AT&T FSM decoder
– Use the Viterbi procedure over triphone segments, rather than accumulating statistics via the Forward-Backward procedure at the word level
– These triphone segments are fixed throughout MMI training

  • Assignment of Gaussians into classes:

– Use a variation of the HTK regression class tree implementation
– All states of all context-dependent phones associated with the same monophone are assigned to the same initial class
– Apply the HTK splitting algorithm to each of the initial classes
– Constraint: all the mixture components associated with the same state belong to the same regression class
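The class-assignment constraint above can be sketched in a few lines. The triphone naming scheme and data layout below are illustrative assumptions, not HTK's actual implementation:

```python
# Sketch of the regression-class constraint: every mixture component of a
# state inherits the state's class, and initial classes group
# context-dependent states by base monophone. Data layout is illustrative.

def assign_regression_classes(states):
    """states: dict mapping a triphone-state name like 'a-b+c.s2'
    (left-ctx '-' base '+' right-ctx '.' state) to its mixture count."""
    classes = {}
    for name, n_mix in states.items():
        mono = name.split('-')[-1].split('+')[0]   # extract base monophone
        # all mixture components of this state share one class
        classes.update({(name, m): mono for m in range(n_mix)})
    return classes

cls = assign_regression_classes({'a-b+c.s2': 2, 'x-b+y.s3': 1,
                                 'a-k+c.s2': 2})
```

All states of triphones sharing a base monophone land in the same initial class, which the splitting algorithm would then refine.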

SLIDE 9

Goals of the Experiments

  • Compare ML trained transforms to CML trained transforms:

– Gaussian parameters are fixed throughout transform updates
– Test whether CML transforms improve over ML transforms
– Validate CML as a modeling procedure

  • Compare ML training techniques (MLLT, SAT) to their fully discriminative counterparts:

– investigate fully discriminative training compared to ML training

  • Identify a proper initialization point for our discriminative techniques:

– Proper seeding of DLLT and DSAT turns out to be crucial

SLIDE 10

DLLT Experiments - WER(%)

Throughout the experiments we use a fixed set of regression classes.

Table A: Estimation of transforms under ML (MLLT) and CML (DLLT); no mean and variance update.

Transform Reestimation Only    Swbd1   Swbd2
ML                             41.1    51.1
ML+MLLT-1it                    39.1    50.3
ML+MLLT-2it                    39.4    50.4
ML+DLLT-1it                    38.5    49.7
ML+DLLT-2it                    38.3    49.9

Table B: CML update of transforms and Gaussian parameters when seeded from the ML baseline.

DLLT        Swbd1   Swbd2
ML          41.1    51.1
DLLT-1it    38.2    49.2
DLLT-2it    37.3    48.9
DLLT-3it    37.8    48.8

Table C: CML update of transforms and Gaussian parameters when seeded from a well trained MLLT system.

MLLT and DLLT     Swbd1   Swbd2
ML                41.1    51.1
MLLT-1it          38.4    49.6
MLLT-2it          38.2    49.5
MLLT-3it          38.2    49.3
MLLT-6it          37.8    49.0
MLLT+DLLT-1it     37.4    48.6
MLLT+DLLT-2it     36.8    48.6

Observations:

  • DLLT in isolation is better than MLLT (A)
  • DLLT works best when initialized by MLLT (B vs. C)

SLIDE 11

DSAT Experiments - WER(%)

  • Throughout the experiments we used a fixed set of 2 regression classes (speech & silence)
  • Decoding results include unsupervised MLLR adaptation

ML-SAT: ML updating of transforms and Gaussian means and variances
DSAT: CML updating of transforms and Gaussian means, seeded by ML-SAT

Iteration    ML-SAT Swbd1   ML-SAT Swbd2   DSAT Swbd1   DSAT Swbd2
Baseline     35.9           47.0           -            -
1            35.7           45.6           34.1         44.7
2            35.2           45.4           33.8         44.6
3            35.0           45.2           33.6         44.5
4            34.7           45.1           33.4         44.3
5            34.5           44.9           33.4         44.2
6            34.3           45.0           -            -
7            34.0           45.0           -            -
8            34.1           44.9           -            -

Conclusion: Discriminative estimation improves over ML estimation of speaker dependent transforms and speaker independent mean parameters

SLIDE 12

Conclusions

  • Integrated discriminative linear transforms into MMI estimation for LVCSR
  • Developed estimation procedures that find discriminative transforms in conjunction with MMI for:

– speaker adaptive training – feature normalization

  • We have found that discriminative versions of speaker adaptive training and feature normalization outperform ML training
  • Each technique gives approximately 0.8% absolute WER improvement on the Switchboard corpus over the ML estimation procedures

  • DLLT and DSAT were used in the JHU LVCSR-2002 Evaluation System
  • Future work:

– DSAT and DLLT may yield complementary improvements in performance when used together, if in fact they are capturing different acoustic phenomena

  • For more information see:

http://www.clsp.jhu.edu/research/rteval
