Joint Optimisation of Tandem Systems using Gaussian Mixture Density - PowerPoint PPT Presentation

Joint Optimisation of Tandem Systems using Gaussian Mixture Density Neural Network Discriminative Sequence Training
Chao Zhang and Phil Woodland
March 8, 2017
Cambridge University Engineering Department

Introduction
Tandem Systems as Mixture Density Neural Networks (MDNNs)
• Tandem systems model features produced by a DNN using GMMs
• A bottleneck (BN) DNN and GMMs combine to form an MDNN
Importance of Tandem Systems
• A general framework for modelling non-Gaussian distributions
• Can apply GMM techniques (e.g., adaptation) to improve MDNNs
• Tandem and hybrid systems produce complementary errors
Weakness of Conventional Tandem Systems
• GMMs and DNN are independently estimated → suboptimal

Introduction
Can Tandem and Hybrid Systems Have Comparable WERs?
Improved Training of Tandem Systems
• Jointly optimise the tandem system with MPE or other discriminative sequence criteria
• Can be viewed as MPE training of an MDNN hybrid system
Proposed Methods
• Adapt extended Baum-Welch (EBW) based GMM MPE training to use stochastic gradient descent (SGD)
• Propose a set of methods to improve joint optimisation stability

Methodology
System Construction Procedure
• Convert GMMs to an MDNN GMM output layer for joint training
[Flow diagram: Construct a BN DNN to extract tandem features (CE BN DNN) → Build BN GMM-HMMs by Baum-Welch (ML Tandem) → Convert conventional GMMs to a GMM layer → MPE joint training of BN DNN + GMMs by SGD (MPE MDNN-HMMs)]

Methodology
System Refinement and Decoding
• GMM layer is converted back to GMMs to reuse existing facilities
[Flow diagram: MPE joint training of BN DNN + GMMs by SGD (MPE MDNN-HMMs) → Convert the GMM layer to conventional GMMs → Apply GMM-HMM based system refinement (Jointly Trained Tandem)]

ML Tandem System Construction
• monophone BN GMM-HMMs → initial triphone BN GMM-HMMs → HMM state clustering → final triphone BN GMM-HMMs
[Figure: network diagram with the labels FBANK, BN, Linear Activation and GMM Layer]

SGD based GMM-HMM Training
GMM Parameter Update Values
• Calculate the partial derivatives of F w.r.t. each GMM parameter and input value
• For SGD, Gaussian component weight and std. dev. values are transformed so that their constraints are satisfied
Speed Up
• Rearrange the means and std. devs. of the Gaussians as matrices
• Speed up GMM calculations with highly optimised general matrix multiplication (GEMM) functions in the BLAS library
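The two ideas on this slide can be sketched together in NumPy. This is a minimal illustration, not the HTK implementation: the slides do not say which transforms were used, so the softmax on the weights and the exponential on the std. devs. are assumptions, and the function name `gmm_log_likelihood` is hypothetical. Expanding the quadratic term of a diagonal Gaussian lets the per-frame, per-Gaussian loops be replaced by two matrix products, which NumPy hands to BLAS.

```python
import numpy as np

def gmm_log_likelihood(X, weight_logits, means, log_stds):
    """Log-likelihood of each row of X (T x D) under one diagonal-covariance GMM.

    weight_logits: (G,) unconstrained values, mapped to weights by softmax (assumed).
    means:         (G, D) Gaussian means.
    log_stds:      (G, D) unconstrained values, mapped to std. devs. by exp (assumed).
    """
    # Softmax in the log domain keeps weights positive and summing to one.
    log_weights = weight_logits - np.logaddexp.reduce(weight_logits)
    # exp() keeps the std. devs. positive; 1 / sigma^2 = exp(-2 * log sigma).
    inv_var = np.exp(-2.0 * log_stds)
    dim = X.shape[1]
    # Per-Gaussian constant: every term that does not depend on x.
    const = (-0.5 * np.sum(means ** 2 * inv_var, axis=1)
             - np.sum(log_stds, axis=1)
             - 0.5 * dim * np.log(2.0 * np.pi))
    # Two matrix products cover the x-dependent terms for all frames and Gaussians.
    comp_ll = -0.5 * (X ** 2) @ inv_var.T + X @ (means * inv_var).T + const
    # Log-sum-exp over the Gaussians gives the mixture log-likelihood per frame.
    return np.logaddexp.reduce(comp_ll + log_weights, axis=1)
```

Because the parameters passed in are unconstrained, a plain SGD step on `weight_logits` and `log_stds` can never produce a negative weight or std. dev.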

MPE Training for GMM-HMMs using SGD
Regularisation
• Parameter smoothing
  • I-smoothing with $F_{\text{ML}}$: data dependent coeff. $\tau_{\text{ML}}(s, g)$
  • H-criterion with $F_{\text{MMI}}$: fixed coeff. $\tau_{\text{MMI}}$
• L2 regularisation: $\lambda \|\theta\|^2 / 2$
• Composite objective function: $F_{\text{MPE}} + \tau_{\text{MMI}} \left( F_{\text{MMI}} + \tau_{\text{ML}}(s, g)\, F_{\text{ML}} \right) + \lambda \|\theta\|^2 / 2$
Percentile based Variance Floor
• Modified to find the flooring threshold more efficiently so that it can be applied frequently in SGD
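A percentile based variance floor can be sketched as follows. The slides do not describe how the threshold search was made more efficient, so this shows only the basic idea under stated assumptions: the floor is computed per feature dimension over all Gaussians, the nearest-rank percentile is used, and the function name and the `percentile` setting are hypothetical.

```python
import math

def percentile_variance_floor(variances, percentile):
    """Raise small variances to a per-dimension percentile threshold.

    variances:  list of variance vectors, one per Gaussian.
    percentile: value in (0, 100]; an illustrative setting, not from the slides.
    Returns the floored variances and the per-dimension thresholds.
    """
    num_gauss = len(variances)
    dim = len(variances[0])
    # Nearest-rank percentile: 1-based rank of the threshold in the sorted values.
    rank = max(1, math.ceil(percentile / 100.0 * num_gauss))
    floors = [sorted(v[d] for v in variances)[rank - 1] for d in range(dim)]
    # Any variance below its dimension's threshold is replaced by the threshold.
    floored = [[max(v[d], floors[d]) for d in range(dim)] for v in variances]
    return floored, floors
```

The full sort here costs O(G log G) per dimension on every call, which is the kind of cost a more efficient threshold search would need to avoid when the floor is applied after many SGD updates.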

Tandem System Joint Optimisation
Linear to ReLU Activation Function Conversion
• Instability is observed when the averaged partial derivatives w.r.t. the linear BN features shift from positive to negative
• To avoid negative values, modify the BN layer bias to $b_{\text{bn}} - \mu_{\text{bn}} + 6\sigma_{\text{bn}}$ so that ReLU can be used equivalently
Amplified GMM Learning
• GMMs have a rather different functional form from DNN layers
• Learning rates and L2 reg. coeff. are amplified for GMMs by α
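The bias modification can be illustrated with a small sketch. It assumes that µ_bn and σ_bn are the per-dimension mean and std. dev. of the linear BN features, and that the Gaussian means are shifted by the same offset so that likelihoods are unchanged; the slides state only the new bias, so the mean shift and all names below are assumptions.

```python
def shift_bias_for_relu(bias, feat_mean, feat_std, gmm_means, num_std=6.0):
    """Shift BN biases by -mean + num_std * std and move GMM means to match."""
    # Offset added to every BN output dimension: -mu_bn + 6 * sigma_bn.
    offset = [-m + num_std * s for m, s in zip(feat_mean, feat_std)]
    new_bias = [b + o for b, o in zip(bias, offset)]
    # Assumption: adding the same offset to each Gaussian mean keeps (x - mean) intact.
    new_means = [[mu + o for mu, o in zip(g, offset)] for g in gmm_means]
    return new_bias, new_means

def relu(x):
    """Rectified linear unit for a single value."""
    return max(0.0, x)
```

After the shift, a feature would have to lie more than six std. devs. below its mean to become negative, so applying ReLU leaves almost every value untouched while guaranteeing non-negative BN features.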

Tandem System Joint Optimisation
Relative Update Value Clipping
• To avoid setting a specific threshold for each type of parameter
• Assuming values are Gaussian distributed, compute the threshold for Θ from statistics in the n-th mini-batch as $\mu_{\Theta}[n] + m\,\sigma_{\Theta}[n]$
Parameter Update Schemes
• Update GMMs and hidden layers in an interleaved manner
• Update all parameters concurrently without any restriction
• Update all parameters concurrently, then update the GMMs only
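Relative update value clipping can be sketched as below. The slides give only the threshold formula, so two details are assumptions: the statistics are taken over the magnitudes of the update values for one parameter type in the current mini-batch, and the threshold is applied symmetrically. The function name and the multiplier value used in any call are illustrative.

```python
import statistics

def clip_updates(updates, m):
    """Clip update values to mean + m * std of their magnitudes in this mini-batch."""
    # Statistics over magnitudes (assumption), recomputed for every mini-batch.
    mags = [abs(u) for u in updates]
    threshold = statistics.mean(mags) + m * statistics.pstdev(mags)
    # Symmetric clipping (assumption): keep the sign, cap the magnitude.
    clipped = [max(-threshold, min(threshold, u)) for u in updates]
    return clipped, threshold
```

Since the threshold follows the spread of each mini-batch, the same multiplier m can serve weights, biases, means and std. devs. without a hand-tuned absolute limit for each.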

Experimental Setup
Data
• 50h and 200h data from the ASRU 2015 MGB challenge
• A trigram word level LM with a 160k word dictionary
• dev.sub test set contains 5.5h data with reference segmentation and 285 automatic speaker clusters
Systems
• All experiments were conducted with HTK 3.5
• 40-dim log-Mel filter bank features with their ∆ coefficients
• DNN structure: $720 \times 1000^5 \times \{4000, 6000\}$
• BN DNN structure: $720 \times 1000^4 \times 39 \times 1000 \times \{4000, 6000\}$
• Each GMM has 16 Gaussians (sil/sp has 32 Gaussians)

Experimental Results
Comparison of EBW and SGD GMM Training (50h)
[Figure: dev.sub %WER (axis range 36 to 39) against iteration/epoch number (0 to 8). Curves:]
• EBW+Smoothing+%Var. Floor (Baseline)
• SGD+Fixed Var. Floor
• SGD+Smoothing+Fixed Var. Floor
• SGD+Smoothing+L2+Fixed Var. Floor
• SGD+Smoothing+L2+%Var. Floor

Experimental Results
Joint Training Experiments with Different α (50h)
[Figure: dev.sub %WER (axis range 34 to 38) against epoch number (0 to 4). Plot labels:]
• Concurrent Update + α=50
• Concurrent Update + α=20
• Concurrent Update + α=1
• Interleaved Update + α=50
• Extra GMM Epoch

Experimental Results
Comparisons Among Various 50h Systems
• $T_2^{50h}$ is comparable to hybrid MPE systems ($H_1^{50h}$ & $H_2^{50h}$) in both WER and # parameters, and is useful for the hybrid system ($H_4^{50h}$)

ID            System                                     WER%
$T_0^{50h}$   ML BN-GMM-HMMs                             38.4
$T_1^{50h}$   MPE BN-GMM-HMMs                            36.1
$T_2^{50h}$   MPE MDNN-HMMs                              33.8
$H_0^{50h}$   CE DNN-HMMs                                36.9
$H_1^{50h}$   MPE DNN-HMMs                               34.2
$H_2^{50h}$   MPE DNN-HMMs + $H_1^{50h}$ align.          33.7
$H_3^{50h}$   MPE DNN-HMMs + $T_2^{50h}$ align.          33.6
$H_4^{50h}$   MPE DNN-HMMs + $T_2^{50h}$ align. & tree   33.2
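The relative gain between two systems follows directly from their WERs. As a worked check using the 50h figures for MPE BN-GMM-HMMs (36.1%) and MPE MDNN-HMMs (33.8%), with a hypothetical helper name:

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, measured against the baseline system."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# MPE tandem (36.1%) versus the jointly trained MPE MDNN-HMMs (33.8%).
print(round(relative_wer_reduction(36.1, 33.8), 1))  # prints 6.4
```

The absolute difference is 2.3 points, and dividing by the 36.1% baseline gives the 6.4% relative reduction quoted in the conclusions.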

Experimental Results
Comparisons Among Various 200h Systems
• MLLR and joint decoding still improve system performance

ID             System                                                WER%
$T_0^{200h}$   ML BN-GMM-HMMs                                        33.7
$T_1^{200h}$   MPE MDNN-HMMs                                         29.8
$T_2^{200h}$   MPE MDNN-HMMs + MLLR                                  28.6
$H_0^{200h}$   CE DNN-HMMs                                           31.9
$H_1^{200h}$   MPE DNN-HMMs                                          29.6
$H_2^{200h}$   MPE DNN-HMMs + $T_1^{200h}$ align. & tree             29.0
$J_1^{200h}$   $T_1^{200h} \otimes H_2^{200h}$ joint decoding        28.3
$J_2^{200h}$   $T_2^{200h} \otimes H_2^{200h}$ joint decoding        27.4

Conclusions
Main Contributions Include
• EBW based GMM-HMM MPE training is extended to SGD
• MDNN discriminative sequence training is studied as tandem system joint optimisation
• A set of methods is modified/proposed to improve training, resulting in a 6.4% rel. WER reduction over MPE tandem systems
The Jointly Trained Tandem System
• is comparable to MPE hybrid systems in WER and # parameters
• is useful for hybrid system construction and system combination
• can also benefit from existing GMM approaches (e.g., MLLR)

Thanks for listening!
