 
CU-HTK April 2002 Switchboard System

Phil Woodland, Gunnar Evermann, Mark Gales, Thomas Hain, Andrew Liu, Gareth Moore, Dan Povey & Lan Wang

May 7th 2002

Cambridge University Engineering Department
Rich Transcription Workshop 2002
Overview

• Review of CU-HTK 2001 system
• Minimum Phone Error (MPE) training
• HLDA
• Speaker Adaptive Training
• Single Pronunciation dictionaries
• 2002 system & results
• Fast contrast systems
• Conclusions
Review of CU-HTK 2001 System: Basic Features

• Front-end
  – Reduced bandwidth 125–3800 Hz
  – 12 MF-PLP cepstral parameters + C0 and 1st/2nd derivatives
  – Side-based cepstral mean and variance normalisation
  – Vocal tract length normalisation in training and test
• Decision tree state clustered, context dependent triphone & quinphone models: MMIE and MLE versions
• Generate lattices with MLLR-adapted models
• Rescore using iterative lattice MLLR + Full-Variance transform adaptation
• Posterior probability decoding via confusion networks
• System combination
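As a rough illustration of the side-based normalisation step in the front-end, the sketch below normalises every cepstral dimension of one conversation side to zero mean and unit variance. The function name, the list-of-frames layout and the variance floor are assumptions made for illustration, not details of the CU-HTK front-end.

```python
import math

def normalise_side(frames, var_floor=1e-8):
    """Per-dimension mean and variance normalisation over one conversation side.

    frames: list of feature vectors (lists of floats), all frames of one side.
    var_floor: hypothetical floor that avoids division by zero.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, var_floor)
        for d in range(dim)
    ]
    return [
        [(f[d] - means[d]) / math.sqrt(variances[d]) for d in range(dim)]
        for f in frames
    ]

# Toy side with three 2-dimensional frames (illustrative values only).
side = [[1.0, 10.0], [2.0, 14.0], [3.0, 18.0]]
normalised = normalise_side(side)
```

Computing the statistics per side rather than per utterance gives each speaker a single, more reliably estimated mean and variance.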
2001 System Structure

[Figure: block diagram of the 2001 multi-pass system; only its labels survive extraction. P1: GI, MLE triphones, 27k, tgint98; resegmentation; gender detection; VTLN, CMN, CVN. P2: GI, MMIE triphones, 54k, fgint00; 1-best. P3: MLLR, 1 speech transform; GI, MMIE triphones, 54k, fgintcat00; 4-gram lattices. P4a, P4b (triphones) and P5a, P5b (quinphones): GI, MMIE and GD, MLE, ST model sets; MLLR (1 trans.); LATMLLR (2–4 trans.); FV; PPROB; CN. CNC: final result cu-htk1.]
Acoustic Training/Test Data

h5train00     248 hours Switchboard (Swbd1), 17 hours CallHome English (CHE)
h5train00sub  60 hours Swbd1, 8 hours CHE
h5train02     h5train00 + LDC cell1 corpus (without dev01/eval01 sides), extra 17 hours of data

Development test sets

dev01     40 sides Swbd2 (eval98), 40 sides Swbd1 (eval00), 38 sides Swbd2 cellular (dev01-cell)
dev01sub  half of dev01, selected to give similar WER to the full set
eval98    40 sides Swbd2 (eval98-swbd2), 40 sides of CHE (eval98-che)
2001 System Results on dev01 set

Stage                    Swbd1   Swbd2   Cellular   Total
P1   VTLN/gender det     31.7    46.9    48.1       42.1
P2   initial trans.      23.5    38.6    39.2       33.7
P3   lat gen             21.1    36.0    36.7       31.2
P4a  MMIE tri            20.0    33.5    34.0       29.1
P4b  MLE tri             21.3    35.0    35.4       30.5
P5a  MMIE quin           19.8    33.2    33.4       28.7
P5b  MLE quin            20.2    34.0    34.2       29.4
CNC  P5a+P4a+P5b         18.3    31.9    32.1       27.3

%WER on dev01 for all stages of 2001 system

• Final confidence scores have NCE 0.254
Minimum Phone Error & Other Discriminative Criteria

• MMIE maximises the posterior probability of the correct sentence
  – Problem: sensitive to outliers
• MCE maximises a smoothed approximation to the sentence accuracy
  – Problem: cannot easily be implemented with lattices; scales poorly to long sentences
• The criterion we evaluate in testing is word error rate, so it makes sense to optimise something similar to it
• MPE uses a smoothed approximation to phone error but can use the lattice-based implementation developed for MMIE
• Note that MPE is an approximation to phone error in a word recognition context, i.e. it uses word-level recognition, but scoring is on a phone error basis
• Can instead directly minimise a smoothed word error rate → Minimum Word Error (MWE). Performance for MWE is slightly worse than for MPE, so the main focus here is on MPE
MPE Objective Function

• Maximise the following function:

$$\mathcal{F}_{\mathrm{MPE}}(\lambda) = \sum_{r=1}^{R} \frac{\sum_{s} p_{\lambda}(O_r \mid s)^{\kappa}\, P(s)\, \mathrm{RawAccuracy}(s)}{\sum_{s} p_{\lambda}(O_r \mid s)^{\kappa}\, P(s)}$$

  where $\lambda$ are the HMM parameters, $O_r$ the speech data for file $r$, $\kappa$ a probability scale and $P(s)$ the LM probability of $s$
• RawAccuracy(s) measures the number of phones correctly transcribed in sentence s (derived from word recognition), i.e. # correct phones in s − # inserted phones in s
• $\mathcal{F}_{\mathrm{MPE}}(\lambda)$ is a weighted average of RawAccuracy(s) over all s
• Scale acoustic log-likelihoods by scale $\kappa$
• Criterion is to be maximised, not minimised (for compatibility with MMIE)
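The objective can be evaluated directly on an N-best list, which may help make the weighted average concrete. This is a minimal sketch: the N-best representation, the function name and the example scores (including the scale 1/12) are assumptions for illustration; the slides use lattices rather than N-best lists.

```python
import math

def mpe_objective(utterances, kappa):
    """Sum over utterances of the posterior-weighted average RawAccuracy.

    utterances: list of hypothesis lists; each hypothesis is a tuple
        (acoustic_log_lik, lm_log_prob, raw_accuracy).
    kappa: scale applied to the acoustic log-likelihoods.
    """
    total = 0.0
    for hyps in utterances:
        # Log of p(O|s)^kappa * P(s) for every hypothesis s.
        log_w = [kappa * ac + lm for ac, lm, _ in hyps]
        m = max(log_w)  # subtract the max for numerical stability
        w = [math.exp(x - m) for x in log_w]
        z = sum(w)
        total += sum(wi * acc for wi, (_, _, acc) in zip(w, hyps)) / z
    return total

# One utterance with two competing hypotheses (illustrative scores only).
nbest = [[(-1000.0, -5.0, 4.0), (-1010.0, -4.0, 2.0)]]
f_sharp = mpe_objective(nbest, kappa=1.0)
f_smooth = mpe_objective(nbest, kappa=1.0 / 12.0)
```

With the unscaled likelihoods nearly all posterior mass falls on the top hypothesis, so `f_sharp` is close to its accuracy of 4; the smaller scale spreads the posterior and pulls `f_smooth` towards the average of the two hypotheses.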
Lattice Implementation of MMIE: Review

• Generate lattices marked with time information at HMM level
  – Numerator (num) from correct transcription
  – Denominator (den) for confusable hypotheses from recognition
• Use Extended Baum-Welch (Gopalakrishnan et al, Normandin) updates, e.g. for means

$$\hat{\mu}_{jm} = \frac{\left\{\theta^{\mathrm{num}}_{jm}(O) - \theta^{\mathrm{den}}_{jm}(O)\right\} + D\mu_{jm}}{\left\{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm}\right\} + D}$$

  – Gaussian occupancies (summed over time) are $\gamma_{jm}$ from forward-backward
  – $\theta_{jm}(O)$ is the sum of data, weighted by occupancy
• For rapid convergence use Gaussian-specific D-constant
• For better generalisation broaden posterior probability distribution
  – Acoustic scaling
  – Weakened language model (unigram)
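The mean update can be written out for a single Gaussian and a single feature dimension. The sketch below takes D as an input, since the slide does not say how the Gaussian-specific constant is chosen; the function name and the numbers are illustrative assumptions.

```python
def ebw_mean_update(theta_num, theta_den, gamma_num, gamma_den, mu_old, d_const):
    """Extended Baum-Welch mean update for one Gaussian, one dimension.

    theta_*: occupancy-weighted sums of the data.
    gamma_*: Gaussian occupancies summed over time.
    d_const: the smoothing constant D for this Gaussian.
    """
    denom = (gamma_num - gamma_den) + d_const
    if denom <= 0.0:
        raise ValueError("D must be large enough to keep the denominator positive")
    return ((theta_num - theta_den) + d_const * mu_old) / denom

# Illustrative statistics only.
mu_small_d = ebw_mean_update(12.0, 3.0, 10.0, 4.0, mu_old=1.0, d_const=8.0)
mu_large_d = ebw_mean_update(12.0, 3.0, 10.0, 4.0, mu_old=1.0, d_const=800.0)
```

A larger D keeps the new mean closer to the old one, which is why the choice of D trades stability against speed of convergence.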
Lattice Implementation of MPE

• Problem: RawAccuracy(s), defined on sentence level as (#correct − #inserted), requires alignment with the correct transcription
• Express RawAccuracy(s) as a sum of PhoneAcc(q) for all phones q in the sentence hypothesis s:

$$\mathrm{PhoneAcc}(q) = \begin{cases} 1 & \text{if correct phone} \\ 0 & \text{if substitution} \\ -1 & \text{if insertion} \end{cases}$$

• Calculating PhoneAcc(q) still requires alignment to the reference transcription
• Use an approximation to PhoneAcc(q) based on time-alignment information
  – compute the proportion e that each hypothesis phone overlaps the reference
  – gives a lower bound on the true value of RawAccuracy(s)
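To show why the exact value needs an alignment, the sketch below computes #correct − #inserted with a small dynamic programme over hypothesis and reference phones. It assumes the usual best (minimum edit distance) alignment, which the slides do not spell out; the function name is hypothetical.

```python
def raw_accuracy(hyp, ref):
    """Exact #correct - #inserted under the best alignment of hyp to ref."""
    n_h, n_r = len(hyp), len(ref)
    # best[i][j]: best score aligning the first i hyp phones to the first j ref phones.
    best = [[0.0] * (n_r + 1) for _ in range(n_h + 1)]
    for i in range(1, n_h + 1):
        best[i][0] = best[i - 1][0] - 1.0  # hyp phones with no reference: insertions
    for i in range(1, n_h + 1):
        for j in range(1, n_r + 1):
            match = 1.0 if hyp[i - 1] == ref[j - 1] else 0.0  # correct or substitution
            best[i][j] = max(
                best[i - 1][j - 1] + match,  # align hyp phone with ref phone
                best[i - 1][j] - 1.0,        # hyp phone is an insertion
                best[i][j - 1],              # ref phone is deleted (scores 0)
            )
    return best[n_h][n_r]

# Hypothesis "a b b d" against reference "a b c": 2 correct, 1 insertion.
exact = raw_accuracy(["a", "b", "b", "d"], ["a", "b", "c"])
```

Running such an alignment for every path through a lattice is what the time-overlap approximation avoids.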
Approximating PhoneAcc using Time Information

$$\mathrm{PhoneAcc}(q) = \begin{cases} -1 + 2e & \text{if same phone} \\ -1 + e & \text{if different phone} \end{cases}$$

Reference:  a b c
Hypothesis: a b b d

Hyp. phone   Overlapped ref. phone   Proportion e   −1 + (correct: 2e, incorrect: e)   Max of above
a            a                       1.0            1.0                                1.0
b            b                       0.8            0.6                                0.6
b            b                       0.2            −0.6                               −0.6
d            b                       0.15           −0.85
d            c                       0.85           −0.15                              −0.15

Approximated sentence raw accuracy from above = 0.85
Exact value of raw accuracy: 2 corr − 1 ins = 1
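The example can be reproduced in a few lines. The sketch takes the overlap proportions e as given rather than computing them from phone boundaries; the pairing of each e value with a reference phone is read off the example and is an assumption where the extracted slide is ambiguous, as are the function and variable names.

```python
def approx_phone_acc(hyp_phone, overlaps):
    """Approximate PhoneAcc(q) from time overlaps with reference phones.

    overlaps: list of (ref_phone, e) pairs, e being the overlap proportion.
    The best-scoring overlapped reference phone is used.
    """
    return max(
        -1.0 + (2.0 * e if ref_phone == hyp_phone else e)
        for ref_phone, e in overlaps
    )

# Hypothesis "a b b d" against reference "a b c", with the e values of the example.
hypothesis = [
    ("a", [("a", 1.0)]),
    ("b", [("b", 0.8)]),
    ("b", [("b", 0.2)]),
    ("d", [("b", 0.15), ("c", 0.85)]),
]
per_phone = [approx_phone_acc(q, ov) for q, ov in hypothesis]
approx_raw_accuracy = sum(per_phone)
```

The sum comes to 0.85, below the exact value of 1, consistent with the approximation being a lower bound.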
PhoneAcc Approximation For Lattices

• Calculate PhoneAcc(q) for each phone q, then find $\partial \mathcal{F}_{\mathrm{MPE}}(\lambda) / \partial \log p(q)$ (forward-backward)

[Figure: the correct phone sequence, a hypothesis lattice whose arcs are labelled with PhoneAcc values, and the same lattice labelled with dF / d(phone log prob). Arcs on better-than-average paths and arcs on worse-than-average paths are marked separately.]
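For a very small lattice the derivative can be obtained by enumerating paths, which stands in here for the forward-backward pass named on the slide. The sketch relies on the identity that the derivative with respect to an arc's scaled log-likelihood equals the arc posterior times the difference between the average accuracy of paths through that arc and the average over all paths. The lattice, its scores and all names are hypothetical.

```python
import math

def mpe_arc_derivatives(arcs, paths):
    """Return (c_avg, {arc: dF/d(scaled log-likelihood of arc)}) for one utterance.

    arcs: {arc_id: (scaled_log_lik, phone_acc)}; LM scores are assumed to be
        folded into the scaled log-likelihoods.
    paths: every path through the lattice, each a list of arc ids.
    """
    log_w = [sum(arcs[a][0] for a in p) for p in paths]
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    z = sum(w)
    post = [wi / z for wi in w]                        # path posteriors
    acc = [sum(arcs[a][1] for a in p) for p in paths]  # approximate RawAccuracy per path
    c_avg = sum(pi * ai for pi, ai in zip(post, acc))
    deriv = {}
    for a in arcs:
        gamma = sum(pi for pi, p in zip(post, paths) if a in p)  # arc posterior
        if gamma == 0.0:
            deriv[a] = 0.0
            continue
        c_arc = sum(pi * ai for pi, ai, p in zip(post, acc, paths) if a in p) / gamma
        deriv[a] = gamma * (c_arc - c_avg)  # positive on better-than-average paths
    return c_avg, deriv

# Hypothetical lattice: a shared first arc, then two equally likely alternatives.
arcs = {"a1": (0.0, 1.0), "b1": (0.0, 0.6), "d1": (0.0, -0.15)}
paths = [["a1", "b1"], ["a1", "d1"]]
c_avg, deriv = mpe_arc_derivatives(arcs, paths)
```

The shared arc gets a derivative of zero because every path uses it, while the two alternatives get equal and opposite values. Real lattices have far too many paths to enumerate, hence the forward-backward computation.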
Applying Extended Baum-Welch to MPE

• Use EBW update formulae as for MMIE but with modified MPE statistics
• For MMIE, the occupation probability for an arc q equals $\frac{1}{\kappa} \frac{\partial \mathcal{F}_{\mathrm{MMIE}}(\lambda)}{\partial \log p(q)}$ for the numerator (× −1 for the denominator). The denominator occupancy-weighted statistics are subtracted from the numerator in the update formulae
• Statistics for the MPE update use the derivative $\frac{1}{\kappa} \frac{\partial \mathcal{F}_{\mathrm{MPE}}(\lambda)}{\partial \log p(q)}$ of the criterion w.r.t. the phone arc log likelihood, which can be calculated efficiently
• Either MPE numerator or denominator statistics are updated depending on the sign of $\frac{\partial \mathcal{F}_{\mathrm{MPE}}(\lambda)}{\partial \log p(q)}$, which is the “MPE arc occupancy”
• After accumulating statistics, apply EBW equations
• EBW is viewed as a gradient descent technique and can be shown to be a valid update for MPE
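The sign-based split of the statistics can be sketched for one Gaussian and one feature dimension. The arc-level summaries, the helper names and the numbers below are assumptions for illustration: positive MPE arc occupancies add to the numerator statistics, negative ones add their magnitude to the denominator statistics, and the result goes into the EBW mean update.

```python
def accumulate_mpe_stats(arcs):
    """Split arc-level statistics into MPE numerator and denominator sums.

    arcs: list of (mpe_occ, gamma, theta), where mpe_occ is the signed
        "MPE arc occupancy" and gamma / theta are the Gaussian occupancy and
        occupancy-weighted data sum collected within that arc.
    """
    stats = {"gamma_num": 0.0, "theta_num": 0.0, "gamma_den": 0.0, "theta_den": 0.0}
    for mpe_occ, gamma, theta in arcs:
        side = "num" if mpe_occ > 0.0 else "den"
        weight = abs(mpe_occ)
        stats["gamma_" + side] += weight * gamma
        stats["theta_" + side] += weight * theta
    return stats

def ebw_mean(stats, mu_old, d_const):
    """EBW mean update from the accumulated statistics (D supplied by the caller)."""
    num = (stats["theta_num"] - stats["theta_den"]) + d_const * mu_old
    den = (stats["gamma_num"] - stats["gamma_den"]) + d_const
    return num / den

# Two arcs sharing one Gaussian: a better-than-average arc whose data average 1.5
# and a worse-than-average arc whose data average 0.5 (illustrative values only).
arc_stats = [(0.1875, 4.0, 6.0), (-0.1875, 4.0, 2.0)]
stats = accumulate_mpe_stats(arc_stats)
mu_new = ebw_mean(stats, mu_old=1.0, d_const=2.0)
```

Starting from a mean of 1.0, the update moves it towards the data seen on the better-than-average arc and away from the data on the worse-than-average arc.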