Neural Networks for Distant Speech Recognition
Steve Renals, Centre for Speech Technology Research (CSTR), University of Edinburgh


  1. Neural Networks for Distant Speech Recognition
     Steve Renals, Centre for Speech Technology Research, University of Edinburgh
     s.renals@ed.ac.uk, 14 May 2014
     Joint work with Paweł Świętojański; significant contributions from Peter Bell & Arnab Ghoshal

  2. Distant Speech Recognition
     [Cartoon of a distant-microphone recording: overlapping speakers ("hmm ... so you have your energy source ... your user interface ... who's controlling the chip ..."), clicks and rustles.]

  3. Why study meetings?
     • Natural communication scenes
     • Multistream: multiple asynchronous streams of data
     • Multimodal: words, prosody, gesture, attention
     • Multiparty: social roles, individual and group behaviours
     • Meetings offer realistic, complex behaviours in a circumscribed setting
     • Applications based on meeting capture, analysis, recognition and interpretation
     • A great arena for interdisciplinary research

  4. The "ASR Complete" problem
     • Transcription of conversational speech
     • Distant speech recognition with microphone arrays
     • Speech separation, multiple acoustic channels
     • Reverberation
     • Overlap detection
     • Utterance and speaker segmentation
     • Disfluency detection

  5. Today's Menu
     • MDM corpora: the ICSI and AMI meetings corpora
     • MDM systems in 2010: GMMs, beamforming, and lots of adaptation
     • MDM systems in 2014: neural networks, less beamforming, and less adaptation

  6. Corpora

  7. ICSI Corpus
     Headset mics; tabletop boundary mics

  8. AMI Corpus
     Headset mic; lapel mic; mic array
     http://corpus.amiproject.org

  9. AMI Corpus example

  10. Meeting recording (c. 2005)

  11. Meeting recording (2010s)

  12. GMM-based systems (state of the art, 2010)

  13. Basic system
     [Pipeline diagram: beamformer front end]
     • Speech/non-speech segmentation
     • PLP/MFCC features
     • ML-trained HMM/GMM system (122k 39-D Gaussians)
     • 50k vocabulary
     • Trigram language model (small: 26M words, PPL 78)
     • Weighted FST decoder
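The acoustics here are standard 39-D features. A minimal sketch of such a front end (13 MFCCs plus deltas and delta-deltas), assuming librosa; the actual system used its own HTK-style feature extraction, and the filename is hypothetical:

```python
# Sketch of a 39-D MFCC front end (13 cepstra + deltas + delta-deltas).
# librosa is an assumption; the AMIDA system used its own front end.
import librosa
import numpy as np

y, sr = librosa.load("meeting_channel.wav", sr=16000)  # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # (13, T) static cepstra
d1 = librosa.feature.delta(mfcc)                       # first derivatives
d2 = librosa.feature.delta(mfcc, order=2)              # second derivatives
feats = np.vstack([mfcc, d1, d2]).T                    # (T, 39) feature frames
```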

  14. Additional components
     • Microphone array front end
     • Speaker / channel adaptation
       • Vocal tract length normalisation (VTLN)
       • Maximum likelihood linear regression (MLLR)
     • Input feature transform: LDA/STC
     • Discriminative training, e.g. boosted maximum mutual information (BMMI)
     • Discriminative features
     • Model combination

  15. GMM results (WER)
     ASR word error rates for GMM/HMM systems (WER/%):

         Condition            AMI    ICSI
         SDM                  63.2   56.1
         MDM (beamforming)    54.8   46.8
         IHM                  29.6   -

  16. Microphone array processing for distant speech recognition
     • Mic array processing in the AMIDA ASR system (Hain et al., 2012):
       • Wiener noise filter
       • Filter-sum beamforming based on time delay of arrival (TDOA)
       • Viterbi smoother post-processing
       • Track the direction of maximum energy
     • Optimise beamforming for speech recognition:
       • LIMABEAM (Seltzer et al., 2004, 2006) [explicit]
       • Simply concatenate feature vectors from multiple mics (Marino and Hain, 2011) [implicit]
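As a concrete illustration of the filter-sum idea, a minimal numpy sketch of GCC-PHAT TDOA estimation followed by delay-and-sum combination. All names are illustrative, the alignment is crude integer-sample shifting, and the real front end adds Wiener filtering and Viterbi smoothing of the delay track:

```python
# Minimal sketch: GCC-PHAT delay estimation + delay-and-sum beamforming.
import numpy as np

def gcc_phat_delay(ref, sig, fs, max_tau=5e-4):
    """Delay of `sig` relative to `ref`, in samples (PHAT weighting)."""
    n = len(ref) + len(sig)
    R = np.fft.rfft(ref, n) * np.conj(np.fft.rfft(sig, n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)        # PHAT-weighted correlation
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return np.argmax(np.abs(cc)) - max_shift

def delay_and_sum(channels, fs):
    """channels: list of equal-length 1-D arrays, one per microphone."""
    ref = channels[0]
    out = np.array(ref, dtype=float)
    for ch in channels[1:]:
        tau = gcc_phat_delay(ref, ch, fs)
        out += np.roll(ch, tau)          # crude integer-sample alignment
    return out / len(channels)
```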

  17. (Deep) Neural Networks

  18. The Perceptron (Rosenblatt)

  20. The Perceptron (Rosenblatt): NN Winter #1
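The slides show Rosenblatt's model; for reference, a minimal sketch of the classic perceptron learning rule (names and data are illustrative):

```python
# Rosenblatt's perceptron update rule: nudge the weights toward any
# misclassified example. Labels are assumed to be in {-1, +1}.
import numpy as np

def train_perceptron(X, y, epochs=10, lr=1.0):
    """X: (N, D) inputs; y: (N,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:   # misclassified: move the boundary
                w += lr * y_i * x_i
                b += lr * y_i
    return w, b
```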

  21. MLPs and backprop (mid 1980s)

  22. MLPs and backprop
     • Train multiple layers of hidden units: nested nonlinear functions
     • Powerful feature detectors
     • Posterior probability estimation
     • Theorem: any function can be approximated with a single hidden layer
     [Network diagram: inputs $x_i$, hidden units $z_j = h(b_j)$, outputs $y_k$, weights $w^{(1)}_{ji}$ and $w^{(2)}_{kj}$, with the backprop recursion $\delta_j = h'(b_j) \sum_k \delta_k w^{(2)}_{kj}$]
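A minimal numpy sketch of the forward and backward passes for one hidden layer, implementing the delta recursion on the slide with a sigmoid nonlinearity and softmax outputs (shapes and names are illustrative):

```python
# One-hidden-layer MLP: forward pass plus the backprop recursion
# delta_j = h'(b_j) * sum_k delta_k * w_kj from the slide.
import numpy as np

def sigmoid(b):
    return 1.0 / (1.0 + np.exp(-b))

def forward_backward(x, t, W1, W2):
    # forward pass
    b = W1 @ x                              # hidden pre-activations b_j
    z = sigmoid(b)                          # hidden activations z_j
    a = W2 @ z                              # output pre-activations
    y = np.exp(a - a.max()); y /= y.sum()   # softmax outputs y_k

    # backward pass (cross-entropy loss, one-hot target t)
    delta_out = y - t                                 # output deltas
    delta_hid = z * (1 - z) * (W2.T @ delta_out)      # h'(b_j) * sum_k delta_k w_kj
    grad_W2 = np.outer(delta_out, z)
    grad_W1 = np.outer(delta_hid, x)
    return grad_W1, grad_W2
```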

  23. "Hybrid" neural network acoustic models (1990s)
     [Left: DARPA Resource Management 1992 plot of error (%) against millions of parameters, comparing CI-HMM, CI-MLP, CI-RNN, CD-HMM and CD-RNN systems. Right: Broadcast News 1998 system diagram combining perceptual linear prediction and modulation spectrogram features, Chronos decoders and ROVER hypothesis combination; 20.8% WER (best GMM-based system: 13.5%).]
     Bourlard & Morgan, 1994; Renals, Morgan, Cohen & Franco, ICASSP 1992; Robinson, IEEE TNN 1994; Cook, Christie, Ellis, Fosler-Lussier, Gotoh, Kingsbury, Morgan, Renals, Robinson & Williams, DARPA, 1999
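Hybrid systems of this era used network posteriors as scaled likelihoods for HMM decoding, dividing by the state priors since p(x|s) is proportional to P(s|x) / P(s) (Bourlard & Morgan, 1994). A one-function sketch, assuming log-domain inputs:

```python
# The hybrid trick: turn frame-level posteriors P(s|x) into scaled
# likelihoods for Viterbi decoding by dividing out the state priors.
import numpy as np

def scaled_log_likelihoods(log_post, log_priors):
    """log_post: (T, S) network log P(s|x); log_priors: (S,) log P(s),
    typically estimated from state counts in the training alignment."""
    return log_post - log_priors[np.newaxis, :]
```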

  24. NN acoustic models: limitations vs GMMs
     • Computationally restricted to monophone outputs
     • CD-RNN factored over multiple networks: limited within-word context
     • Training not easily parallelisable
       • slower experimental turnaround
       • less complex systems (fewer parameters): RNN <100k parameters, MLP ~1M parameters
     • Rapid adaptation hard (cf. MLLR)

  25. s-iy+l  f-iy-l  t-iy-n  t-iy-m
     GMM, SVM, CRF ... NN Winter #2

  26. Discriminative long-term features: Tandem
     • A neural network based technique provided the biggest increase in speech recognition accuracy during the 2000s
     • Tandem features (Hermansky, Ellis & Sharma, 2000)
       • use (transformed) outputs or (bottleneck) hidden values as input features for a GMM
       • deep networks, e.g. a 5-layer MLP to obtain bottleneck features (Grézl, Karafiát, Kontár & Černocký, 2007)
       • reduce errors by about 10% relative (Hain, Burget, Dines, Garner, Grézl, el Hannani, Huijbregts, Karafiát, Lincoln & Wan, 2012)
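A sketch of the bottleneck idea: train an MLP with a narrow hidden layer on phone targets, then hand that layer's activations to a conventional GMM-HMM. PyTorch is an assumption (the cited systems predate it), and the layer sizes are illustrative, with the 26-unit bottleneck taken from the next slide:

```python
# Bottleneck/tandem sketch: the narrow layer's activations become
# input features for a GMM-HMM; the phone-class outputs exist only
# to train the network discriminatively.
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    def __init__(self, n_in=351, n_hid=1500, n_bn=26, n_out=4000):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_in, n_hid), nn.Sigmoid(),
            nn.Linear(n_hid, n_hid), nn.Sigmoid(),
            nn.Linear(n_hid, n_bn),              # narrow bottleneck layer
        )
        self.back = nn.Sequential(
            nn.Sigmoid(), nn.Linear(n_bn, n_hid), nn.Sigmoid(),
            nn.Linear(n_hid, n_out),             # phone targets for training
        )

    def forward(self, x):
        return self.back(self.front(x))

    def bottleneck_features(self, x):
        with torch.no_grad():
            return self.front(x)                 # features handed to the GMM-HMM
```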

  27. Deep Neural Networks (2010s)
     [Architecture diagram: MFCC inputs (39 x 9 = 351), 3-8 hidden layers of 2000 units, a tandem bottleneck layer of 26 units, and a CD hybrid output layer of 12000 phone targets.]
     Dahl, Yu, Deng & Acero, IEEE TASLP 2012; Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath & Kingsbury, IEEE SP Mag 2012
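A minimal sketch of the CD hybrid network in the diagram: 351 spliced MFCC inputs, 2000-unit hidden layers, and a wide softmax over 12000 context-dependent targets. Again PyTorch is an assumption; the depth and sizes follow the slide:

```python
# CD hybrid DNN as in the diagram: 9x39 spliced MFCC inputs,
# several sigmoid hidden layers, wide context-dependent output layer.
import torch.nn as nn

def cd_hybrid_dnn(n_in=351, n_hid=2000, n_layers=6, n_out=12000):
    layers = [nn.Linear(n_in, n_hid), nn.Sigmoid()]
    for _ in range(n_layers - 1):
        layers += [nn.Linear(n_hid, n_hid), nn.Sigmoid()]
    layers.append(nn.Linear(n_hid, n_out))   # log-softmax applied in the loss
    return nn.Sequential(*layers)
```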

  28. Deep neural networks: what's new?

  29. Deep neural networks
     1. Unsupervised pretraining (Hinton, Osindero & Teh, 2006)
        • Train a stacked RBM generative model, then fine-tune (see the CD-1 sketch below)
        • Good initialisation
        • Regularisation
     2. Deep: many hidden layers
        • Deeper models are more accurate
        • GPUs gave us the computational power
     3. Wide output layer (context-dependent phone classes) rather than factorising into multiple nets
        • More accurate phone models
        • GPUs gave us the computational power
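A minimal numpy sketch of one contrastive-divergence (CD-1) update for a single binary RBM, the building block of the stacked-RBM pretraining above. Real recipes stack several RBMs greedily and then fine-tune the whole network with backprop; all names here are illustrative:

```python
# One CD-1 step for a binary RBM: one Gibbs half-step up, one down,
# then a gradient step on the difference of data and model statistics.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01, rng=np.random):
    """v0: (n_vis,) data vector; W: (n_vis, n_hid); a, b: visible/hidden biases."""
    h0_prob = sigmoid(v0 @ W + b)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)  # sample hiddens
    v1_prob = sigmoid(h0 @ W.T + a)                           # reconstruct visibles
    h1_prob = sigmoid(v1_prob @ W + b)
    W += lr * (np.outer(v0, h0_prob) - np.outer(v1_prob, h1_prob))
    a += lr * (v0 - v1_prob)
    b += lr * (h0_prob - h1_prob)
    return W, a, b
```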

  31. K. Veselý, A. Ghoshal, L. Burget and D. Povey, "Sequence-discriminative training of deep neural networks", Interspeech 2013
     Switchboard, 300-hour training set; WER/% on the Hub5 '00 test set:

         System       SWB    CHE    Avg
         GMM/BMMI     18.6   33.0   25.8
         DNN/CE       14.2   25.7   20.0
         DNN/sMBR     12.6   24.1   18.4

     Recipes: http://kaldi.sf.net/
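For reference, the sMBR criterion maximised in the third system is the expected state accuracy over the lattice of hypotheses for each utterance. This statement of the objective follows the usual formulation (the notation is assumed, not copied from the slide):

```latex
% sMBR objective (Vesely et al., 2013): maximise the expected frame
% state accuracy over hypotheses W for each utterance u, where
% A(W, W_u) counts correct state labels against the reference W_u
% and kappa is the acoustic scale.
\mathcal{F}_{\mathrm{sMBR}}(\theta)
  = \sum_u \frac{\sum_W p_\theta(O_u \mid W)^{\kappa}\, P(W)\, A(W, W_u)}
                {\sum_{W'} p_\theta(O_u \mid W')^{\kappa}\, P(W')}
```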

  33. Neural network acoustic models
     [Architecture: 9x39 MFCC inputs; 3-8 hidden layers of ~2000 units each (automatically learned feature extraction); softmax output layer over ~6000 CD phone outputs.]
     Aim: learn representations for distant speech recognition based on multiple mic channels

  34. Neural network acoustic models
     [Same architecture, but with multi-channel input in place of the single-channel 9x39 MFCC inputs. Spectral domain?]
     Aim: learn representations for distant speech recognition based on multiple mic channels

  35. Neural network acoustic models for distant speech recognition
     • NNs have proven to give accurate systems for a variety of tasks: TIMIT, WSJ, Switchboard, Broadcast News, Lectures, Aurora4, ...
     • NNs can integrate information from multiple frames of data (in comparison with GMMs); a frame-splicing sketch follows this list
     • NNs can construct feature representations from multiple sources of data
     • NNs are well suited to learning multiple modules with a common objective function
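A sketch of the frame splicing referred to above: stacking a +/-4 frame context window gives the 9 x 39 = 351-dimensional inputs used by the hybrid networks (the function name and padding choice are illustrative):

```python
# Frame splicing: stack each frame with its +/-`context` neighbours,
# turning (T, D) features into (T, (2*context+1)*D) network inputs.
import numpy as np

def splice(feats, context=4):
    """feats: (T, D) frames -> (T, (2*context+1)*D) spliced inputs."""
    T = len(feats)
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])
```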

  36. Baseline DNN system
     [Front end: mic array with Wiener-filter noise cancellation, smoothed TDOA estimates, delay-sum beamforming. Network: 11x120 FBANK inputs, 6 hidden layers of 2048 units, ~4000 tied-state outputs.]
     50,000-word pronunciation dictionary; small trigram LM (PPL 78, trained on 26M words)

  37. Baseline GMM results
     ASR word error rates for GMM/HMM systems (WER/%, repeated from slide 15):

         Condition            AMI    ICSI
         SDM                  63.2   56.1
         MDM (beamforming)    54.8   46.8
         IHM                  29.6   -
