

  1. Automatic Speech Recognition (CS753), Lecture 11: Recurrent Neural Network (RNN) Models for ASR. Instructor: Preethi Jyothi. Feb 9, 2017


  2. Recap: Hybrid DNN-HMM Systems
  • Instead of GMMs, use scaled DNN posteriors as the HMM observation probabilities
  • The DNN is trained using triphone labels derived from a forced alignment ("Viterbi") step
  • Forced alignment: Given a training utterance {O, W}, find the most likely sequence of states (and hence triphone state labels) using a set of trained triphone HMM models, M. Here M is constrained by the triphones in W.
  (Figure: DNN whose output layer gives posteriors over triphone state labels; the input is a fixed window of 5 speech frames, with 39 features per frame)
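
A minimal numpy sketch of the "scaled posteriors" step (the function and variable names are illustrative, and the state priors are assumed to come from state counts in the forced alignments):

```python
import numpy as np

def scaled_log_likelihoods(dnn_posteriors, state_priors, eps=1e-10):
    """Convert DNN posteriors p(state | frame) into scaled likelihoods.

    By Bayes' rule, p(frame | state) is proportional to
    p(state | frame) / p(state); the constant p(frame) cancels in Viterbi.
    dnn_posteriors: (T, S) array, one row of state posteriors per frame.
    state_priors:   (S,) array of state priors, e.g. relative frequencies
                    of each triphone state in the forced alignments.
    """
    return np.log(dnn_posteriors + eps) - np.log(state_priors + eps)

# Toy example with 5 frames and 3 triphone states:
T, S = 5, 3
posteriors = np.random.dirichlet(np.ones(S), size=T)   # rows sum to 1
priors = np.array([0.5, 0.3, 0.2])
log_obs = scaled_log_likelihoods(posteriors, priors)   # fed to the HMM decoder
```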

  3. Recap: Tandem DNN-HMM Systems
  • Neural network outputs are used as "features" to train HMM-GMM models
  • Use a low-dimensional bottleneck layer and extract features from the bottleneck layer's representation
  (Figure: network with an Input Layer, a Bottleneck Layer and an Output Layer)

  4. Feedforward DNNs we've seen so far…
  • Assume independence among the training instances
  • An independent decision is made about classifying each individual speech frame
  • The network state is completely reset after each speech frame is processed
  • This independence assumption fails for data like speech, which has temporal and sequential structure

  5. Recurrent Neural Networks
  • Recurrent Neural Networks (RNNs) work naturally with sequential data and process it one element at a time
  • HMMs also attempt to model time dependencies in a similar way. How is this different?
  • HMMs are limited by the size of the state space. Inference becomes intractable if the state space grows very large!
  • What about RNNs?

  6. RNN definition
  Two main equations govern RNNs:
    h_t = H(W x_t + V h_{t-1} + b^(h))
    y_t = O(U h_t + b^(y))
  where W, V, U are the matrices of input-hidden, hidden-hidden and hidden-output weights respectively; b^(h) and b^(y) are bias vectors.
  (Figure: the recurrent cell with input x_t, hidden state h_t and output y_t, unfolded over time steps 1, 2, 3, …, starting from h_0)
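
A minimal numpy sketch of these two equations, choosing tanh for H and softmax for O (as in the vanilla model later in the lecture); all names and sizes are illustrative:

```python
import numpy as np

def rnn_forward(xs, W, V, U, b_h, b_y, h0=None):
    """Run the two RNN equations over an input sequence.

    h_t = tanh(W x_t + V h_{t-1} + b_h)     (hidden update, H = tanh here)
    y_t = softmax(U h_t + b_y)              (output, O = softmax here)
    xs: (T, d_in) array of input vectors, one per time step.
    """
    h = np.zeros(V.shape[0]) if h0 is None else h0
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W @ x + V @ h + b_h)
        logits = U @ h + b_y
        y = np.exp(logits - logits.max())
        ys.append(y / y.sum())              # softmax output distribution
        hs.append(h)
    return np.array(hs), np.array(ys)

# Toy usage: 4 time steps, 3-dim inputs, 5 hidden units, 2 output classes
rng = np.random.default_rng(0)
W, V, U = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
hs, ys = rnn_forward(rng.normal(size=(4, 3)), W, V, U, np.zeros(5), np.zeros(2))
```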

  7. Recurrent Neural Networks
  • Recurrent Neural Networks (RNNs) work naturally with sequential data and process it one element at a time
  • HMMs also attempt to model time dependencies in a similar way. How is this different?
  • HMMs are limited by the size of the state space. Inference becomes intractable if the state space grows very large!
  • What about RNNs? RNNs are designed to capture long-range dependencies, unlike HMMs: the network's state space is exponential in the number of nodes in a hidden layer

  8. Training RNNs
  • An unrolled RNN is just a very deep feedforward network
  • For a given input sequence:
    • create the unrolled network
    • add a loss function node to the network
    • then, use backpropagation to compute the gradients
  • This algorithm is known as backpropagation through time (BPTT); see the sketch below
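
A sketch of BPTT using PyTorch autograd on the unrolled graph (the per-frame classification targets and all sizes are assumed here only so that there is a loss to backpropagate):

```python
import torch
import torch.nn.functional as F

# Unrolled-RNN training step: build the computation graph over the whole
# sequence, attach a loss at each output, then backpropagate through time.
d_in, d_h, d_out, T = 3, 5, 4, 6
W = torch.randn(d_h, d_in, requires_grad=True)
V = torch.randn(d_h, d_h, requires_grad=True)
U = torch.randn(d_out, d_h, requires_grad=True)
b_h = torch.zeros(d_h, requires_grad=True)
b_y = torch.zeros(d_out, requires_grad=True)

xs = torch.randn(T, d_in)                 # input sequence
targets = torch.randint(0, d_out, (T,))   # per-frame class labels (assumed)

h = torch.zeros(d_h)
loss = 0.0
for t in range(T):                        # unroll the network over time
    h = torch.tanh(W @ xs[t] + V @ h + b_h)
    logits = U @ h + b_y
    loss = loss + F.cross_entropy(logits.unsqueeze(0), targets[t].unsqueeze(0))

loss.backward()                           # BPTT: gradients flow back through every step
# W.grad, V.grad, U.grad now hold gradients summed over all time steps
```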


  9. Deep RNNs
  (Figure: two stacked recurrent layers; layer 1 reads the inputs x_1, x_2, x_3 and layer 2 reads layer 1's hidden states)
  • RNNs can be stacked in layers to form deep RNNs (see the sketch below)
  • Empirically shown to perform better than shallow RNNs on ASR [G13]
  [G13] A. Graves, A. Mohamed, G. Hinton, "Speech Recognition with Deep Recurrent Neural Networks", ICASSP, 2013.
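
A minimal numpy sketch of the stacking idea, where each layer's hidden-state sequence becomes the input sequence of the layer above (parameters and sizes are illustrative):

```python
import numpy as np

def deep_rnn_forward(xs, layers):
    """Stack several recurrent layers: layer l reads the hidden states of
    layer l-1 as its input sequence (layer 0 reads the acoustic features).

    layers: list of (W, V, b) tuples, one per layer; tanh nonlinearity assumed.
    """
    seq = xs
    for W, V, b in layers:
        h, hs = np.zeros(V.shape[0]), []
        for x in seq:
            h = np.tanh(W @ x + V @ h + b)
            hs.append(h)
        seq = np.array(hs)        # becomes the input sequence of the next layer
    return seq                    # hidden states of the top layer

# Two stacked layers: 3-dim inputs -> 5 hidden units -> 4 hidden units
rng = np.random.default_rng(1)
layers = [(rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), np.zeros(5)),
          (rng.normal(size=(4, 5)), rng.normal(size=(4, 4)), np.zeros(4))]
top_states = deep_rnn_forward(rng.normal(size=(6, 3)), layers)
```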

  10. Vanilla RNN Model
    h_t = H(W x_t + V h_{t-1} + b^(h))
    y_t = O(U h_t + b^(y))
  H: element-wise application of the sigmoid or tanh function
  O: the softmax function
  Vanilla RNNs run into the problems of exploding and vanishing gradients.

  11. Exploding/Vanishing Gradients
  • In deep networks, the gradient in an early layer is computed as a product of terms from all the later layers
  • This leads to unstable gradients:
  • If the terms from later layers are large enough, gradients in early layers (which are the product of these terms) can grow exponentially large: exploding gradients
  • If the terms from later layers are small, gradients in early layers will tend to decrease exponentially: vanishing gradients
  • To address this problem in RNNs, Long Short-Term Memory (LSTM) units were proposed [HS97]
  [HS97] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
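
A small numerical illustration of the effect (assuming, for simplicity, that the repeated factor is the same scaled orthogonal matrix at every step):

```python
import numpy as np

# The backpropagated gradient at an early step contains a product of many
# Jacobians of the hidden-to-hidden map. Repeatedly multiplying by a factor
# whose scale is below or above 1 shrinks or grows the product exponentially.
rng = np.random.default_rng(0)
for scale in (0.5, 1.5):                     # "small" vs "large" later-layer terms
    Q = np.linalg.qr(rng.normal(size=(50, 50)))[0]   # random orthogonal matrix
    V = scale * Q
    prod = np.eye(50)
    for _ in range(100):                     # 100 time steps
        prod = prod @ V
    print(scale, np.linalg.norm(prod))       # tiny norm (vanishing) vs huge norm (exploding)
```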

  12. Long Short-Term Memory Cells
  (Figure: memory cell with input, output and forget gates, each applying a multiplicative ⊗ operation)
  • Memory cell: a neuron that stores information over long time periods
  • Forget gate: when on, the memory cell retains its previous contents; otherwise, the memory cell forgets its contents
  • When the input gate is on, write into the memory cell
  • When the output gate is on, read from the memory cell
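
A minimal numpy sketch of one LSTM step with the three gates described above (this follows a common formulation; the exact equations vary slightly across papers, and all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. params holds one (W, V, b) triple per gate plus one
    for the candidate cell update."""
    (Wi, Vi, bi), (Wf, Vf, bf), (Wo, Vo, bo), (Wc, Vc, bc) = params
    i = sigmoid(Wi @ x + Vi @ h_prev + bi)     # input gate: write into the cell?
    f = sigmoid(Wf @ x + Vf @ h_prev + bf)     # forget gate: retain previous contents?
    o = sigmoid(Wo @ x + Vo @ h_prev + bo)     # output gate: read from the cell?
    c_tilde = np.tanh(Wc @ x + Vc @ h_prev + bc)
    c = f * c_prev + i * c_tilde               # memory cell stores long-term state
    h = o * np.tanh(c)                         # exposed hidden state
    return h, c
```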

  13. Bidirectional RNNs
  (Figure: a forward layer with states h_{1,f} … h_{3,f} and a backward layer with states h_{1,b} … h_{3,b} over the input sequence; the forward and backward outputs y_{t,f} and y_{t,b} are concatenated at each position)
  • BiRNNs process the data in both directions with two separate hidden layers
  • Outputs from both hidden layers are concatenated at each position (see the sketch below)
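
A minimal numpy sketch of the bidirectional forward pass with concatenated outputs (tanh cells and illustrative parameter names):

```python
import numpy as np

def birnn_forward(xs, fwd, bwd):
    """Bidirectional RNN: run one recurrence left-to-right and another
    right-to-left, then concatenate the two hidden states at each position.
    fwd, bwd: (W, V, b) parameter triples for the two directions."""
    def run(seq, W, V, b):
        h, hs = np.zeros(V.shape[0]), []
        for x in seq:
            h = np.tanh(W @ x + V @ h + b)
            hs.append(h)
        return hs

    h_f = run(xs, *fwd)                    # forward layer over x_1 ... x_T
    h_b = run(xs[::-1], *bwd)[::-1]        # backward layer, realigned to each x_t
    return [np.concatenate([f, b]) for f, b in zip(h_f, h_b)]
```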

  14. RNN-based ASR system (Automatic Speech Recognition, CS753, Feb 9, 2017)


  15. ASR with RNNs
  • Neural networks in ASR systems are typically a single component (the acoustic model) in a complex pipeline
  • Limitations:
    1. Frame-level training targets are derived from HMM-based alignments
    2. The objective function optimized in NNs is very different from the final evaluation metric
  • Goal: a single RNN model that addresses these issues and replaces as much of the speech pipeline as possible [G14]
  [G14] A. Graves, N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks", ICML, 2014.

  16. RNN Architecture
  (Figure: a deep bidirectional network with forward states h_{t,f}, backward states h_{t,b}, inputs x_{t-1}, x_t, x_{t+1} and outputs y_{t-1}, y_t, y_{t+1})
  • H was implemented using LSTMs in [G14]. Input: acoustic feature vectors, one per frame. Output: characters + space
  • Deep bidirectional LSTM networks were used
  [G14] A. Graves, N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks", ICML, 2014.
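
A rough PyTorch sketch in the spirit of the [G14] setup, not its exact configuration: a deep bidirectional LSTM over acoustic frames with a per-frame distribution over characters, space and a CTC blank (all sizes are placeholders):

```python
import torch
import torch.nn as nn

class BiLSTMAcousticModel(nn.Module):
    def __init__(self, n_feats=40, n_hidden=256, n_layers=3, n_symbols=28 + 1):
        super().__init__()
        # Deep bidirectional LSTM over the frame sequence
        self.lstm = nn.LSTM(n_feats, n_hidden, num_layers=n_layers,
                            bidirectional=True, batch_first=True)
        # Concatenation of both directions -> characters + space + blank
        self.out = nn.Linear(2 * n_hidden, n_symbols)

    def forward(self, frames):                       # frames: (batch, T, n_feats)
        states, _ = self.lstm(frames)
        return self.out(states).log_softmax(dim=-1)  # (batch, T, n_symbols)

model = BiLSTMAcousticModel()
log_probs = model(torch.randn(2, 100, 40))           # 2 utterances, 100 frames each
```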

  17. Connectionist Temporal Classification (CTC)
    Pr(y | x) = Σ_{a ∈ B^(-1)(y)} Pr(a | x),  where  Pr(a | x) = Π_{t=1}^{T} Pr(a_t, t | x)   … (1)
    For a target y*,  CTC(x) = −log Pr(y* | x)   … (2)
    L(x) = Σ_y Pr(y | x) L(x, y) = Σ_a Pr(a | x) L(x, B(a))   … (3)
  • For an input sequence x of length T, Eqn (1) gives the probability of an output transcription y; a is a CTC alignment of y
  • Given a target transcription y*, the CTC objective function to be minimised is given in Eqn (2)
  • Modify the loss function as shown in Eqn (3) to be a better match to the final test criterion; here, L(x, y) is a transcription loss function and L(x) needs to be minimised: use a Monte Carlo sampling-based algorithm
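
A brute-force numpy sketch of Eqns (1) and (2) for a tiny example, making the collapsing map B explicit (real systems use the CTC forward-backward recursion rather than enumerating alignments):

```python
import itertools
import numpy as np

def B(alignment, blank=0):
    """CTC's collapsing map: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for a in alignment:
        if a != prev and a != blank:
            out.append(a)
        prev = a
    return tuple(out)

def ctc_neg_log_prob(frame_probs, target, blank=0):
    """Eqns (1)-(2) by enumeration: sum Pr(a|x) over every alignment a that
    collapses to the target (exponential in T, so only for toy examples).
    frame_probs: (T, S) array with Pr(a_t = s | x) per frame."""
    T, S = frame_probs.shape
    total = 0.0
    for a in itertools.product(range(S), repeat=T):
        if B(a, blank) == tuple(target):
            total += np.prod([frame_probs[t, a[t]] for t in range(T)])
    return -np.log(total)

# Toy check: 4 frames, 3 symbols (0 = blank), target sequence [1, 2]
probs = np.full((4, 3), 1.0 / 3.0)
print(ctc_neg_log_prob(probs, [1, 2]))
```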

  18. Decoding
  • First approximation: for a given test input sequence x, pick the most probable output at each time step:
      argmax_y Pr(y | x) ≈ B(argmax_a Pr(a | x))
  • More accurate decoding uses a search algorithm that also makes use of a dictionary and a language model. (Decoding search algorithms will be discussed in detail in later lectures.)
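
A minimal sketch of this first approximation (best-path decoding): take the frame-wise argmax and collapse it with B:

```python
import numpy as np

def greedy_decode(frame_probs, blank=0):
    """Pick the most probable symbol at every frame independently, then apply
    the collapsing map B (merge repeats, drop blanks)."""
    best_path = np.argmax(frame_probs, axis=1)   # argmax_a Pr(a | x), frame by frame
    out, prev = [], None
    for a in best_path:
        if a != prev and a != blank:
            out.append(int(a))
        prev = a
    return out

# e.g. frame-wise argmaxes [0, 1, 1, 0, 2, 2] collapse to the output [1, 2]
```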

  19. WER results
    System     LM               WER
    RNN-CTC    Dictionary only  24.0
    RNN-CTC    Bigram           10.4
    RNN-CTC    Trigram           8.7
    RNN-WER    Dictionary only  21.9
    RNN-WER    Bigram            9.8
    RNN-WER    Trigram           8.2
    Baseline   Bigram            9.4
    Baseline   Trigram           7.8
  [G14] A. Graves, N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks", ICML, 2014.

  20. Some erroneous examples produced by the end-to-end RNN
  Target: "There's unrest but we're not going to lose them to Dukakis"
  Output: "There's unrest but we're not going to lose them to Dekakis"
  Target: "T. W. A. also plans to hang its boutique shingle in airports at Lambert Saint"
  Output: "T. W. A. also plans tohing its bootik single in airports at Lambert Saint"
