
Lecture 14: Advanced Neural Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom. Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA. {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com. 27th April 2016


  1. Lecture 14 Advanced Neural Networks Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom Watson Group IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com 27th April 2016

  2. Variants of Neural Network Architectures: Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) (unidirectional, bidirectional), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), constraints and regularization, attention model. 2 / 72

  3. Training. Observations and labels $(x_n, a_n) \in \mathbb{R}^D \times A$ for $n = 1, \ldots, N$. Training criteria:
  Cross-entropy: $\mathcal{F}_{\text{CE}}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log P(a_n \mid x_n, \theta)$
  Sequence loss: $\mathcal{F}_{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{\omega} \sum_{a_1^{T_n} \in \omega} P(a_1^{T_n} \mid x_1^{T_n}, \theta) \cdot L(\omega, \omega_n)$ with loss $L$.
  Optimization: $\hat{\theta} = \arg\min_{\theta} \{\mathcal{F}(\theta)\}$
  $\theta, \hat{\theta}$: free parameters of the model (NN, GMM). $\omega, \omega_n$: word sequences. 3 / 72
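As a rough illustration of the frame-level criterion $\mathcal{F}_{\text{CE}}$ above, here is a minimal NumPy sketch; the array names (posteriors, labels) and the toy values are invented for the example and are not from the slides.

```python
import numpy as np

def cross_entropy_criterion(posteriors, labels):
    """F_CE = -1/N * sum_n log P(a_n | x_n, theta).

    posteriors: (N, |A|) array of network outputs P(a | x_n, theta),
                one softmax distribution per frame.
    labels:     (N,) array of target state indices a_n.
    """
    N = posteriors.shape[0]
    # Pick the probability assigned to the correct state of every frame.
    correct = posteriors[np.arange(N), labels]
    return -np.mean(np.log(correct))

# Toy usage: 3 frames, 4 HMM states.
post = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.6, 0.1, 0.1],
                 [0.1, 0.1, 0.1, 0.7]])
print(cross_entropy_criterion(post, np.array([0, 1, 3])))
```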

  4. Recap: Gaussian Mixture Model.
  $P(x_1^T \mid \omega) = \sum_{a_1^T \in \omega} \prod_{t=1}^{T} P(x_t \mid a_t)\, P(a_t \mid a_{t-1})$
  $\omega$: word sequence; $x_1^T := x_1, \ldots, x_T$: feature sequence; $a_1^T := a_1, \ldots, a_T$: HMM state sequence.
  Emission probability $P(x \mid a) \sim \mathcal{N}(\mu_a, \Sigma_a)$ Gaussian.
  Replace with a neural network ⇒ hybrid model.
  Use neural network for feature extraction ⇒ bottleneck features. 4 / 72
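To make the emission term concrete, here is a small sketch of the (log) Gaussian density $P(x \mid a) = \mathcal{N}(x; \mu_a, \Sigma_a)$ for a single state, assuming a diagonal covariance; the names are illustrative, and a full GMM would sum several weighted components of this form per state.

```python
import numpy as np

def log_gaussian_emission(x, mu_a, var_a):
    """log N(x; mu_a, diag(var_a)) for one HMM state a (diagonal covariance)."""
    D = x.shape[0]
    return -0.5 * (D * np.log(2 * np.pi)
                   + np.sum(np.log(var_a))
                   + np.sum((x - mu_a) ** 2 / var_a))

# One 3-dimensional feature vector scored against one state's Gaussian.
x = np.array([0.2, -1.0, 0.5])
print(log_gaussian_emission(x, mu_a=np.zeros(3), var_a=np.ones(3)))
```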

  5. Hybrid Model. Gaussian Mixture Model:
  $P(x_1^T \mid \omega) = \sum_{a_1^T \in \omega} \prod_{t=1}^{T} \underbrace{P(x_t \mid a_t)}_{\text{emission}}\, \underbrace{P(a_t \mid a_{t-1})}_{\text{transition}}$
  Training: A neural network usually models $P(a \mid x)$.
  Recognition: Use as a hybrid model for speech recognition:
  $P(x \mid a) = \frac{P(x, a)}{P(a)} = \frac{P(a \mid x)\, P(x)}{P(a)} \propto \frac{P(a \mid x)}{P(a)}$
  $P(a \mid x) / P(a)$ and $P(x \mid a)$ are proportional, since $P(x)$ does not depend on the state $a$. 5 / 72
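A minimal sketch of how this conversion is typically applied in practice: divide the network posteriors by the state priors (subtract in the log domain) to obtain scaled likelihoods. The function name and the way the priors are obtained are assumptions for the example, not taken from the slides.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Hybrid model: log P(x | a) = log P(a | x) - log P(a) + const.

    log_posteriors: (T, |A|) frame-wise log P(a | x_t) from the network.
    log_priors:     (|A|,)   log P(a), e.g. relative state frequencies
                    counted on the training alignment.
    The constant log P(x_t) is dropped; it does not affect the argmax.
    """
    return log_posteriors - log_priors

# Toy example: 2 frames, 3 states.
log_post = np.log(np.array([[0.8, 0.1, 0.1],
                            [0.3, 0.4, 0.3]]))
log_prior = np.log(np.array([0.5, 0.3, 0.2]))
print(scaled_log_likelihoods(log_post, log_prior))
```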

  6. Hybrid Model and Bayes Decision Rule.
  $\hat{\omega} = \arg\max_{\omega} \left\{ P(\omega)\, P(x_1^T \mid \omega) \right\}$
  $= \arg\max_{\omega} \left\{ P(\omega) \sum_{a_1^T \in \omega} \prod_{t=1}^{T} \frac{P(x_t \mid a_t)}{P(x_t)}\, P(a_t \mid a_{t-1}) \right\}$
  $= \arg\max_{\omega} \left\{ P(\omega) \sum_{a_1^T \in \omega} \frac{\prod_{t=1}^{T} P(x_t \mid a_t)\, P(a_t \mid a_{t-1})}{\prod_{t=1}^{T} P(x_t)} \right\}$
  $= \arg\max_{\omega} \left\{ P(\omega) \sum_{a_1^T \in \omega} \prod_{t=1}^{T} P(x_t \mid a_t)\, P(a_t \mid a_{t-1}) \right\}$
  The factor $\prod_{t=1}^{T} P(x_t)$ is the same for every $\omega$, so using the scaled likelihoods $P(x_t \mid a_t) / P(x_t)$ from the network does not change the maximizing word sequence. 6 / 72

  7. Where Are We? 1 Recap: Deep Neural Network, 2 Multilingual Bottleneck Features, 3 Convolutional Neural Networks, 4 Recurrent Neural Networks, 5 Unstable Gradient Problem, 6 Attention-based End-to-End ASR. 7 / 72

  8. Recap: Deep Neural Network (DNN). A DNN is a feed-forward network. It consists of an input layer, multiple hidden layers, and an output layer. Each hidden and output layer consists of nodes. 8 / 72

  9. Recap: Deep Neural Network (DNN). Free parameters: weights W and biases b. The output of a layer is the input to the next layer. Each node applies a linear transformation followed by a non-linear activation to its input. The output layer relates the output of the last hidden layer to the target states. 9 / 72

  10. Neural Network Layer. Number of nodes: $n_l$ in layer $l$. Input from the previous layer: $y^{(l-1)} \in \mathbb{R}^{n_{l-1}}$. Weight and bias: $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$, $b^{(l)} \in \mathbb{R}^{n_l}$. Activation: $y^{(l)} = \sigma(W^{(l)} \cdot y^{(l-1)} + b^{(l)})$, a linear map followed by a non-linear activation $\sigma$. 10 / 72
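A minimal NumPy sketch of one such layer, $y^{(l)} = \sigma(W^{(l)} y^{(l-1)} + b^{(l)})$, here with a sigmoid non-linearity; the sizes and names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(y_prev, W, b, activation=sigmoid):
    """One layer: linear map W @ y_prev + b followed by a non-linearity.

    y_prev: (n_{l-1},) output of the previous layer.
    W:      (n_l, n_{l-1}) weight matrix, b: (n_l,) bias.
    """
    return activation(W @ y_prev + b)

# Layer with 4 inputs and 3 output nodes.
rng = np.random.default_rng(0)
y = layer_forward(rng.normal(size=4), rng.normal(size=(3, 4)), np.zeros(3))
print(y.shape)  # (3,)
```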

  11. Deep Neural Network (DNN) 11 / 72

  12. Activation Function Zoo.
  Sigmoid: $\sigma_{\text{sigmoid}}(y) = \frac{1}{1 + \exp(-y)}$
  Hyperbolic tangent: $\sigma_{\tanh}(y) = \tanh(y) = 2\,\sigma_{\text{sigmoid}}(2y) - 1$
  Rectified Linear Unit (ReLU): $\sigma_{\text{relu}}(y) = \begin{cases} y, & y > 0 \\ 0, & y \le 0 \end{cases}$
  12 / 72

  13. Activation Function Zoo.
  Parametric ReLU (PReLU): $\sigma_{\text{prelu}}(y) = \begin{cases} y, & y > 0 \\ a \cdot y, & y \le 0 \end{cases}$
  Exponential Linear Unit (ELU): $\sigma_{\text{elu}}(y) = \begin{cases} y, & y > 0 \\ a \cdot (\exp(y) - 1), & y \le 0 \end{cases}$
  Maxout: $\sigma_{\text{maxout}}(y^{(l-1)}) = \max_{i} \left\{ W_1 \cdot y^{(l-1)} + b_1, \ldots, W_I \cdot y^{(l-1)} + b_I \right\}$
  Softmax: $\sigma_{\text{softmax}}(y) = \left( \frac{\exp(y_1)}{Z(y)}, \ldots, \frac{\exp(y_I)}{Z(y)} \right)^T$ with $Z(y) = \sum_{j} \exp(y_j)$
  13 / 72
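For reference, small NumPy versions of several of the activations listed above (ReLU, PReLU, ELU, softmax), with parameter values chosen only for illustration; maxout, which takes the element-wise maximum over several affine maps, is omitted for brevity.

```python
import numpy as np

def relu(y):
    return np.maximum(y, 0.0)

def prelu(y, a=0.25):
    return np.where(y > 0, y, a * y)

def elu(y, a=1.0):
    return np.where(y > 0, y, a * (np.exp(y) - 1.0))

def softmax(y):
    e = np.exp(y - np.max(y))   # shift for numerical stability
    return e / e.sum()

y = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(y), prelu(y), elu(y), softmax(y), sep="\n")
```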

  14. Where Are We? 1 Recap: Deep Neural Network, 2 Multilingual Bottleneck Features, 3 Convolutional Neural Networks, 4 Recurrent Neural Networks, 5 Unstable Gradient Problem, 6 Attention-based End-to-End ASR. 14 / 72

  15. Multilingual Bottleneck 15 / 72

  16. Multilingual Bottleneck. Encoder-decoder architecture: a DNN with a bottleneck layer. Forces a low-dimensional representation of speech shared across multiple languages. Utterances from the different languages are presented to the network in random order. Training: labels from the different languages. Recognition: the network is cut off after the bottleneck. 16 / 72
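A schematic NumPy forward pass of such a bottleneck network with language-specific output layers, just to make the architecture concrete; all layer sizes, the tanh non-linearity, the four languages, and the 1000-state outputs are invented for this sketch. During training, the softmax head matching the current utterance's language is used; at recognition time the network is cut after the bottleneck and the 42-dimensional activations serve as features.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [40, 512, 512, 42, 512]               # input, hidden, hidden, bottleneck, hidden
shared = [(0.01 * rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
heads = {lang: (0.01 * rng.normal(size=(1000, sizes[-1])), np.zeros(1000))
         for lang in ["FR", "EN", "DE", "PL"]}   # language-specific softmax layers

def forward(x, lang=None, cut_at_bottleneck=False):
    """Shared stack with a 42-dim bottleneck, then a language-specific softmax head."""
    h = x
    for i, (W, b) in enumerate(shared):
        h = np.tanh(W @ h + b)
        if cut_at_bottleneck and i == 2:      # layer index 2 is the bottleneck
            return h                          # recognition: bottleneck features for a GMM
    W, b = heads[lang]
    z = W @ h + b
    e = np.exp(z - z.max())
    return e / e.sum()                        # training: posteriors over this language's states

x = rng.normal(size=40)
print(forward(x, lang="EN").shape)               # (1000,)
print(forward(x, cut_at_bottleneck=True).shape)  # (42,)
```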

  17. Why Multilingual Bottlenecks? Train multilingual bottleneck features on large amounts of data. Later use: bottleneck features on different tasks to train a GMM system. No expensive DNN training, but WER gains similar to a DNN. 17 / 72

  18. Multilingual Bottleneck: Performance
  WER [%]
  Model                     FR     EN     DE     PL
  MFCC                      23.6   28.6   23.3   18.1
  MLP BN targets            19.3   23.1   19.0   14.5
  MLP BN multi              18.7   21.3   17.9   14.0
  deep BN targets           17.4   20.3   17.3   13.0
  deep BN multi             17.1   19.7   16.4   12.6
  +lang.dep. hidden layer   16.8   19.7   16.2   12.4
  18 / 72

  19. More Fancy Models Convolutional Neural Networks. Recurrent Neural Networks: Long Short-Term Memory (LSTM) RNNs, Gated Recurrent Unit (GRU) RNNs. Unstable Gradient Problem. 19 / 72

  20. Where Are We? 1 Recap: Deep Neural Network, 2 Multilingual Bottleneck Features, 3 Convolutional Neural Networks, 4 Recurrent Neural Networks, 5 Unstable Gradient Problem, 6 Attention-based End-to-End ASR. 20 / 72

  21. Convolutional Neural Networks (CNNs). Convolution (remember signal analysis?): $(x_1 * x_2)[k] = \sum_{i} x_1[k - i] \cdot x_2[i]$ 21 / 72

  22. Convolutional Neural Networks (CNNs). Convolution (remember signal analysis?): $(x_1 * x_2)[k] = \sum_{i} x_1[k - i] \cdot x_2[i]$ 22 / 72
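A direct, unoptimized NumPy implementation of the convolution sum above ("full" mode, i.e. every output position where the two signals overlap); the signal and kernel values are made up for the example, and np.convolve is shown only as a cross-check giving the same result.

```python
import numpy as np

def conv1d(x1, x2):
    """(x1 * x2)[k] = sum_i x1[k - i] * x2[i], 'full' discrete convolution."""
    n = len(x1) + len(x2) - 1
    out = np.zeros(n)
    for k in range(n):
        for i in range(len(x2)):
            if 0 <= k - i < len(x1):
                out[k] += x1[k - i] * x2[i]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.25, 0.5, 0.25])     # simple smoothing kernel
print(conv1d(x, h))
print(np.convolve(x, h))            # same result with NumPy's built-in
```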

  23. Convolutional Neural Networks (CNNs) 23 / 72

  24. CNNs. Consist of multiple feature maps with channels and kernels. Kernels are convolved across the input. Multidimensional input: 1D (frequency), 2D (time-frequency), 3D (time-frequency-?). Neurons are connected to local receptive fields of the input. Weights are shared across the receptive fields. 24 / 72

  25. Formal Definition: Convolutional Neural Networks.
  Free parameters: feature maps $W_n \in \mathbb{R}^{C \times k}$ and biases $b_n \in \mathbb{R}^{k}$ for $n = 1, \ldots, N$; $c = 1, \ldots, C$ channels; $k \in \mathbb{N}$ kernel size.
  Activation function: $y_{n,i} = \sigma\left( W_{n,i} * x_i + b_n \right) = \sigma\left( \sum_{c=1}^{C} \sum_{j=i-k}^{i+k} W_{n,c,i-j}\, x_{c,j} + b_n \right)$
  25 / 72
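A slow but explicit sketch of one feature map's activation, written as a cross-correlation with zero padding at the borders, which is how CNN layers are usually implemented; relative to the convolution formula above this only relabels the kernel indices. The shapes, names, and tanh non-linearity are assumptions for the example.

```python
import numpy as np

def conv_feature_map(x, W_n, b_n, sigma=np.tanh):
    """y_{n,i} = sigma( sum_c sum_j W_n[c, j] * x[c, i + j - k] + b_n ).

    x:   (C, L) multi-channel input (e.g. log-Mel bands over time).
    W_n: (C, 2k+1) kernel of feature map n, shared across all positions i.
    """
    C, L = x.shape
    k = W_n.shape[1] // 2
    y = np.zeros(L)
    for i in range(L):
        acc = b_n
        for c in range(C):
            for j in range(-k, k + 1):
                if 0 <= i + j < L:          # zero padding outside the input
                    acc += W_n[c, j + k] * x[c, i + j]
        y[i] = sigma(acc)
    return y

rng = np.random.default_rng(1)
print(conv_feature_map(rng.normal(size=(3, 10)), rng.normal(size=(3, 5)), 0.0).shape)
```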

  26. Pooling.
  Max-pooling: $\text{pool}(y_{n,c,i}) = \max_{j = i-k, \ldots, i+k} \{ y_{n,c,j} \}$
  Average-pooling: $\text{average}(y_{n,c,i}) = \frac{1}{2k+1} \sum_{j=i-k}^{i+k} y_{n,c,j}$
  26 / 72
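A minimal sketch of both pooling operations over a window of +/-k positions; real CNNs usually pool with a stride, which is omitted here, and at the borders the window is simply clipped.

```python
import numpy as np

def pool1d(y, k, reduce=np.max):
    """Pool over a window of +/-k positions around every index i.

    reduce=np.max  gives max-pooling,
    reduce=np.mean gives average-pooling (1/(2k+1) * sum, up to border clipping).
    """
    L = len(y)
    return np.array([reduce(y[max(0, i - k): i + k + 1]) for i in range(L)])

y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
print(pool1d(y, k=1, reduce=np.max))    # max-pooling
print(pool1d(y, k=1, reduce=np.mean))   # average-pooling
```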

  27. CNN vs. DNN: Performance. GMM and DNN use fMLLR features. CNNs use log-Mel features, which have local structure, as opposed to speaker-normalized features.
  Table: Broadcast News 50 h (WER [%]).
  Model     CE     ST
  GMM       18.8   n/a
  DNN       16.2   14.9
  CNN       15.8   13.9
  CNN+DNN   15.1   13.2
  Table: Broadcast conversation 2k h (WER [%]).
  Model     CE     ST
  DNN       11.7   10.3
  CNN       12.6   10.4
  DNN+CNN   11.3   9.6
  27 / 72

  28. VGG. Configurations by number of feature maps (# Fmaps); columns: Classic [16, 17, 18], VB(X), VC(X), VD(X), WD(X).
  64:  conv(3,64), conv(3,64), conv(3,64), conv(3,64); conv(64,64), conv(64,64), conv(64,64), conv(64,64); pool 1x3, pool 1x2, pool 1x2, pool 1x2
  128: conv(64,128), conv(64,128), conv(64,128), conv(64,128); conv(128,128), conv(128,128), conv(128,128), conv(128,128); pool 2x2, pool 2x2, pool 1x2, pool 1x2
  256: conv(128,256), conv(128,256), conv(128,256); conv(256,256), conv(256,256), conv(256,256), conv(256,256); pool 1x2, pool 2x2, pool 2x2
  512: conv9x9(3,512), conv(256,512), conv(256,512), pool 1x3, conv(512,512), conv(512,512), conv3x4(512,512), conv(512,512), pool 2x2, pool 2x2
  Fully connected: FC 2048, FC 2048, (FC 2048), FC output size, Softmax.
  28 / 72

  29. VGG. [Figure: multilingual VGG architecture with shared conv/pool layers over input contexts of +/-5, +/-10 (stride 2), and +/-20 (stride 4) frames, followed by language-specific FC layers and Softmax outputs for KUR, TOK, CEB, KAZ, TEL, LIT.] 29 / 72

  30. VGG Performance
  Model                      WER    # params (M)   # M frames
  Classic 512 [17]           13.2   41.2           1200
  Classic 256 ReLU (A+S)     13.8   58.7           290
  VCX (6 conv) (A+S)         13.1   36.9           290
  VDX (8 conv) (A+S)         12.3   38.4           170
  WDX (10 conv) (A+S)        12.2   41.3           140
  VDX (8 conv) (S)           11.9   38.4           340
  WDX (10 conv) (S)          11.8   41.3           320
  30 / 72

  31. Where Are We? 1 Recap: Deep Neural Network, 2 Multilingual Bottleneck Features, 3 Convolutional Neural Networks, 4 Recurrent Neural Networks, 5 Unstable Gradient Problem, 6 Attention-based End-to-End ASR. 31 / 72

  32. Recurrent Neural Networks (RNNs). DNNs are deep in layers. RNNs are, in addition, deep in time. Weights and biases are shared across time steps. 32 / 72
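A minimal NumPy sketch of unfolding a simple (Elman-style) RNN in time: the same weight matrices are reused at every step, which is the "shared weights and biases across time steps" point above. The sizes and names are illustrative; LSTM and GRU cells replace the plain tanh update with gated updates.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h, h0=None):
    """Unfold a simple RNN over time with shared weights:
    h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h).
    """
    H = W_hh.shape[0]
    h = np.zeros(H) if h0 is None else h0
    hs = []
    for x_t in x_seq:                 # same W_xh, W_hh, b_h at every time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)               # (T, H) hidden states

rng = np.random.default_rng(0)
T, D, H = 6, 3, 4
states = rnn_forward(rng.normal(size=(T, D)),
                     0.1 * rng.normal(size=(H, D)),
                     0.1 * rng.normal(size=(H, H)),
                     np.zeros(H))
print(states.shape)  # (6, 4)
```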

  33. Unfolded RNN 33 / 72

  34. DNN vs. RNN 34 / 72
