SLIDE 1

Lecture 14

Advanced Neural Networks

Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom

Watson Group IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com

27th April 2016

SLIDE 2

Variants of Neural Network Architectures

Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) (unidirectional and bidirectional), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), constraints and regularization, attention models.

SLIDE 3

Training

Observations and labels (x_n, a_n) ∈ ℝ^D × A for n = 1, …, N.

Training criteria:

F_CE(θ) = −(1/N) Σ_{n=1}^{N} log P(a_n | x_n, θ)

F_L(θ) = (1/N) Σ_{n=1}^{N} Σ_{ω} Σ_{a_1^{T_n} ∈ ω} P(a_1^{T_n} | x_1^{T_n}, θ) · L(ω, ω_n)

with loss L.

Optimization: θ̂ = argmin_θ {F(θ)}

θ, θ̂: free parameters of the model (NN, GMM). ω, ω_n: word sequences.
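As a concrete reading of the cross-entropy criterion above, here is a minimal numpy sketch (illustrative, not from the lecture); `posteriors` stands for the network outputs P(a | x_n) and `labels` for the targets a_n:

```python
import numpy as np

# F_CE(theta) = -(1/N) * sum_n log P(a_n | x_n, theta)
def cross_entropy_criterion(posteriors: np.ndarray, labels: np.ndarray) -> float:
    """posteriors: (N, |A|) softmax outputs; labels: (N,) target state indices."""
    n = posteriors.shape[0]
    # Select P(a_n | x_n) for each observation and average the negative logs.
    return float(-np.mean(np.log(posteriors[np.arange(n), labels])))

# Example: 3 observations, 4 target states.
post = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.6, 0.1, 0.1],
                 [0.1, 0.1, 0.1, 0.7]])
print(cross_entropy_criterion(post, np.array([0, 1, 3])))  # ~0.408
```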

SLIDE 4

Recap: Gaussian Mixture Model

Recap Gaussian Mixture Model:

P(ω | x_1^T) = Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) · P(a_t | a_{t−1})

ω: word sequence
x_1^T := x_1, …, x_T: feature sequence
a_1^T := a_1, …, a_T: HMM state sequence

Emission probability P(x|a) ∼ N(µ_a, Σ_a) Gaussian.
Replace with a neural network ⇒ hybrid model.
Use a neural network for feature extraction ⇒ bottleneck features.

SLIDE 5

Hybrid Model

Gaussian Mixture Model:

P(ω | x_1^T) = Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) · P(a_t | a_{t−1})
                              (emission)    (transition)

Training: the neural network models the posterior P(a|x).
Recognition: use as a hybrid model for speech recognition:

P(a|x) / P(a) = P(x, a) / (P(x) · P(a)) = P(x|a) / P(x)

P(x|a)/P(x) and P(x|a) are proportional, since P(x) does not depend on the state a.
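A small numpy sketch of how these scaled likelihoods are computed in practice (names and the flooring constant are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Convert network posteriors P(a|x) into scaled likelihoods
# P(x|a)/P(x) = P(a|x)/P(a) for hybrid decoding.
def scaled_log_likelihoods(posteriors: np.ndarray, priors: np.ndarray,
                           floor: float = 1e-10) -> np.ndarray:
    """posteriors: (T, |A|) per-frame softmax outputs; priors: (|A|,) state priors P(a)."""
    return np.log(np.maximum(posteriors, floor)) - np.log(priors)
```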

SLIDE 6

Hybrid Model and Bayes Decision Rule

ŵ = argmax_ω { P(ω) · P(x_1^T | ω) }

  = argmax_ω { P(ω) · Σ_{a_1^T ∈ ω} Π_{t=1}^{T} [ P(x_t | a_t) / P(x_t) ] · P(a_t | a_{t−1}) }

  = argmax_ω { P(ω) · [ Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) · P(a_t | a_{t−1}) ] / [ Π_{t=1}^{T} P(x_t) ] }

  = argmax_ω { P(ω) · Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) · P(a_t | a_{t−1}) }

The factor Π_{t=1}^{T} P(x_t) can be dropped because it does not depend on ω.

SLIDE 7

Where Are We?

1. Recap: Deep Neural Network
2. Multilingual Bottleneck Features
3. Convolutional Neural Networks
4. Recurrent Neural Networks
5. Unstable Gradient Problem
6. Attention-based End-to-End ASR

SLIDE 8

Recap: Deep Neural Network (DNN)

Feed-forward networks first. A DNN consists of an input layer, multiple hidden layers, and an output layer. Each hidden and output layer consists of nodes.

SLIDE 9

Recap: Deep Neural Network (DNN)

Free parameters: weights W and biases b. The output of a layer is the input to the next layer. Each node performs a linear transformation followed by a non-linear activation on its input. The output layer relates the output of the last hidden layer to the target states.

SLIDE 10

Neural Network Layer

Number of nodes: n_l in layer l.
Input from the previous layer: y^(l−1) ∈ ℝ^{n_{l−1}}.
Weight and bias: W^(l) ∈ ℝ^{n_{l−1} × n_l}, b^(l) ∈ ℝ^{n_l}.
Activation: y^(l) = σ( W^(l) · y^(l−1) + b^(l) ), i.e. a linear map followed by the non-linear σ.
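In code, one layer is a single matrix-vector product plus a non-linearity. A minimal sketch (illustrative names; W is stored as (n_l, n_{l−1}) so the product applies directly):

```python
import numpy as np

def sigmoid(y: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-y))

# y_l = sigma(W_l . y_{l-1} + b_l): linear map, then non-linear activation.
def layer_forward(y_prev: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """y_prev: (n_{l-1},); W: (n_l, n_{l-1}); b: (n_l,)."""
    return sigmoid(W @ y_prev + b)
```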

SLIDE 11

Deep Neural Network (DNN)

SLIDE 12

Activation Function Zoo

Sigmoid:
  σ_sigmoid(y) = 1 / (1 + exp(−y))
Hyperbolic tangent:
  σ_tanh(y) = tanh(y) = 2 · σ_sigmoid(2y) − 1
REctified Linear Unit (ReLU):
  σ_relu(y) = y if y > 0, 0 if y ≤ 0

SLIDE 13

Activation Function Zoo

Parametric ReLU (PReLU):
  σ_prelu(y) = y if y > 0, a · y if y ≤ 0
Exponential Linear Unit (ELU):
  σ_elu(y) = y if y > 0, a · (exp(y) − 1) if y ≤ 0
Maxout:
  σ_maxout(y_1, …, y_I) = max_i { W_i · y^(l−1) + b_i }
Softmax:
  σ_softmax(y) = ( exp(y_1)/Z(y), …, exp(y_I)/Z(y) )^T with Z(y) = Σ_j exp(y_j)
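The whole zoo fits in a few lines of numpy; a sketch (the slope parameter `a` is the PReLU/ELU constant from the slides):

```python
import numpy as np

def relu(y):     return np.maximum(0.0, y)
def prelu(y, a): return np.where(y > 0, y, a * y)
def elu(y, a):   return np.where(y > 0, y, a * (np.exp(y) - 1.0))

def softmax(y):
    e = np.exp(y - np.max(y))  # subtract the max for numerical stability
    return e / e.sum()         # normalize by Z(y)
```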

SLIDE 14

Where Are We?

1. Recap: Deep Neural Network
2. Multilingual Bottleneck Features
3. Convolutional Neural Networks
4. Recurrent Neural Networks
5. Unstable Gradient Problem
6. Attention-based End-to-End ASR

SLIDE 15

Multilingual Bottleneck

SLIDE 16

Multilingual Bottleneck

Encoder-decoder architecture: a DNN with a bottleneck. Forces a low-dimensional representation of speech across multiple languages. Several languages are presented to the network randomly. Training: labels from different languages. Recognition: the network is cut off after the bottleneck.

SLIDE 17

Why Multilingual Bottlenecks?

Train multilingual bottleneck features with lots of data. Future use: bottleneck features on different tasks to train a GMM system. No expensive DNN training, but WER gains similar to a DNN.

SLIDE 18

Multilingual Bottleneck: Performance

                           WER [%]
Model                      FR    EN    DE    PL
MFCC                       23.6  28.6  23.3  18.1
MLP BN targets             19.3  23.1  19.0  14.5
MLP BN multi               18.7  21.3  17.9  14.0
deep BN targets            17.4  20.3  17.3  13.0
deep BN multi              17.1  19.7  16.4  12.6
+ lang.-dep. hidden layer  16.8  19.7  16.2  12.4

SLIDE 19

More Fancy Models

Convolutional Neural Networks. Recurrent Neural Networks: Long Short-Term Memory (LSTM) RNNs, Gated Recurrent Unit (GRU) RNNs. Unstable Gradient Problem.

SLIDE 20

Where Are We?

1. Recap: Deep Neural Network
2. Multilingual Bottleneck Features
3. Convolutional Neural Networks
4. Recurrent Neural Networks
5. Unstable Gradient Problem
6. Attention-based End-to-End ASR

SLIDE 21

Convolutional Neural Networks (CNNs)

Convolution (remember signal analysis?):

(x_1 ∗ x_2)[k] = Σ_i x_1[k − i] · x_2[i]
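A direct transcription of this sum (a teaching sketch; in practice one would call np.convolve, which computes the same thing):

```python
import numpy as np

# (x1 * x2)[k] = sum_i x1[k - i] * x2[i]
def convolve(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    out = np.zeros(len(x1) + len(x2) - 1)
    for k in range(len(out)):
        for i in range(len(x2)):
            if 0 <= k - i < len(x1):
                out[k] += x1[k - i] * x2[i]
    return out

# np.allclose(convolve(a, b), np.convolve(a, b)) holds for 1-D arrays a, b.
```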

SLIDE 22

Convolutional Neural Networks (CNNs)

Convolution (remember signal analysis?):

(x_1 ∗ x_2)[k] = Σ_i x_1[k − i] · x_2[i]

SLIDE 23

Convolutional Neural Networks (CNNs)

SLIDE 24

CNNs

CNNs consist of multiple feature maps with channels and kernels. Kernels are convolved across the input. Multidimensional input: 1D (frequency), 2D (time-frequency), 3D (time-frequency-?). Neurons are connected to local receptive fields of the input. Weights are shared across multiple receptive fields.

SLIDE 25

Formal Definition: Convolutional Neural Networks

Free parameters: feature maps W_n ∈ ℝ^{C×k} and biases b_n ∈ ℝ^k for n = 1, …, N, with c = 1, …, C channels and kernel size k ∈ ℕ.

Activation function:

y_{n,i} = σ( (W_n ∗ x)_i + b_n ) = σ( Σ_{c=1}^{C} Σ_{j=i−k}^{i+k} W_{n,c,i−j} · x_{c,j} + b_n )

SLIDE 26

Pooling

Max-pooling:

pool(y_{n,c,i}) = max_{j = i−k, …, i+k} { y_{n,c,j} }

Average-pooling:

average(y_{n,c,i}) = 1/(2k + 1) · Σ_{j=i−k}^{i+k} y_{n,c,j}
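Both pooling operators in a short numpy sketch (window edges are clipped to the valid range, an assumption the slide leaves open):

```python
import numpy as np

# Pool over a window of size 2k+1 around each position i.
def max_pool(y: np.ndarray, k: int) -> np.ndarray:
    return np.array([y[max(0, i - k): i + k + 1].max() for i in range(len(y))])

def avg_pool(y: np.ndarray, k: int) -> np.ndarray:
    return np.array([y[max(0, i - k): i + k + 1].mean() for i in range(len(y))])
```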

SLIDE 27

CNN vs. DNN: Performance

GMM and DNN use fMLLR features. CNNs use log-Mel features, which have local structure, as opposed to speaker-normalized features.

Table: Broadcast News 50 h.

          WER [%]
Model     CE    ST
GMM       18.8  n/a
DNN       16.2  14.9
CNN       15.8  13.9
CNN+DNN   15.1  13.2

Table: Broadcast conversation 2k h.

          WER [%]
Model     CE    ST
DNN       11.7  10.3
CNN       12.6  10.4
DNN+CNN   11.3  9.6

SLIDE 28

VGG

# Fmaps | Classic [16, 17, 18] | VB(X)         | VC(X)         | VD(X)         | WD(X)
64      |                      | conv(3,64)    | conv(3,64)    | conv(3,64)    | conv(3,64)
        |                      | conv(64,64)   | conv(64,64)   | conv(64,64)   | conv(64,64)
        |                      | pool 1x3      | pool 1x2      | pool 1x2      | pool 1x2
128     |                      | conv(64,128)  | conv(64,128)  | conv(64,128)  | conv(64,128)
        |                      | conv(128,128) | conv(128,128) | conv(128,128) | conv(128,128)
        |                      | pool 2x2      | pool 2x2      | pool 1x2      | pool 1x2
256     |                      |               | conv(128,256) | conv(128,256) | conv(128,256)
        |                      |               | conv(256,256) | conv(256,256) | conv(256,256)
        |                      |               |               |               | conv(256,256)
        |                      |               | pool 1x2      | pool 2x2      | pool 2x2
512     | conv9x9(3,512)       |               |               | conv(256,512) | conv(256,512)
        | pool 1x3             |               |               | conv(512,512) | conv(512,512)
        | conv3x4(512,512)     |               |               |               | conv(512,512)
        |                      |               |               | pool 2x2      | pool 2x2

All configurations end with: FC 2048, FC 2048, (FC 2048), FC output size, Softmax.


SLIDE 29

VGG

[Figure: multilingual VGG. A shared conv/pool stack feeds language-dependent FC stacks and softmax outputs (KUR, TOK, CEB, KAZ, TEL, LIT). Input contexts: ±5; ±10 with stride 2; ±20 with stride 4.]

SLIDE 30

VGG Performance

Model                    WER   # params (M)  # frames (M)
Classic 512 [17]         13.2  41.2          1200
Classic 256 ReLU (A+S)   13.8  58.7          290
VCX (6 conv) (A+S)       13.1  36.9          290
VDX (8 conv) (A+S)       12.3  38.4          170
WDX (10 conv) (A+S)      12.2  41.3          140
VDX (8 conv) (S)         11.9  38.4          340
WDX (10 conv) (S)        11.8  41.3          320

SLIDE 31

Where Are We?

1. Recap: Deep Neural Network
2. Multilingual Bottleneck Features
3. Convolutional Neural Networks
4. Recurrent Neural Networks
5. Unstable Gradient Problem
6. Attention-based End-to-End ASR

SLIDE 32

Recurrent Neural Networks (RNNs)

DNNs are deep in layers. RNNs are deep in time (in addition). Shared weights and biases across time steps.

SLIDE 33

Unfolded RNN

SLIDE 34

DNN vs. RNN

SLIDE 35

Formal Definition: RNN

Input vector sequence: x_t ∈ ℝ^D, t = 1, …, T
Hidden outputs: h_t, t = 1, …, T
Free parameters:
  Input-to-hidden weight: W ∈ ℝ^{n_{l−1} × n_l}
  Hidden-to-hidden weight: R ∈ ℝ^{n_l × n_l}
  Bias: b ∈ ℝ^{n_l}
Output: iterate the equation for t = 1, …, T:
  h_t = σ(W · x_t + R · h_{t−1} + b)
Compare with the DNN: h_t = σ(W · x_t + b)
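The recursion as a loop over time; a minimal sketch with tanh standing in for the generic σ and an all-zero initial state (both assumptions):

```python
import numpy as np

# h_t = sigma(W x_t + R h_{t-1} + b), t = 1, ..., T
def rnn_forward(xs: np.ndarray, W: np.ndarray, R: np.ndarray, b: np.ndarray) -> np.ndarray:
    """xs: (T, D); W: (H, D); R: (H, H); b: (H,). Returns all hidden outputs (T, H)."""
    h = np.zeros(R.shape[0])          # h_0 = 0
    hs = []
    for x in xs:
        h = np.tanh(W @ x + R @ h + b)
        hs.append(h)
    return np.stack(hs)
```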

SLIDE 36

BackPropagation Through Time (BPTT)

Chain rule through time:

dF(θ)/dh_t = Σ_{τ=1}^{t−1} (dF(θ)/dh_τ) · (dh_τ/dh_t)

SLIDE 37

BackPropagation Through Time (BPTT)

Implementation: unfold the RNN over time through t = 1, …, T, forward propagate, then backpropagate the error through the unfolded network. Faster than other optimization methods (e.g. evolutionary search). Difficulty with local optima.

SLIDE 38

Bidirectional RNN (BRNN)

The forward RNN processes the data left to right, the backward RNN right to left. The output joins the outputs of the forward and backward RNNs.

SLIDE 39

Formal Definition: BRNN

Input vector sequence: x_t ∈ ℝ^D, t = 1, …, T
Forward and backward hidden outputs: →h_t, ←h_t, t = 1, …, T
Forward and backward free parameters:
  Input-to-hidden weights: →W, ←W ∈ ℝ^{n_{l−1} × n_l}
  Hidden-to-hidden weights: →R, ←R ∈ ℝ^{n_l × n_l}
  Biases: →b, ←b ∈ ℝ^{n_l}
Forward output, iterated for t = 1, …, T:
  →h_t = σ(→W · x_t + →R · →h_{t−1} + →b)
Backward output, iterated for t = T, …, 1:
  ←h_t = σ(←W · x_t + ←R · ←h_{t+1} + ←b)
Hidden outputs: h_t = [→h_t, ←h_t] for t = 1, …, T
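A bidirectional pass is just the unidirectional loop run twice; a sketch reusing the rnn_forward function from the RNN slide above (the per-frame concatenation as the join is an assumption consistent with the previous slide):

```python
import numpy as np

# Run one RNN left to right and a second one right to left, then join
# (concatenate) the two hidden sequences per frame.
def brnn_forward(xs, fwd_params, bwd_params):
    """xs: (T, D); each params tuple is (W, R, b) as in rnn_forward."""
    h_fwd = rnn_forward(xs, *fwd_params)              # t = 1, ..., T
    h_bwd = rnn_forward(xs[::-1], *bwd_params)[::-1]  # t = T, ..., 1, re-reversed
    return np.concatenate([h_fwd, h_bwd], axis=1)     # h_t = [->h_t, <-h_t]
```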

SLIDE 40

RNN using Memory Cells

Equip an RNN with a memory cell that can store information for a long time. Introduce gating units that control: activations going in, activations going out, saving activations, forgetting activations.

SLIDE 41

Long Short-Term Memory RNN

SLIDE 42

Formal Definition: LSTM

Input vector sequence: x_t ∈ ℝ^D, t = 1, …, T
Hidden outputs: h_t, t = 1, …, T
Iterate the equations for t = 1, …, T:

z_t = σ(W_z · x_t + R_z · h_{t−1} + b_z)                    (block input)
i_t = σ(W_i · x_t + R_i · h_{t−1} + P_i ⊙ c_{t−1} + b_i)    (input gate)
f_t = σ(W_f · x_t + R_f · h_{t−1} + P_f ⊙ c_{t−1} + b_f)    (forget gate)
c_t = i_t ⊙ z_t + f_t ⊙ c_{t−1}                             (cell state)
o_t = σ(W_o · x_t + R_o · h_{t−1} + P_o ⊙ c_t + b_o)        (output gate)
h_t = o_t ⊙ tanh(c_t)                                       (block output)
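One time step of these equations in numpy (a sketch: the parameter dict `p` is an illustrative convention, and tanh is used for the block input where the slide writes a generic σ):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# Shapes: W_* (H, D), R_* (H, H), P_* and b_* (H,); x (D,); h_prev, c_prev (H,).
def lstm_step(x, h_prev, c_prev, p):
    z = np.tanh(p["Wz"] @ x + p["Rz"] @ h_prev + p["bz"])                     # block input
    i = sigmoid(p["Wi"] @ x + p["Ri"] @ h_prev + p["Pi"] * c_prev + p["bi"])  # input gate
    f = sigmoid(p["Wf"] @ x + p["Rf"] @ h_prev + p["Pf"] * c_prev + p["bf"])  # forget gate
    c = i * z + f * c_prev                                                    # cell state
    o = sigmoid(p["Wo"] @ x + p["Ro"] @ h_prev + p["Po"] * c + p["bo"])       # output gate
    h = o * np.tanh(c)                                                        # block output
    return h, c
```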

SLIDE 43

LSTM: Too many connections?

Some of the connections in the LSTM are not necessary [1]. Peepholes do not seem to be necessary. Coupled input and forget gates. Simplified LSTM ⇒ Gated Recurrent Unit (GRU).

SLIDE 44

Gated Recurrent Unit (GRU)

References: [2, 3, 4]

SLIDE 45

Formal Definition: GRU

Input vector sequence: x_t ∈ ℝ^D, t = 1, …, T
Hidden outputs: h_t, t = 1, …, T
Iterate the equations for t = 1, …, T:

r_t = σ(W_r · x_t + R_r · h_{t−1} + b_r)             (reset gate)
z_t = σ(W_z · x_t + R_z · h_{t−1} + b_z)             (update gate)
h̃_t = σ(W_h · x_t + R_h · (r_t ⊙ h_{t−1}) + b_h)    (candidate gate)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t               (output gate)
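The same step as code; a sketch in the style of the LSTM sketch above (tanh for the candidate, where the slide writes a generic σ):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def gru_step(x, h_prev, p):
    r = sigmoid(p["Wr"] @ x + p["Rr"] @ h_prev + p["br"])             # reset gate
    z = sigmoid(p["Wz"] @ x + p["Rz"] @ h_prev + p["bz"])             # update gate
    h_cand = np.tanh(p["Wh"] @ x + p["Rh"] @ (r * h_prev) + p["bh"])  # candidate
    return z * h_prev + (1.0 - z) * h_cand                            # gated interpolation
```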

SLIDE 46

CNN vs. DNN vs. RNN: Performance

GMM and DNN use fMLLR features. CNNs use log-Mel features, which have local structure, as opposed to speaker-normalized features.

Table: Broadcast News 50 h.

                WER [%]
Model           CE    ST
GMM             18.8  n/a
DNN             16.2  14.9
CNN             15.8  13.9
BGRU (fMLLR)    14.9  n/a
BLSTM (fMLLR)   14.8  n/a
BGRU (log-Mel)  14.1  n/a

SLIDE 47

DNN vs. CNN vs. RNN: Performance

GMM and DNN use fMLLR features. CNNs use log-Mel features, which have local structure, as opposed to speaker-normalized features.

Table: Broadcast Conversation 2000 h.

             WER [%]
Model        CE    ST
DNN          11.7  10.3
CNN          12.6  10.4
RNN          11.5  9.9
DNN+CNN      11.3  9.6
RNN+CNN      11.2  9.4
DNN+RNN+CNN  11.1  9.4

SLIDE 48

RNN Black Magic

Unrolling the RNN in training:
• over the whole utterance [5],
• vs. truncated BPTT with carryover [6]: split the utterance into subsequences of e.g. 21 frames; carry over the last cell state from the previous subsequence to the new subsequence; compose the minibatch from subsequences,
• vs. truncated BPTT with overlap: split the utterance into subsequences of e.g. 21 frames; overlap subsequences by 10 frames; compose the minibatch of subsequences from different utterances.

Gradient clipping of the LSTM cell.

SLIDE 49

RNN Black Magic

Recognition: unrolling the RNN
• over the whole utterance,
• vs. unrolling over subsequences: split the utterance into subsequences of e.g. 21 frames; carry over the last cell state from the previous subsequence to the new subsequence,
• vs. unrolling over a spectral window [7]: for each frame, unroll over the spectral window; the last RNN layer only returns the center/last frame.

SLIDE 50

Highway Network

References: [2, 3, 4]

SLIDE 51

Formal Definition: Highway Network

Input vector sequence: x_t ∈ ℝ^D, t = 1, …, T
Hidden outputs: h_t, t = 1, …, T
Iterate the equations for t = 1, …, T:

z_t = σ(W_z · x_t + b_z)              (highway gate)
h̃_t = σ(W_h · x_t + b_h)             (candidate gate)
h_t = z_t ⊙ x_t + (1 − z_t) ⊙ h̃_t    (output gate)
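A sketch of a single highway layer (tanh for the candidate transform, where the slide writes a generic σ; note that the gating requires x and h̃ to have the same dimension):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# h = z . x + (1 - z) . h_cand: the gate z decides how much of the input
# is carried through unchanged (the "highway").
def highway_layer(x, Wz, bz, Wh, bh):
    z = sigmoid(Wz @ x + bz)          # highway gate
    h_cand = np.tanh(Wh @ x + bh)     # candidate transform
    return z * x + (1.0 - z) * h_cand
```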

SLIDE 52

Formal Definition: Highway GRU

Input vector sequence: x_t ∈ ℝ^D, t = 1, …, T
Hidden outputs: h_t, t = 1, …, T
Iterate the equations for t = 1, …, T:

r_t = σ(W_r · x_t + R_r · h_{t−1} + b_r)                           (reset gate)
z_t = σ(W_z · x_t + R_z · h_{t−1} + b_z)                           (update gate)
d_t = σ(W_d · x_t + R_d · h_{t−1} + b_d)                           (highway gate)
h̃_t = σ(W_h · x_t + R_h · (r_t ⊙ h_{t−1}) + b_h)                  (candidate gate)
h_t = d_t ⊙ x_t + (1 − d_t) ⊙ (z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t)   (output gate)

SLIDE 53

Where Are We?

1. Recap: Deep Neural Network
2. Multilingual Bottleneck Features
3. Convolutional Neural Networks
4. Recurrent Neural Networks
5. Unstable Gradient Problem
6. Attention-based End-to-End ASR

SLIDE 54

Unstable Gradient Problem

Happens in deep as well as in recurrent neural networks.
If the gradient becomes very small ⇒ vanishing gradient.
If the gradient becomes very large ⇒ exploding gradient.
Simplified neural network (the w_i are just scalars):

F(w_1, …, w_N) = L(σ(y_N)) = L(σ(w_N · σ(y_{N−1}))) = L(σ(w_N · σ(w_{N−1} · … σ(w_1 · x_t) …)))

SLIDE 55

Unstable Gradient Problem, Constraints and Regularization

Gradient:

dF(w_1, …, w_N)/dw_1 = dL/dσ · dσ(w_N · σ(w_{N−1} · … σ(w_1 · x_t) …))/dw_1
                     = L′(σ(y_N)) · σ′(y_N) · w_N · σ′(y_{N−1}) · w_{N−1} · … · σ′(y_1) · x_t

If |w_i · σ′(y_i)| < 1 for i = 2, …, N ⇒ the gradient vanishes.
If |w_i · σ′(y_i)| ≫ 1 for i = 2, …, N ⇒ the gradient explodes.
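A quick numerical illustration (not from the slides): with sigmoid activations, |σ′(y)| ≤ 1/4, so the product of N such factors shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# prod_i |w_i * sigma'(y_i)|, the chain-rule product from the slide
# (up to the loss term and the input factor).
def gradient_magnitude(ws: np.ndarray, ys: np.ndarray) -> float:
    dsig = sigmoid(ys) * (1.0 - sigmoid(ys))
    return float(np.prod(np.abs(ws) * dsig))

ws = np.ones(30)   # 30 layers with weight 1.0
ys = np.zeros(30)  # sigma'(0) = 0.25
print(gradient_magnitude(ws, ys))  # 0.25**30 ~ 8.7e-19: vanished
```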

SLIDE 56

Solution: Unstable Gradient Problem

Gradient clipping.
Weight constraints.
Let the network save activations over layers/time steps:
  y_new = α · y_previous + (1 − α) · y_common
Long Short-Term Memory RNN.
Highway neural network (>100 layers).

SLIDE 57

Gradient Clipping

Keeps gradient weights in range.
One approach to deal with the exploding gradient problem.
Ensure the gradient is in the range [−c, c] for a constant c:

clip(dF/dθ, c) = min(c, max(−c, dF/dθ))
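As code, the clip is a single element-wise min/max (a direct sketch of the definition above):

```python
import numpy as np

# clip(dF/dtheta, c) = min(c, max(-c, dF/dtheta)), applied element-wise.
def clip_gradient(grad: np.ndarray, c: float) -> np.ndarray:
    return np.minimum(c, np.maximum(-c, grad))

# clip_gradient(np.array([-5.0, 0.3, 9.0]), 1.0) -> array([-1. ,  0.3,  1. ])
```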
SLIDE 58

Constraints (I)

Keep weights in range (e.g. for ReLU, Maxout). Ignored during gradient backpropagation; the constraints are enforced after the gradient update.

SLIDE 59

Constraints (II)

Max-norm: force ‖W‖₂ ≤ c for a constant c:
  W_max = W · min(‖W‖₂, c) / ‖W‖₂
Unity-norm: force ‖W‖₂ ≤ 1:
  W_unity = W / ‖W‖₂
Positivity-norm: force W ≥ 0:
  W₊ = max(0, W)
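The three projections in numpy (a sketch; each would be applied to the weights right after the gradient step):

```python
import numpy as np

def max_norm(W: np.ndarray, c: float) -> np.ndarray:
    norm = np.linalg.norm(W)
    return W * min(norm, c) / norm   # rescale only if the norm exceeds c

def unity_norm(W: np.ndarray) -> np.ndarray:
    return W / np.linalg.norm(W)

def positivity(W: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, W)
```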

SLIDE 60

Regularization: Dropout

Dropout: prevents getting stuck in a local optimum ⇒ avoids overfitting.

SLIDE 61

Regularization: Dropout

Dropout: prevents getting stuck in a local optimum ⇒ avoids overfitting.

SLIDE 62

Regularization: Dropout

Input vector sequence: x_t ∈ ℝ^D.
Choose z_t ∈ {0, 1}^D for t = 1, …, T according to a Bernoulli distribution P(z_{t,d} = i) = p^{1−i} · (1 − p)^i with dropout probability p ∈ [0, 1]:

Training: x_t := x_t ⊙ z_t / (1 − p) for t = 1, …, T.
Recognition: x_t := x_t for t = 1, …, T.
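This is "inverted" dropout: scaling by 1/(1 − p) at training time means recognition needs no change. A sketch:

```python
import numpy as np

def dropout(x: np.ndarray, p: float, training: bool) -> np.ndarray:
    if not training:
        return x                 # recognition: x_t := x_t
    z = (np.random.rand(*x.shape) >= p).astype(x.dtype)  # z_d = 1 with prob 1-p
    return x * z / (1.0 - p)     # training: x_t := x_t . z_t / (1-p)
```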

SLIDE 63

Regularization (II)

L_p norm:

‖θ‖_p = ( Σ_{j=1}^{|θ|} |θ_j|^p )^{1/p}

Training criterion regularization: F_p(θ) = F(θ) + λ · ‖θ‖_p with a scalar λ.
Smoothes the training criterion.
Pushes the free parameter weights closer to zero.
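The regularized criterion in two lines (a sketch; `lam` plays the role of λ):

```python
import numpy as np

def lp_norm(theta: np.ndarray, p: float) -> float:
    return float(np.sum(np.abs(theta) ** p) ** (1.0 / p))

# F_p(theta) = F(theta) + lambda * ||theta||_p
def regularized_criterion(F_value: float, theta: np.ndarray,
                          lam: float, p: float = 2.0) -> float:
    return F_value + lam * lp_norm(theta, p)
```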

SLIDE 64

Where Are We?

1. Recap: Deep Neural Network
2. Multilingual Bottleneck Features
3. Convolutional Neural Networks
4. Recurrent Neural Networks
5. Unstable Gradient Problem
6. Attention-based End-to-End ASR

SLIDE 65

Attention-based End-to-End Architecture

SLIDE 66

Attention model

SLIDE 67

Formal Definition: Content Focus

Input vector sequence: x_t ∈ ℝ^D, t = 1, …, T
Hidden outputs: h_m, m = 1, …, M
Scorer: ε_{m,t} = tanh(V_ε · x_t + b_ε) for t = 1, …, T, m = 1, …, M
Generator: α_{m,t} = σ(W_α · ε_{m,t}) / Σ_{τ=1}^{T} σ(W_α · ε_{m,τ}) for t = 1, …, T, m = 1, …, M
Glimpse: g_m = Σ_{t=1}^{T} α_{m,t} · x_t for m = 1, …, M
Output: h_m = σ(W_h · g_m + b_h) for m = 1, …, M
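One attention step m as numpy (a sketch with illustrative shapes; the normalization runs over the time axis, as in the generator above):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# xs: (T, D) inputs; V_eps: (E, D); b_eps: (E,); w_alpha: (E,); W_h: (H, D); b_h: (H,).
def content_attention(xs, V_eps, b_eps, w_alpha, W_h, b_h):
    eps = np.tanh(xs @ V_eps.T + b_eps)   # scorer, one score vector per frame
    scores = sigmoid(eps @ w_alpha)       # unnormalized attention weights
    alpha = scores / scores.sum()         # generator: normalize over t = 1..T
    g = alpha @ xs                        # glimpse: weighted sum of the inputs
    return sigmoid(W_h @ g + b_h)         # output h_m
```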

SLIDE 68

Formal Definition: Recurrent Attention

Scorer: ε_{m,t} = tanh(W_ε · x_t + R_ε · s_{m−1} + U_ε · (F_ε ∗ α_{m−1}) + b_ε)
Generator: α_{m,t} = σ(W_α · ε_{m,t}) / Σ_{τ=1}^{T} σ(W_α · ε_{m,τ}) for t = 1, …, T, m = 1, …, M
Glimpse: g_m = Σ_{t=1}^{T} α_{m,t} · x_t for m = 1, …, M
GRU state: s_m = GRU(g_m, h_m, s_{m−1}) for m = 1, …, M
Output: h_m = σ(W_h · g_m + R_h · s_{m−1} + b_h) for m = 1, …, M

SLIDE 69

End-to-End Performance

Table: TIMIT

                WER [%]
Model           dev   eval
HMM             13.9  16.7
End-to-end      15.8  17.6
RNN Transducer  n/a   17.7

SLIDE 70
References

[1] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," CoRR, vol. abs/1503.04069, 2015. [Online]. Available: http://arxiv.org/abs/1503.04069

[2] K. Cho, B. Van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734. [Online]. Available: http://www.aclweb.org/anthology/D14-1179

[3] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014. [Online]. Available: http://arxiv.org/abs/1412.3555

SLIDE 71
[4] R. Józefowicz, W. Zaremba, and I. Sutskever, "An empirical exploration of recurrent network architectures," in ICML, ser. JMLR Proceedings, vol. 37. JMLR.org, 2015, pp. 2342–2350.

[5] A. Graves, N. Jaitly, and A. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in ASRU. IEEE, 2013, pp. 273–278.

[6] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in INTERSPEECH. ISCA, 2014, pp. 338–342.

[7] A.-R. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig, and G. Penn, "Deep bi-directional recurrent networks over spectral windows," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, December 2015, pp. 78–83. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=259236