SLIDE 1

CS480/680 Lecture 15: June 26, 2019

Deep Neural Networks [GBC] Chap. 6, 7, 8

Pascal Poupart, University of Waterloo
SLIDE 2

Outline

  • Deep Neural Networks
    – Gradient vanishing
      • Rectified linear units
    – Overfitting
      • Dropout
  • Breakthroughs
    – Acoustic modeling in speech recognition
    – Image recognition

SLIDE 3

Deep Neural Networks

  • Definition: neural network with many hidden layers
  • Advantage: high expressivity
  • Challenges:

– How should we train a deep neural network?
– How can we avoid overfitting?

SLIDE 4

Expressiveness

  • Neural networks with one hidden layer of sigmoid/hyperbolic units can approximate arbitrarily closely neural networks with several layers of sigmoid/hyperbolic units
  • However, as we increase the number of layers, the number of units needed may decrease exponentially

SLIDE 5


Example – Parity Function

  • Single layer of hidden nodes

[Figure: n inputs x_1, ..., x_n feed a single layer of 2^{n-1} hidden units, one per odd subset of the inputs; the output is y = 1 if an odd number of inputs are on and y = -1 otherwise.]

SLIDE 6


Example – Parity Function

  • 2" − 2 layers of hidden nodes

!! !" !# "#$ "#$ %& = ( 1 *+ %$$ −1 *+ -.-# "#$ "#$ !$ %& "#$ "#$ %& 2 odd subsets 2 odd subsets 2 odd subsets / = 1 / = −1

SLIDE 7


The power of depth (practice)

  • Challenge: how to train deep NNs?

SLIDE 8


Speech

  • 2006 (Hinton et al.): first effective algorithm for training deep NNs
    – layerwise training of stacked Restricted Boltzmann Machines (SRBMs)
  • 2009: breakthrough in acoustic modeling
    – replace Gaussian mixture models by SRBMs
    – improved speech recognition at Google, Microsoft, IBM
  • 2013-today: recurrent neural nets (LSTMs)
    – Google error rate: 23% (2013) → 8% (2015)
    – Microsoft error rate: 5.9% (Oct 17, 2016), same as human performance

SLIDE 9


Image Classification

  • ImageNet Large Scale Visual Recognition Challenge

[Figure: classification error (%) by system and year]
    – Features + SVMs: NEC (2010) 28.2; XRCE (2011) 25.8
    – Deep convolutional neural nets: AlexNet (2012) 16.4; ZF (2013) 11.7; VGG (2014) 7.3; GoogLeNet (2014) 6.7; ResNet (2015) 3.57; GoogLeNet-v4 (2016) 3.07
    – Human: 5.1
    – Depth grew from 8 layers (AlexNet) through 19 (VGG) and 22 (GoogLeNet) to 152 (ResNet)

SLIDE 10


Vanishing Gradients

  • Deep neural networks of sigmoid and hyperbolic units often suffer from vanishing gradients

[Figure: gradient magnitudes by layer: large near the output, medium in the middle, small near the input.]

SLIDE 11


Sigmoid and hyperbolic units

  • Derivative is always less than 1

[Figure: sigmoid and hyperbolic tangent activations with their derivatives.]
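For reference, the closed-form derivatives behind this claim (standard identities, added here for completeness):

\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \le \tfrac{1}{4}, \qquad \tanh'(x) = 1 - \tanh^2(x) \le 1

with both maxima attained at x = 0.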

SLIDE 12


Simple Example

  • y = \sigma(w_4\, \sigma(w_3\, \sigma(w_2\, \sigma(w_1 x))))
  • Common weight initialization in (−1, 1)
  • Sigmoid function and its derivative are always less than 1
  • This leads to vanishing gradients:

\partial y / \partial w_4 = \sigma'(a_4)\, \sigma(a_3)
\partial y / \partial w_3 = \sigma'(a_4)\, w_4\, \sigma'(a_3)\, \sigma(a_2)
\partial y / \partial w_2 = \sigma'(a_4)\, w_4\, \sigma'(a_3)\, w_3\, \sigma'(a_2)\, \sigma(a_1)
\partial y / \partial w_1 = \sigma'(a_4)\, w_4\, \sigma'(a_3)\, w_3\, \sigma'(a_2)\, w_2\, \sigma'(a_1)\, x

where a_1 = w_1 x and a_l = w_l \sigma(a_{l-1}) are the pre-activations.

[Figure: chain network x → h_1 → h_2 → h_3 → y with weights w_1, w_2, w_3, w_4.]


As the product of factors less than 1 gets longer, the gradient vanishes.
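A quick numeric illustration (a Python sketch under the slide's assumptions: scalar units, weights drawn uniformly from (−1, 1)): the average magnitude of \partial y / \partial w_1 collapses as depth grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

def avg_grad_w1(depth, trials=1000):
    """Average |dy/dw_1| for y = sigmoid(w_d * ... sigmoid(w_1 * x))
    with weights and input drawn uniformly from (-1, 1)."""
    total = 0.0
    for _ in range(trials):
        w = rng.uniform(-1, 1, size=depth)
        x = rng.uniform(-1, 1)
        # forward pass, keeping the pre-activations a_l
        a, h = [], x
        for w_l in w:
            a.append(w_l * h)
            h = sigmoid(a[-1])
        # chain rule: dy/dw_1 = sigma'(a_d) w_d ... sigma'(a_2) w_2 sigma'(a_1) x
        g = x
        for l in range(depth):
            s = sigmoid(a[l])
            g *= s * (1 - s)          # sigma'(a_l) <= 1/4
            if l + 1 < depth:
                g *= w[l + 1]         # |w| < 1
        total += abs(g)
    return total / trials

for d in (2, 4, 8, 16):
    print(d, avg_grad_w1(d))          # shrinks rapidly with depth
```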

SLIDE 13


Avoiding Vanishing Gradients

  • Several popular solutions:

– Pre-training
– Rectified linear units and maxout units
– Skip connections (see the sketch below)
– Batch normalization
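As a toy illustration of the skip-connection entry (our own scalar sketch, not from the slides): with a residual update h ← h + \sigma(wh), each layer contributes a Jacobian factor 1 + w\,\sigma'(wh) instead of w\,\sigma'(wh), so the product of factors no longer collapses toward zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
weights = rng.uniform(-1, 1, size=20)   # one weight per layer
x = 0.3

# plain chain: h <- sigmoid(w*h); each gradient factor is w*s*(1-s), < 1/4
h, grad = x, 1.0
for w in weights:
    s = sigmoid(w * h)
    grad *= w * s * (1 - s)
    h = s
print("plain chain:", grad)             # vanishes

# same chain with skips: h <- h + sigmoid(w*h); factor is 1 + w*s*(1-s)
h, grad = x, 1.0
for w in weights:
    s = sigmoid(w * h)
    grad *= 1.0 + w * s * (1 - s)
    h = h + s
print("with skips: ", grad)             # stays on the order of 1
```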

SLIDE 14


Rectified Linear Units

  • Rectified linear: h(a) = max(0, a)
    – Gradient is 0 or 1
    – Sparse computation
  • Soft version ("softplus"): h(a) = \log(1 + e^a)
  • Warning: softplus does not prevent gradient vanishing (its gradient is always < 1)

[Figure: rectified linear and softplus activation curves.]
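A small numeric check of the two activations (our sketch): ReLU's derivative is exactly 0 or 1, while the softplus derivative is the sigmoid, hence strictly below 1 everywhere, which is why the warning above applies.

```python
import numpy as np

a = np.linspace(-4.0, 4.0, 9)

relu      = np.maximum(0.0, a)            # h(a) = max(0, a)
relu_grad = (a > 0).astype(float)         # exactly 0 or 1

softplus      = np.log1p(np.exp(a))       # h(a) = log(1 + e^a)
softplus_grad = 1.0 / (1.0 + np.exp(-a))  # sigmoid(a), strictly < 1

# ReLU passes gradients through unchanged wherever it is active;
# softplus always shrinks them, so it does not fix vanishing gradients
```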

SLIDE 15


Maxout Units

  • Generalization of rectified linear units:

h(x) = \max\big( \sum_i w_i^{(1)} x_i,\ \sum_i w_i^{(2)} x_i,\ \sum_i w_i^{(3)} x_i,\ \dots \big)

[Figure: a maxout unit takes the max over several identity (linear) units, each computing a weighted sum of the inputs x_1, ..., x_4.]
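A minimal maxout unit in Python (the shapes and the bias term are our convention; the slide's version has no bias):

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: h(x) = max_j (w_j . x + b_j) over k linear pieces.
    W has shape (k, d); b has shape (k,)."""
    return np.max(W @ x + b)

# example: a maxout unit with 3 linear pieces over 4 inputs
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)
h = maxout(x, W, b)

# with k = 2 and pieces (w . x) and 0, this reduces to max(w . x, 0),
# i.e. a rectified linear unit; the gradient flows through the single
# argmax piece with slope 1, so maxout also avoids the < 1 derivative
# of sigmoid units
```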

SLIDE 16


Overfitting

  • High expressivity increases the risk of overfitting
    – # of parameters is often larger than the amount of data
  • Some solutions:
    – Regularization
    – Dropout
    – Data augmentation

SLIDE 17


Dropout

  • Idea: randomly "drop" some units from the network when training
  • Training: at each iteration of gradient descent
    – Each input unit is dropped with probability p_1 (e.g., 0.2)
    – Each hidden unit is dropped with probability p_2 (e.g., 0.5)
  • Prediction (testing):
    – Multiply each input unit by 1 − p_1
    – Multiply each hidden unit by 1 − p_2

SLIDE 18


Dropout Algorithm

Training: let ⊙ denote elementwise multiplication

  • Repeat
    – For each training example (x_n, y_n) do
      • Sample r_n^{(l)} from Bernoulli(1 − p_l) for 1 ≤ l ≤ L
      • Neural network with dropout applied:
        \hat{y}(x_n, r_n; W) = h_L(W^{(L)\top} \cdots\, h_2(W^{(2)\top}\, h_1(W^{(1)\top} (x_n ⊙ r_n^{(1)})) ⊙ r_n^{(2)}) \cdots ⊙ r_n^{(L)})
      • Loss: Err(y_n, \hat{y}(x_n, r_n; W))
      • Update: w_{ij} ← w_{ij} − \eta\, \partial Loss / \partial w_{ij}
    – End for
  • Until convergence

Prediction:

\hat{y}(x_n; W) = h_L(W^{(L)\top} \cdots\, h_2(W^{(2)\top}\, h_1(W^{(1)\top} x_n (1 − p_1)) (1 − p_2)) \cdots (1 − p_L))
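A straight-line NumPy rendering of the two passes above (a sketch: the ReLU activation, the layer sizes, and the absence of dropout on the output layer are our choices, not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_train(x, Ws, ps):
    """One stochastic forward pass with dropout.
    Ws = [W1, ..., WL]; ps[0] is the input drop probability,
    ps[l] the drop probability after hidden layer l."""
    h = x * (rng.random(x.shape) > ps[0])       # drop input units
    for l, W in enumerate(Ws):
        h = np.maximum(0.0, W @ h)              # layer l+1 (ReLU)
        if l + 1 < len(Ws):                     # no dropout on the output
            h = h * (rng.random(h.shape) > ps[l + 1])
    return h

def forward_predict(x, Ws, ps):
    """Deterministic pass: scale by (1 - p) instead of dropping,
    matching the Prediction formula above in expectation."""
    h = x * (1.0 - ps[0])
    for l, W in enumerate(Ws):
        h = np.maximum(0.0, W @ h)
        if l + 1 < len(Ws):
            h = h * (1.0 - ps[l + 1])
    return h

# example: inputs dropped w.p. 0.2, hidden units w.p. 0.5
Ws = [rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), rng.normal(size=(4, 16))]
ps = [0.2, 0.5, 0.5]
x = rng.normal(size=8)
y_train = forward_train(x, Ws, ps)    # stochastic subnetwork
y_test  = forward_predict(x, Ws, ps)  # deterministic average
```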

SLIDE 19


Intuition

  • Dropout can be viewed as an approximate form of ensemble learning
  • In each training iteration, a different subnetwork is trained
  • At test time, these subnetworks are "merged" by averaging their weights

SLIDE 20


Applications of Deep Neural Networks

  • Speech recognition
  • Image recognition
  • Machine translation
  • Control
  • Any application of shallow neural networks

SLIDE 21


Acoustic Modeling in Speech Recognition

SLIDE 22


Acoustic Modeling in Speech Recognition

SLIDE 23


Image Recognition

  • Convolutional Neural Network

– With rectified linear units and dropout
– Data augmentation for transformation invariance (a minimal sketch follows)
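For instance, a minimal label-preserving augmentation in Python (our sketch; real pipelines such as AlexNet's also use color perturbations):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, pad=4):
    """Random horizontal flip plus random crop of an image array
    (H x W or H x W x C): two cheap transformations that leave
    the label unchanged."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                       # horizontal flip
    h, w = img.shape[:2]
    widths = ((pad, pad), (pad, pad)) + ((0, 0),) * (img.ndim - 2)
    padded = np.pad(img, widths)
    top  = rng.integers(0, 2 * pad + 1)          # random crop offsets
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]
```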

SLIDE 24


ImageNet Breakthrough

  • Results: ILSVRC-2012
  • From Krizhevsky, Sutskever, Hinton

SLIDE 25


ImageNet Breakthrough

  • From Krizhevsky, Sutskever, Hinton
