SLIDE 1

CS480/680 Lecture 15: June 26, 2019

Deep Neural Networks [GBC] Chap. 6, 7, 8

Pascal Poupart, University of Waterloo
SLIDE 2

Outline

  • Deep Neural Networks
    – Gradient vanishing
      • Rectified linear units
    – Overfitting
      • Dropout
  • Breakthroughs
    – Acoustic modeling in speech recognition
    – Image recognition

SLIDE 3

Deep Neural Networks

  • Definition: neural network with many hidden layers
  • Advantage: high expressivity
  • Challenges:

– How should we train a deep neural network?
– How can we avoid overfitting?

SLIDE 4

Expressiveness

  • Neural networks with one hidden layer of sigmoid/hyperbolic units can approximate arbitrarily closely neural networks with several layers of sigmoid/hyperbolic units
  • However, as we increase the number of layers, the number of units needed may decrease exponentially

SLIDE 5


Example – Parity Function

  • Single layer of hidden nodes

[Figure: n inputs x_1, ..., x_n feed a single layer of 2^{n-1} hidden units, one per odd subset of the inputs; the output is y = 1 if an odd number of inputs are on and y = -1 otherwise.]

SLIDE 6


Example – Parity Function

  • 2" − 2 layers of hidden nodes

!! !" !# "#$ "#$ %& = ( 1 *+ %$$ −1 *+ -.-# "#$ "#$ !$ %& "#$ "#$ %& 2 odd subsets 2 odd subsets 2 odd subsets / = 1 / = −1

SLIDE 7


The power of depth (practice)

  • Challenge: how to train deep NNs?

SLIDE 8


Speech

  • 2006 (Hinton et al.): first effective algorithm for training deep NNs
    – layerwise training of stacked Restricted Boltzmann Machines (SRBMs)
  • 2009: breakthrough in acoustic modeling
    – replace Gaussian mixture models by SRBMs
    – improved speech recognition at Google, Microsoft, IBM
  • 2013-today: recurrent neural nets (LSTMs)
    – Google error rate: 23% (2013) → 8% (2015)
    – Microsoft error rate: 5.9% (Oct 17, 2016), same as human performance

SLIDE 9


Image Classification

  • ImageNet Large Scale Visual Recognition Challenge

[Figure: classification error (%) by system and year]
    – Features + SVMs: NEC (2010) 28.2; XRCE (2011) 25.8
    – Deep convolutional neural nets: AlexNet (2012) 16.4; ZF (2013) 11.7; VGG (2014) 7.3; GoogLeNet (2014) 6.7; ResNet (2015) 3.57; GoogLeNet-v4 (2016) 3.07
    – Human: 5.1
    – Depth grew from 8 layers (AlexNet) through 19 (VGG) and 22 (GoogLeNet) to 152 (ResNet)

SLIDE 10


Vanishing Gradients

  • Deep neural networks of sigmoid and hyperbolic units often suffer from vanishing gradients

[Figure: gradient magnitudes by layer: large near the output, medium in the middle, small near the input.]

SLIDE 11


Sigmoid and hyperbolic units

  • Derivative is always less than 1

[Figure: sigmoid and hyperbolic tangent activations with their derivatives.]
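For reference, the closed-form derivatives behind this claim (standard identities, added here for completeness):

\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \le \tfrac{1}{4}, \qquad \tanh'(x) = 1 - \tanh^2(x) \le 1

with both maxima attained at x = 0.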

SLIDE 12


Simple Example

  • y = \sigma(w_4\, \sigma(w_3\, \sigma(w_2\, \sigma(w_1 x))))
  • Common weight initialization in (−1, 1)
  • Sigmoid function and its derivative are always less than 1
  • This leads to vanishing gradients:

\partial y / \partial w_4 = \sigma'(a_4)\, \sigma(a_3)
\partial y / \partial w_3 = \sigma'(a_4)\, w_4\, \sigma'(a_3)\, \sigma(a_2)
\partial y / \partial w_2 = \sigma'(a_4)\, w_4\, \sigma'(a_3)\, w_3\, \sigma'(a_2)\, \sigma(a_1)
\partial y / \partial w_1 = \sigma'(a_4)\, w_4\, \sigma'(a_3)\, w_3\, \sigma'(a_2)\, w_2\, \sigma'(a_1)\, x

where a_1 = w_1 x and a_l = w_l \sigma(a_{l-1}) are the pre-activations.

[Figure: chain network x → h_1 → h_2 → h_3 → y with weights w_1, w_2, w_3, w_4.]


As the product of factors less than 1 gets longer, the gradient vanishes.
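A quick numeric illustration (a Python sketch under the slide's assumptions: scalar units, weights drawn uniformly from (−1, 1)): the average magnitude of \partial y / \partial w_1 collapses as depth grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

def avg_grad_w1(depth, trials=1000):
    """Average |dy/dw_1| for y = sigmoid(w_d * ... sigmoid(w_1 * x))
    with weights and input drawn uniformly from (-1, 1)."""
    total = 0.0
    for _ in range(trials):
        w = rng.uniform(-1, 1, size=depth)
        x = rng.uniform(-1, 1)
        # forward pass, keeping the pre-activations a_l
        a, h = [], x
        for w_l in w:
            a.append(w_l * h)
            h = sigmoid(a[-1])
        # chain rule: dy/dw_1 = sigma'(a_d) w_d ... sigma'(a_2) w_2 sigma'(a_1) x
        g = x
        for l in range(depth):
            s = sigmoid(a[l])
            g *= s * (1 - s)          # sigma'(a_l) <= 1/4
            if l + 1 < depth:
                g *= w[l + 1]         # |w| < 1
        total += abs(g)
    return total / trials

for d in (2, 4, 8, 16):
    print(d, avg_grad_w1(d))          # shrinks rapidly with depth
```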

SLIDE 13


Avoiding Vanishing Gradients

  • Several popular solutions:

– Pre-training
– Rectified linear units and maxout units
– Skip connections (see the sketch below)
– Batch normalization
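As a toy illustration of the skip-connection entry (our own scalar sketch, not from the slides): with a residual update h ← h + \sigma(wh), each layer contributes a Jacobian factor 1 + w\,\sigma'(wh) instead of w\,\sigma'(wh), so the product of factors no longer collapses toward zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
weights = rng.uniform(-1, 1, size=20)   # one weight per layer
x = 0.3

# plain chain: h <- sigmoid(w*h); each gradient factor is w*s*(1-s), < 1/4
h, grad = x, 1.0
for w in weights:
    s = sigmoid(w * h)
    grad *= w * s * (1 - s)
    h = s
print("plain chain:", grad)             # vanishes

# same chain with skips: h <- h + sigmoid(w*h); factor is 1 + w*s*(1-s)
h, grad = x, 1.0
for w in weights:
    s = sigmoid(w * h)
    grad *= 1.0 + w * s * (1 - s)
    h = h + s
print("with skips: ", grad)             # stays on the order of 1
```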

SLIDE 14


Rectified Linear Units

  • Rectified linear: h(a) = max(0, a)
    – Gradient is 0 or 1
    – Sparse computation
  • Soft version ("softplus"): h(a) = \log(1 + e^a)
  • Warning: softplus does not prevent gradient vanishing (its gradient is always < 1)

[Figure: rectified linear and softplus activation curves.]
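A small numeric check of the two activations (our sketch): ReLU's derivative is exactly 0 or 1, while the softplus derivative is the sigmoid, hence strictly below 1 everywhere, which is why the warning above applies.

```python
import numpy as np

a = np.linspace(-4.0, 4.0, 9)

relu      = np.maximum(0.0, a)            # h(a) = max(0, a)
relu_grad = (a > 0).astype(float)         # exactly 0 or 1

softplus      = np.log1p(np.exp(a))       # h(a) = log(1 + e^a)
softplus_grad = 1.0 / (1.0 + np.exp(-a))  # sigmoid(a), strictly < 1

# ReLU passes gradients through unchanged wherever it is active;
# softplus always shrinks them, so it does not fix vanishing gradients
```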

SLIDE 15


Maxout Units

  • Generalization of rectified linear units:

h(x) = \max\big( \sum_i w_i^{(1)} x_i,\ \sum_i w_i^{(2)} x_i,\ \sum_i w_i^{(3)} x_i,\ \dots \big)

[Figure: a maxout unit takes the max over several identity (linear) units, each computing a weighted sum of the inputs x_1, ..., x_4.]
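A minimal maxout unit in Python (the shapes and the bias term are our convention; the slide's version has no bias):

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: h(x) = max_j (w_j . x + b_j) over k linear pieces.
    W has shape (k, d); b has shape (k,)."""
    return np.max(W @ x + b)

# example: a maxout unit with 3 linear pieces over 4 inputs
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)
h = maxout(x, W, b)

# with k = 2 and pieces (w . x) and 0, this reduces to max(w . x, 0),
# i.e. a rectified linear unit; the gradient flows through the single
# argmax piece with slope 1, so maxout also avoids the < 1 derivative
# of sigmoid units
```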

SLIDE 16


Overfitting

  • High expressivity increases the risk of overfitting
    – # of parameters is often larger than the amount of data
  • Some solutions:
    – Regularization
    – Dropout
    – Data augmentation

SLIDE 17


Dropout

  • Idea: randomly "drop" some units from the network when training
  • Training: at each iteration of gradient descent
    – Each input unit is dropped with probability p_1 (e.g., 0.2)
    – Each hidden unit is dropped with probability p_2 (e.g., 0.5)
  • Prediction (testing):
    – Multiply each input unit by 1 − p_1
    – Multiply each hidden unit by 1 − p_2

SLIDE 18


Dropout Algorithm

Training: let ⊙ denote elementwise multiplication

  • Repeat
    – For each training example (x_n, y_n) do
      • Sample r_n^{(l)} from Bernoulli(1 − p_l) for 1 ≤ l ≤ L
      • Neural network with dropout applied:
        \hat{y}(x_n, r_n; W) = h_L(W^{(L)\top} \cdots\, h_2(W^{(2)\top}\, h_1(W^{(1)\top} (x_n ⊙ r_n^{(1)})) ⊙ r_n^{(2)}) \cdots ⊙ r_n^{(L)})
      • Loss: Err(y_n, \hat{y}(x_n, r_n; W))
      • Update: w_{ij} ← w_{ij} − \eta\, \partial Loss / \partial w_{ij}
    – End for
  • Until convergence

Prediction:

\hat{y}(x_n; W) = h_L(W^{(L)\top} \cdots\, h_2(W^{(2)\top}\, h_1(W^{(1)\top} x_n (1 − p_1)) (1 − p_2)) \cdots (1 − p_L))
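A straight-line NumPy rendering of the two passes above (a sketch: the ReLU activation, the layer sizes, and the absence of dropout on the output layer are our choices, not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_train(x, Ws, ps):
    """One stochastic forward pass with dropout.
    Ws = [W1, ..., WL]; ps[0] is the input drop probability,
    ps[l] the drop probability after hidden layer l."""
    h = x * (rng.random(x.shape) > ps[0])       # drop input units
    for l, W in enumerate(Ws):
        h = np.maximum(0.0, W @ h)              # layer l+1 (ReLU)
        if l + 1 < len(Ws):                     # no dropout on the output
            h = h * (rng.random(h.shape) > ps[l + 1])
    return h

def forward_predict(x, Ws, ps):
    """Deterministic pass: scale by (1 - p) instead of dropping,
    matching the Prediction formula above in expectation."""
    h = x * (1.0 - ps[0])
    for l, W in enumerate(Ws):
        h = np.maximum(0.0, W @ h)
        if l + 1 < len(Ws):
            h = h * (1.0 - ps[l + 1])
    return h

# example: inputs dropped w.p. 0.2, hidden units w.p. 0.5
Ws = [rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), rng.normal(size=(4, 16))]
ps = [0.2, 0.5, 0.5]
x = rng.normal(size=8)
y_train = forward_train(x, Ws, ps)    # stochastic subnetwork
y_test  = forward_predict(x, Ws, ps)  # deterministic average
```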

SLIDE 19


Intuition

  • Dropout can be viewed as an approximate form of ensemble learning
  • In each training iteration, a different subnetwork is trained
  • At test time, these subnetworks are "merged" by averaging their weights

SLIDE 20


Applications of Deep Neural Networks

  • Speech recognition
  • Image recognition
  • Machine translation
  • Control
  • Any application of shallow neural networks

SLIDE 21


Acoustic Modeling in Speech Recognition

SLIDE 22


Acoustic Modeling in Speech Recognition

SLIDE 23


Image Recognition

  • Convolutional Neural Network

– With rectified linear units and dropout
– Data augmentation for transformation invariance (a minimal sketch follows)
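For instance, a minimal label-preserving augmentation in Python (our sketch; real pipelines such as AlexNet's also use color perturbations):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, pad=4):
    """Random horizontal flip plus random crop of an image array
    (H x W or H x W x C): two cheap transformations that leave
    the label unchanged."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                       # horizontal flip
    h, w = img.shape[:2]
    widths = ((pad, pad), (pad, pad)) + ((0, 0),) * (img.ndim - 2)
    padded = np.pad(img, widths)
    top  = rng.integers(0, 2 * pad + 1)          # random crop offsets
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]
```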

SLIDE 24


ImageNet Breakthrough

  • Results: ILSVRC-2012
  • From Krizhevsky, Sutskever, Hinton

SLIDE 25


ImageNet Breakthrough

  • From Krizhevsky, Sutskever, Hinton
