CS885 Reinforcement Learning Lecture 4a: May 11, 2018 - Deep Neural Networks



SLIDE 1

CS885 Reinforcement Learning Lecture 4a: May 11, 2018

Deep Neural Networks [GBC] Chap. 6, 7, 8


SLIDE 2


Quick recap

  • Markov Decision Processes: value iteration

! " ← max

'

( " + * ∑,- Pr "- ", 1 !("-)

  • Reinforcement Learning: Q-Learning

4 ", 1 ← 4 ", 1 + 5[7 + * max

'8 4 "-, 1- − 4(", 1)]

  • Complexity depends on number of states and actions
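
As a concrete illustration of the two updates above, here is a minimal tabular sketch in Python; the 3-state MDP, the random model `P` and `R`, and the learning rate `alpha` are illustrative placeholders, not from the lecture:

```python
import numpy as np

# Illustrative 3-state, 2-action MDP; P and R are random placeholders.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # Pr(s'|s,a)
R = np.random.rand(n_states, n_actions)                                 # R(s,a)

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s') ]
V = np.zeros(n_states)
for _ in range(100):
    V = np.max(R + gamma * (P @ V), axis=1)

# One tabular Q-learning update after observing (s, a, r, s'):
# Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
Q, alpha = np.zeros((n_states, n_actions)), 0.1
s, a, r, s_next = 0, 1, 1.0, 2
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```

Both updates index into tables of size |S| and |S| x |A|, which is why the complexity depends directly on the number of states and actions.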

SLIDE 3


Large State Spaces

  • Computer Go: $3^{361}$ states
  • Inverted pendulum: $(x, \dot{x}, \theta, \dot{\theta})$

– 4-dimensional continuous state space

  • Atari: 210x160x3 dimensions (pixel values)

SLIDE 4


Functions to be Approximated

  • Policy: $\pi(s) \to a$
  • Q-function: $Q(s, a) \in \mathbb{R}$
  • Value function: $V(s) \in \mathbb{R}$

SLIDE 5


Q-function Approximation

  • Let $s = (x_1, x_2, \ldots, x_n)^T$
  • Linear

$Q(s, a) \approx \sum_i w_{ai} x_i$

  • Non-linear (e.g., neural network)

$Q(s, a) \approx g(x; w)$
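
A minimal sketch of the two approximators, assuming the state is already encoded as a feature vector `x` (all sizes and weights below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
x = rng.random(n_features)                 # state features x = (x_1, ..., x_n)

# Linear: Q(s,a) ~ sum_i w_ai x_i, one weight vector per action
W_lin = rng.normal(size=(n_actions, n_features))
q_linear = W_lin @ x                       # Q(s,a) for every action at once

# Non-linear: Q(s,a) ~ g(x; w), here a one-hidden-layer network
W1, b1 = rng.normal(size=(8, n_features)), np.zeros(8)
W2, b2 = rng.normal(size=(n_actions, 8)), np.zeros(n_actions)
q_neural = W2 @ np.tanh(W1 @ x + b1) + b2
```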

SLIDE 6

Traditional Neural Network

  • Network of units (computational neurons) linked by weighted edges

  • Each unit computes:

$z = h(w^T x + b)$

– Inputs: $x$
– Output: $z$
– Weights (parameters): $w$
– Bias: $b$
– Activation function (usually non-linear): $h$
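
In code, one unit is just a dot product followed by the activation; a minimal sketch (the sigmoid default is an illustrative choice):

```python
import numpy as np

def unit(x, w, b, h=lambda a: 1.0 / (1.0 + np.exp(-a))):
    """One computational neuron: z = h(w^T x + b)."""
    return h(w @ x + b)

z = unit(x=np.array([0.5, -1.0]), w=np.array([0.2, 0.4]), b=0.1)
```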

SLIDE 7

One Hidden Layer Architecture

  • Feed-forward neural network
  • Hidden units: $z_j = h_1\left( w_j^{(1)T} x + b_j^{(1)} \right)$
  • Output units: $y_k = h_2\left( w_k^{(2)T} z + b_k^{(2)} \right)$
  • Overall: $y_k = h_2\left( \sum_j w_{kj}^{(2)}\, h_1\left( \sum_i w_{ji}^{(1)} x_i + b_j^{(1)} \right) + b_k^{(2)} \right)$


[Figure: feed-forward network with inputs $x_1, x_2$, hidden units $z_1, z_2$, and output $y_1$; first-layer weights $w_{ji}^{(1)}$ with biases $b_1^{(1)}, b_2^{(1)}$, second-layer weights $w_{1j}^{(2)}$ with bias $b_1^{(2)}$; columns labeled input, hidden, output]
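
The overall formula maps directly onto two matrix-vector products; a minimal sketch matching the figure's 2-2-1 shape (the tanh/identity activation choices are illustrative):

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h1=np.tanh, h2=lambda a: a):
    """z_j = h1(sum_i W1[j,i] x_i + b1[j]);  y_k = h2(sum_j W2[k,j] z_j + b2[k])."""
    z = h1(W1 @ x + b1)     # hidden layer
    return h2(W2 @ z + b2)  # output layer

# Tiny network matching the figure: 2 inputs, 2 hidden units, 1 output
rng = np.random.default_rng(0)
y = forward(np.array([1.0, -0.5]),
            W1=rng.normal(size=(2, 2)), b1=np.zeros(2),
            W2=rng.normal(size=(1, 2)), b2=np.zeros(1))
```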

SLIDE 8

Traditional activation functions $h$

  • Threshold: $h(a) = \begin{cases} 1 & a \ge 0 \\ -1 & a < 0 \end{cases}$
  • Sigmoid: $h(a) = \sigma(a) = \frac{1}{1 + e^{-a}}$
  • Gaussian: $h(a) = e^{-\frac{(a - \mu)^2}{2\sigma^2}}$
  • Tanh: $h(a) = \tanh(a) = \frac{e^a - e^{-a}}{e^a + e^{-a}}$
  • Identity: $h(a) = a$
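
All five are one-liners in numpy (a sketch; `mu` and `sigma` in the Gaussian unit are free parameters):

```python
import numpy as np

threshold = lambda a: np.where(a >= 0, 1.0, -1.0)
sigmoid   = lambda a: 1.0 / (1.0 + np.exp(-a))
gaussian  = lambda a, mu=0.0, sigma=1.0: np.exp(-((a - mu) ** 2) / (2 * sigma ** 2))
tanh      = np.tanh
identity  = lambda a: a
```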

SLIDE 9

Universal function approximation

  • Theorem: Neural networks with at least one hidden layer of sufficiently many sigmoid/tanh/Gaussian units can approximate any function arbitrarily closely.

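One way to see the theorem at work is to fix a random hidden layer of tanh units and fit only the output weights to a nonlinear target; a sketch (the target $\sin(x)$, the width 50, and the weight scale are arbitrary illustrative choices):

```python
import numpy as np

# Approximate sin(x) with one hidden layer of tanh units: input weights and
# biases are fixed at random; only the output weights are fit (least squares).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = np.sin(x)

n_hidden = 50                               # "sufficiently many" hidden units
w1 = rng.normal(size=n_hidden) * 2          # random input weights
b1 = rng.normal(size=n_hidden) * 2          # random biases
Z = np.tanh(np.outer(x, w1) + b1)           # hidden activations, shape (200, 50)
w2, *_ = np.linalg.lstsq(Z, y, rcond=None)  # output weights by least squares

print("max abs error:", np.max(np.abs(Z @ w2 - y)))
```

Widening the hidden layer (larger `n_hidden`) drives the error toward zero, which is exactly the "sufficiently many units" clause of the theorem.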

SLIDE 10

Minimize least squared error

  • Minimize error function

! " = 1 2 &

'

!' " ( = 1 2 &

'

) *+, " − .'

( (

where ) is the function encoded by the neural net

  • Train by gradient descent (a.k.a. backpropagation)

– For each example $(x_n, y_n)$, adjust the weights as follows:

$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E_n}{\partial w_{ij}}$
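
A minimal sketch of this per-example update; the gradient is estimated by finite differences so the snippet works for any black-box `f` (the names and the step size `eta` are illustrative):

```python
import numpy as np

def sgd_step(f, w, x_n, y_n, eta=0.1, eps=1e-6):
    """w_ij <- w_ij - eta * dE_n/dw_ij  with  E_n = 1/2 (f(x_n; w) - y_n)^2.
    The gradient is estimated by finite differences for illustration."""
    e = 0.5 * (f(x_n, w) - y_n) ** 2
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus = w.copy()
        w_plus.flat[i] += eps
        e_plus = 0.5 * (f(x_n, w_plus) - y_n) ** 2
        grad.flat[i] = (e_plus - e) / eps
    return w - eta * grad

# Example: a linear "network" f(x; w) = w^T x and one training example
f = lambda x, w: w @ x
w = sgd_step(f, w=np.zeros(2), x_n=np.array([1.0, 2.0]), y_n=3.0)
```

In practice, backpropagation obtains the same gradient analytically via the chain rule instead of finite differences.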

SLIDE 11

Deep Neural Networks

  • Definition: neural network with many hidden layers
  • Advantage: high expressivity
  • Challenges:

– How should we train a deep neural network?
– How can we avoid overfitting?

SLIDE 12

Mixture of Gaussians

  • Deep neural network (hierarchical mixture)
  • Shallow neural network (flat mixture)

SLIDE 13


Image Classification

  • ImageNet Large Scale Visual Recognition Challenge

[Figure: bar chart of ILSVRC classification error (%); the 2010-2011 entries use features + SVMs, the later entries use deep convolutional neural nets of increasing depth]
– NEC (2010): 28.2
– XRCE (2011): 25.8
– AlexNet (2012): 16.4 (8 layers)
– ZF (2013): 11.7
– VGG (2014): 7.3 (19 layers)
– GoogLeNet (2014): 6.7 (22 layers)
– ResNet (2015): 3.57 (152 layers)
– GoogLeNet-v4 (2016): 3.07
– Human: 5.1

SLIDE 14


Vanishing Gradients

  • Deep neural networks of sigmoid and hyperbolic units often suffer from vanishing gradients

[Figure: network diagram with gradients labeled large near the output, medium in the middle, and small near the input]

SLIDE 15


Sigmoid and hyperbolic units

  • Derivative is always less than 1

[Figure: plots of the sigmoid and hyperbolic tangent functions and their derivatives]
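
To make the bound concrete (a standard derivation, added here for reference): $\sigma'(a) = \sigma(a)\left(1 - \sigma(a)\right) \le \frac{1}{4}$ and $\tanh'(a) = 1 - \tanh^2(a) \le 1$, with the maxima attained only at $a = 0$. Multiplying many such factors along a deep chain therefore shrinks the gradient toward zero.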

SLIDE 16


Simple Example

  • $y = \sigma(w_4\, \sigma(w_3\, \sigma(w_2\, \sigma(w_1 x))))$
  • Common weight initialization in (-1,1)
  • Sigmoid function and its derivative always less than 1
  • This leads to vanishing gradients:

$\frac{\partial y}{\partial w_4} = \sigma'(a_4)\, \sigma(a_3)$

$\frac{\partial y}{\partial w_3} = \sigma'(a_4)\, w_4\, \sigma'(a_3)\, \sigma(a_2) \le \frac{\partial y}{\partial w_4}$

$\frac{\partial y}{\partial w_2} = \sigma'(a_4)\, w_4\, \sigma'(a_3)\, w_3\, \sigma'(a_2)\, \sigma(a_1) \le \frac{\partial y}{\partial w_3}$

$\frac{\partial y}{\partial w_1} = \sigma'(a_4)\, w_4\, \sigma'(a_3)\, w_3\, \sigma'(a_2)\, w_2\, \sigma'(a_1)\, x \le \frac{\partial y}{\partial w_2}$

where $a_i$ denotes the input to the $i$-th sigmoid.

[Figure: chain network $x \to h_1 \to h_2 \to h_3 \to y$ with weights $w_1, w_2, w_3, w_4$]
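
A quick numeric check of this shrinkage; a sketch following the chain above, with weights drawn from $(-1, 1)$ as on the slide (the seed and input value are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
dsigmoid = lambda a: sigmoid(a) * (1.0 - sigmoid(a))   # always <= 1/4

x = 1.0
w = rng.uniform(-1, 1, size=4)          # w_1..w_4 initialized in (-1, 1)

# Forward pass: h_0 = x, a_i = w_i * h_{i-1}, h_i = sigmoid(a_i)
h, a = [x], []
for wi in w:
    a.append(wi * h[-1])
    h.append(sigmoid(a[-1]))

# Backward pass: each earlier weight picks up an extra sigma'(a_j) * w_j
# factor (both < 1), so gradient magnitudes typically shrink layer by layer.
grads, upstream = [], 1.0
for i in reversed(range(4)):            # i = 3 (w_4) down to 0 (w_1)
    grads.append(upstream * dsigmoid(a[i]) * h[i])
    upstream *= dsigmoid(a[i]) * w[i]

print("dy/dw_4 ... dy/dw_1:", grads)
```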

SLIDE 17


Mitigating Vanishing Gradients

  • Some popular solutions:

– Pre-training
– Rectified linear units
– Batch normalization
– Skip connections

SLIDE 18


Rectified Linear Units

  • Rectified linear: $h(a) = \max(0, a)$

– Gradient is 0 or 1
– Sparse computation

  • Soft version (“Softplus”): $h(a) = \log(1 + e^a)$

  • Warning: softplus does not prevent vanishing gradients (its gradient is always < 1)

[Figure: plots of the rectified linear and softplus activations]
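
A small sketch contrasting the two gradients; the derivative of softplus is the sigmoid, which is strictly below 1:

```python
import numpy as np

relu          = lambda a: np.maximum(0.0, a)
relu_grad     = lambda a: (a > 0).astype(float)      # exactly 0 or 1
softplus      = lambda a: np.log1p(np.exp(a))
softplus_grad = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid(a), strictly < 1

a = np.array([-2.0, 0.5, 3.0])
print(relu_grad(a))       # [0. 1. 1.] -> no shrinkage where the unit is active
print(softplus_grad(a))   # all strictly < 1 -> can still vanish when chained
```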
