

SLIDE 1

Neural Networks II

Chen Gao Virginia Tech

Spring 2019

ECE-5424G / CS-5824

SLIDE 2

Neural Networks

  • Origins: Algorithms that try to mimic the brain.

What is this?

SLIDE 3

A single neuron in the brain

[Diagram: a single biological neuron, with its input and output labeled]

Slide credit: Andrew Ng

SLIDE 4

An artificial neuron: Logistic unit

$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \end{bmatrix}$

  • Sigmoid (logistic) activation function

[Diagram: "input" units $x_1, x_2, x_3$ plus the "bias unit" $x_0$, connected by "weights"/"parameters" $\theta$ to a single "output" unit]

$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

Slide credit: Andrew Ng

SLIDE 5

Visualization of weights, bias, activation function

The bias $b$ only changes the position of the hyperplane; the range of the output is determined by the activation function $g(\cdot)$.

Slide credit: Hugo Larochelle

SLIDE 6

Activation - sigmoid

  • Squashes the neuron’s pre-activation between 0 and 1

  • Always positive
  • Bounded
  • Strictly increasing

$g(z) = \frac{1}{1 + e^{-z}}$

Slide credit: Hugo Larochelle

SLIDE 7

Activation - hyperbolic tangent (tanh)

  • Squashes the neuron’s pre-activation between -1 and 1

  • Can be positive or negative
  • Bounded
  • Strictly increasing

$g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

Slide credit: Hugo Larochelle

SLIDE 8

Activation - rectified linear (ReLU)

  • Bounded below by 0
  • always non-negative
  • Not upper bounded
  • Tends to give neurons with sparse activities

Slide credit: Hugo Larochelle

$g(z) = \mathrm{relu}(z) = \max(0, z)$
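For concreteness, here is a small NumPy sketch of the three activations from the last slides; the input values are arbitrary examples:

```python
import numpy as np

def sigmoid(z):
    # squashes into (0, 1): always positive, bounded, strictly increasing
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes into (-1, 1): positive or negative, bounded, strictly increasing
    return np.tanh(z)

def relu(z):
    # bounded below by 0, not upper bounded, tends to give sparse activations
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for g in (sigmoid, tanh, relu):
    print(g.__name__, g(z))
```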

SLIDE 9

Activation - softmax

  • For multi-class classification:
  • we need multiple outputs (1 output per class)
  • we would like to estimate the conditional probability $p(y = c \mid x)$
  • We use the softmax activation function at the output:

$g(z) = \mathrm{softmax}(z) = \left[ \frac{e^{z_1}}{\sum_c e^{z_c}} \;\; \cdots \;\; \frac{e^{z_C}}{\sum_c e^{z_c}} \right]^\top$

Slide credit: Hugo Larochelle
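A minimal NumPy sketch of the softmax; subtracting `max(z)` before exponentiating is a standard numerical-stability trick, an implementation detail not on the slide:

```python
import numpy as np

def softmax(z):
    # shift by max(z) for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # one pre-activation per class
p = softmax(z)
print(p, p.sum())               # conditional probabilities; they sum to 1
```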

SLIDE 10

Universal approximation theorem

β€œA single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units” (Hornik, 1991)

Slide credit: Hugo Larochelle

SLIDE 11

Neural network – Multilayer

[Network diagram: input units $x_0, x_1, x_2, x_3$ (Layer 1), hidden units $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ (Layer 2, hidden), and the output $h_\Theta(x)$ (Layer 3, β€œOutput”)]

Slide credit: Andrew Ng

SLIDE 12

Neural network

[Network diagram as on the previous slide: inputs $x_0, \ldots, x_3$, hidden units $a_0^{(2)}, \ldots, a_3^{(2)}$, output $h_\Theta(x)$]

$a_j^{(k)}$ = β€œactivation” of unit $j$ in layer $k$

$\Theta^{(k)}$ = matrix of weights controlling the function mapping from layer $k$ to layer $k+1$

$a_1^{(2)} = g\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right)$

$a_2^{(2)} = g\left(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\right)$

$a_3^{(2)} = g\left(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\right)$

$h_\Theta(x) = g\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right)$

If there are $s_k$ units in layer $k$ and $s_{k+1}$ units in layer $k+1$, what is the size of $\Theta^{(k)}$? It is $s_{k+1} \times (s_k + 1)$.

Slide credit: Andrew Ng
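A minimal NumPy sketch of one layer's mapping, using illustrative sizes and random weights; note `Theta1` has shape $s_{k+1} \times (s_k + 1)$ because of the bias unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

s_k, s_k1 = 3, 3                       # 3 input units, 3 hidden units
rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.1, 0.1, size=(s_k1, s_k + 1))  # s_{k+1} x (s_k + 1)

x = np.array([0.5, -1.2, 0.3])         # inputs x_1..x_3 (toy values)
a1 = np.concatenate(([1.0], x))        # prepend the bias unit x_0 = 1
a2 = sigmoid(Theta1 @ a1)              # activations a_1^(2)..a_3^(2)
print(Theta1.shape, a2)                # (3, 4) and three activations
```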

SLIDE 13

Neural network

[Network diagram as before: inputs $x_0, \ldots, x_3$, hidden units $a_0^{(2)}, \ldots, a_3^{(2)}$, output $h_\Theta(x)$]

$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$

$a_1^{(2)} = g\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right) = g(z_1^{(2)})$

$a_2^{(2)} = g\left(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\right) = g(z_2^{(2)})$

$a_3^{(2)} = g\left(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\right) = g(z_3^{(2)})$

$h_\Theta(x) = g\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right) = g(z^{(3)})$

$z$ is the β€œpre-activation”.

Slide credit: Andrew Ng

Why do we need g(.)?

SLIDE 14

Neural network

[Network diagram as before: inputs $x_0, \ldots, x_3$, hidden units $a_0^{(2)}, \ldots, a_3^{(2)}$, output $h_\Theta(x)$]

$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$

$a_1^{(2)} = g(z_1^{(2)}), \quad a_2^{(2)} = g(z_2^{(2)}), \quad a_3^{(2)} = g(z_3^{(2)}), \quad h_\Theta(x) = g(z^{(3)})$

Vectorized, with the β€œpre-activation” $z$:

$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$

$a^{(2)} = g(z^{(2)})$, then add $a_0^{(2)} = 1$

$z^{(3)} = \Theta^{(2)} a^{(2)}$

$h_\Theta(x) = a^{(3)} = g(z^{(3)})$

Slide credit: Andrew Ng
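The vectorized forward pass as a NumPy sketch; the layer sizes and weights below are made-up illustrations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(a):
    """Prepend the bias unit a_0 = 1."""
    return np.concatenate(([1.0], a))

rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.1, 0.1, size=(3, 4))  # layer 1 -> layer 2
Theta2 = rng.uniform(-0.1, 0.1, size=(1, 4))  # layer 2 -> layer 3

x = np.array([0.5, -1.2, 0.3])
a1 = add_bias(x)            # a^(1), with x_0 = 1
z2 = Theta1 @ a1            # z^(2) = Theta^(1) a^(1)
a2 = add_bias(sigmoid(z2))  # a^(2) = g(z^(2)), add a_0^(2) = 1
z3 = Theta2 @ a2            # z^(3) = Theta^(2) a^(2)
h  = sigmoid(z3)            # h_Theta(x) = a^(3) = g(z^(3))
print(h)
```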

SLIDE 15

Flow graph - Forward propagation

[Flow graph: $x \xrightarrow{\Theta^{(1)}} z^{(2)} \to a^{(2)} \xrightarrow{\Theta^{(2)}} z^{(3)} \to a^{(3)} = h_\Theta(x)$]

$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$

$a^{(2)} = g(z^{(2)})$, then add $a_0^{(2)} = 1$

$z^{(3)} = \Theta^{(2)} a^{(2)}$

$h_\Theta(x) = a^{(3)} = g(z^{(3)})$

How do we evaluate our prediction?
SLIDE 16

Cost function

Logistic regression:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$

Neural network (with $K$ output units):

$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + (1 - y_k^{(i)}) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l} \sum_{i} \sum_{j} \left(\Theta_{ji}^{(l)}\right)^2$

Slide credit: Andrew Ng
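Assuming the standard (unregularized) cross-entropy form above, a short NumPy sketch of the cost on a toy batch; `preds` and `labels` are invented values:

```python
import numpy as np

def cross_entropy_cost(preds, labels, eps=1e-12):
    """J = -(1/m) * sum over examples and output units of
    y*log(h) + (1-y)*log(1-h); eps guards against log(0)."""
    preds = np.clip(preds, eps, 1.0 - eps)
    return -np.mean(np.sum(labels * np.log(preds)
                           + (1 - labels) * np.log(1 - preds), axis=1))

preds  = np.array([[0.9, 0.1], [0.2, 0.8]])   # h_Theta(x), one row per example
labels = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot targets y
print(cross_entropy_cost(preds, labels))
```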

SLIDE 17

Gradient computation

Need to compute: $J(\Theta)$ and the partial derivatives $\frac{\partial}{\partial \Theta_{ij}^{(k)}} J(\Theta)$

Slide credit: Andrew Ng

SLIDE 18

Gradient computation

Given one training example $(x, y)$:

$a^{(1)} = x$

$z^{(2)} = \Theta^{(1)} a^{(1)}$

$a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)}$)

$z^{(3)} = \Theta^{(2)} a^{(2)}$

$a^{(3)} = g(z^{(3)})$ (add $a_0^{(3)}$)

$z^{(4)} = \Theta^{(3)} a^{(3)}$

$a^{(4)} = g(z^{(4)}) = h_\Theta(x)$

Slide credit: Andrew Ng

SLIDE 19

Gradient computation: Backpropagation

Intuition: πœ€

π‘˜ (π‘š) = β€œerror” of node π‘˜ in layer π‘š

For each output unit (layer L = 4)

Slide credit: Andrew Ng

πœ€(4) = 𝑏(4) βˆ’ 𝑧 πœ€(3) = πœ€(4) πœ–πœ€(4)

πœ–π‘¨(3) = πœ€(4) πœ–πœ€(4) πœ–π‘(4) πœ–π‘(4) πœ–π‘¨(4) πœ–π‘¨(4) πœ–π‘(3) πœ–π‘(3) πœ–π‘¨(3)

= 1 *Θ 3 π‘ˆπœ€(4) .βˆ— 𝑕′ 𝑨 4 .βˆ— 𝑕′(𝑨(3))

𝑨(3) = Θ(2)𝑏(2) 𝑏(3) = 𝑕(𝑨(3)) 𝑨(4) = Θ(3)𝑏(3) 𝑏(4) = 𝑕 𝑨 4
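A NumPy sketch of these deltas. It follows the common sigmoid-output, cross-entropy case, where $\delta^{(4)} = a^{(4)} - y$ already absorbs $g'(z^{(4)})$, so only $g'(z^{(3)})$ appears; the shapes and weights are illustrative, and dropping the bias row before back-propagating is an implementation detail the slide leaves implicit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1.0 - g)

# Toy forward pass (layer sizes made up); a3 has a bias unit prepended.
rng = np.random.default_rng(0)
Theta3 = rng.uniform(-0.1, 0.1, size=(2, 4))    # layer 3 -> layer 4
z3 = rng.normal(size=3)                          # pre-activations of layer 3
a3 = np.concatenate(([1.0], sigmoid(z3)))        # a^(3) with a_0^(3) = 1
z4 = Theta3 @ a3
a4 = sigmoid(z4)                                 # network output a^(4)
y  = np.array([1.0, 0.0])                        # target

delta4 = a4 - y                                  # delta^(4) = a^(4) - y
# Back-propagate: (Theta^(3))^T delta^(4), drop the bias row, then .* g'(z^(3))
delta3 = (Theta3.T @ delta4)[1:] * sigmoid_prime(z3)
print(delta4, delta3)
```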

SLIDE 20

Backpropagation algorithm

Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$

Initialize the weights $\Theta^{(l)}$ (randomly, to break symmetry; see the initialization slide)

For $i = 1$ to $m$:
  Set $a^{(1)} = x^{(i)}$
  Perform forward propagation to compute $a^{(l)}$ for $l = 2, \ldots, L$
  Use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
  Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$
  Update $\Theta^{(l)} \leftarrow \Theta^{(l)} - \alpha \, \delta^{(l+1)} \left(a^{(l)}\right)^\top$ (with learning rate $\alpha$)

Slide credit: Andrew Ng
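Putting the loop together, a hedged NumPy sketch of per-example backpropagation on toy data; the architecture, labels, and learning rate are all assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # m = 100 examples, 3 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy binary labels

Theta1 = rng.uniform(-0.5, 0.5, size=(4, 4))  # layer 1 -> 2 (4 hidden units)
Theta2 = rng.uniform(-0.5, 0.5, size=(1, 5))  # layer 2 -> 3
alpha = 0.1                                   # learning rate (assumed)

for epoch in range(50):
    for i in range(len(X)):
        # Forward propagation
        a1 = np.concatenate(([1.0], X[i]))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)
        # Backward pass: delta^(3) = a^(3) - y, then delta^(2)
        delta3 = a3 - y[i]
        delta2 = (Theta2.T @ delta3)[1:] * sigmoid(z2) * (1 - sigmoid(z2))
        # Gradient step: Theta^(l) -= alpha * delta^(l+1) (a^(l))^T
        Theta2 -= alpha * np.outer(delta3, a2)
        Theta1 -= alpha * np.outer(delta2, a1)
```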

SLIDE 21

Activation - sigmoid

  • Partial derivative

$g(z) = \frac{1}{1 + e^{-z}}$

Slide credit: Hugo Larochelle

$g'(z) = g(z)\left(1 - g(z)\right)$

SLIDE 22

Activation - hyperbolic tangent (tanh)

$g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

Slide credit: Hugo Larochelle

  • Partial derivative

$g'(z) = 1 - g(z)^2$

SLIDE 23

Activation - rectified linear (ReLU)

Slide credit: Hugo Larochelle

$g(z) = \mathrm{relu}(z) = \max(0, z)$

  • Partial derivative

$g'(z) = \mathbf{1}_{z > 0}$
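The three derivatives in NumPy, sanity-checked against a finite-difference approximation (the check itself is a generic technique, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1.0 - g)           # g'(z) = g(z)(1 - g(z))

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2   # g'(z) = 1 - g(z)^2

def relu_prime(z):
    return float(z > 0)            # g'(z) = 1 if z > 0 else 0

z, eps = 0.7, 1e-6
pairs = [(sigmoid, sigmoid_prime), (np.tanh, tanh_prime),
         (lambda t: max(0.0, t), relu_prime)]
for g, g_prime in pairs:
    numeric = (g(z + eps) - g(z - eps)) / (2 * eps)   # finite difference
    print(g_prime(z), numeric)                        # should agree closely
```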

SLIDE 24

Initialization

  • For biases
    • Initialize all to 0
  • For weights
    • Can’t initialize all weights to the same value
      • we can show that all hidden units in a layer would always behave identically
      • need to break symmetry
    • Recipe: sample from $U[-b, b]$
      • the idea is to sample around 0 but break symmetry

Slide credit: Hugo Larochelle
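A sketch of the $U[-b, b]$ recipe; the particular choice $b = \sqrt{6}/\sqrt{s_{\text{in}} + s_{\text{out}}}$ is a common heuristic and is assumed here, since the slide only says to sample uniformly around 0:

```python
import numpy as np

def init_layer(s_in, s_out, rng):
    """Biases to 0; weights from U[-b, b] to break symmetry."""
    b = np.sqrt(6.0) / np.sqrt(s_in + s_out)   # assumed heuristic for b
    W = rng.uniform(-b, b, size=(s_out, s_in))
    bias = np.zeros(s_out)
    return W, bias

rng = np.random.default_rng(0)
W, bias = init_layer(3, 4, rng)
print(W.shape, bias)   # (4, 3) weights, zero biases
```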

SLIDE 25

Putting it together

Pick a network architecture

Slide credit: Hugo Larochelle

  • No. of input units: dimension of features
  • No. of output units: number of classes
  • Reasonable default: 1 hidden layer; if >1 hidden layer, use the same no. of hidden units in every layer (usually the more the better)
  • Grid search over these choices (a quick sketch follows)
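A toy grid-search sketch; `validation_error` is a hypothetical stand-in for training a configuration and measuring its validation error:

```python
from itertools import product

def validation_error(n_hidden_layers, n_units):
    # hypothetical stand-in for "train this config, then measure val error"
    return abs(n_units - 64) / 64 + 0.05 * n_hidden_layers

grid = product([1, 2, 3], [16, 32, 64, 128])     # layers x units per layer
best = min(grid, key=lambda cfg: validation_error(*cfg))
print(best)                                       # (1, 64) under this toy score
```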
SLIDE 26

Putting it together

Early stopping

Slide credit: Hugo Larochelle

  • Use validation set performance to select the best configuration
  • To select the number of epochs, stop training when validation set error increases (a minimal sketch follows)
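A minimal early-stopping sketch over a simulated validation-error curve; the numbers are invented:

```python
# Simulated validation errors per epoch: they fall, then rise (overfitting).
val_errors = [0.90, 0.60, 0.45, 0.38, 0.35, 0.36, 0.40, 0.47]

best_err, best_epoch = float("inf"), 0
for epoch, err in enumerate(val_errors):
    if err >= best_err:
        break                       # validation error increased: stop training
    best_err, best_epoch = err, epoch
print(best_epoch, best_err)         # epoch 4, error 0.35
```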

SLIDE 27

Other tricks of the trade

Slide credit: Hugo Larochelle

  • Normalizing your (real-valued) data
  • Decaying the learning rate
    • as we get closer to the optimum, it makes sense to take smaller update steps
  • Mini-batches
    • can give a more accurate estimate of the gradient of the risk
  • Momentum
    • can use an exponential average of previous gradients (sketch below)
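Learning-rate decay and momentum together in one short sketch, on a toy quadratic objective; all hyperparameters are illustrative choices:

```python
import numpy as np

# Toy objective: f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([5.0, -3.0])
velocity = np.zeros_like(theta)            # exponential average of gradients
alpha0, decay, mu = 0.1, 0.01, 0.9         # illustrative hyperparameters

for t in range(1, 201):
    grad = theta                           # stand-in for a mini-batch gradient
    alpha = alpha0 / (1.0 + decay * t)     # decay the learning rate over time
    velocity = mu * velocity + grad        # momentum update
    theta = theta - alpha * velocity       # smaller steps as t grows

print(theta)                               # close to the minimum at (0, 0)
```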
SLIDE 28

Dropout

Slide credit: Hugo Larochelle

  • Idea: β€œcripple” the neural network by removing hidden units
    • each hidden unit is set to 0 with probability 0.5
    • hidden units cannot co-adapt to other units
    • hidden units must be more generally useful (sketch below)
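A sketch of the dropout mask at training time; note that common implementations also rescale the kept units by $1/(1-p)$ (β€œinverted dropout”), a detail the slide doesn’t mention:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5):
    """Set each hidden unit to 0 with probability p (at training time)."""
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask

h = np.array([0.2, 1.3, 0.7, 2.1, 0.05])   # toy hidden-unit activations
print(dropout(h))                           # about half the units are zeroed
```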