
SLIDE 1

Neural Net Backpropagation

3/20/17

SLIDE 2

Recall: Limitations of Perceptrons

AND and OR are linearly separable. XOR isn’t
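One way to see this concretely is a brute-force search (a sketch, not from the slides) over a small grid of weights: a single linear threshold unit can be found for AND and OR, but no weights realize XOR.

```python
import itertools

def perceptron_output(w1, w2, b, x1, x2):
    """Single linear threshold unit: fires iff w1*x1 + w2*x2 + b >= 0."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

def representable(target):
    """Brute-force search over a small weight grid for weights that
    realize the given truth table on the four Boolean inputs."""
    grid = [v / 2 for v in range(-4, 5)]  # -2.0 .. 2.0 in steps of 0.5
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all(perceptron_output(w1, w2, b, x1, x2) == target[(x1, x2)]
               for x1, x2 in itertools.product([0, 1], repeat=2)):
            return True
    return False

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR  = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

print(representable(AND), representable(OR), representable(XOR))  # → True True False
```

The failure for XOR is not an artifact of the coarse grid: no line can put (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.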

SLIDE 3

What is the output of the network?

f(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases} \qquad f(x) = \frac{1}{1 + e^{-x}} \qquad f(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}

SLIDE 4

How can we train these networks?

Two reasons the perceptron algorithm won’t work:

  • 1. Non-threshold activation functions.
  • 2. Multiple layers (what’s the correction for hidden nodes?).

Key idea: stochastic gradient descent (SGD).

  • Compute the error on a random training example.
  • Compute the derivative of the error with respect to each weight.

  • Update weights in the direction that reduces error.
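The three steps above can be sketched on a toy one-weight model (the data, model, and learning rate here are illustrative assumptions, not from the slides):

```python
import random

random.seed(0)
# Toy data generated from y = 2x, so the ideal weight is 2.0 (an assumption).
data = [(x, 2 * x) for x in [0.5, 1.0, 1.5, 2.0]]

w = 0.0          # single weight; model: prediction o = w * x
alpha = 0.1      # learning rate

for step in range(200):
    x, t = random.choice(data)   # 1. pick a random training example
    o = w * x
    dE_dw = -2 * (t - o) * x     # 2. derivative of E = (t - o)^2 w.r.t. w
    w -= alpha * dE_dw           # 3. step in the direction that reduces error

print(round(w, 3))  # → 2.0
```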
SLIDE 5

Problem: SGD on threshold functions

  • The derivative of this function is always 0.
  • We can’t “move in the direction of the gradient”.
SLIDE 6

Better Activation Functions

sigmoid: \sigma(x) = \frac{1}{1 + e^{-x}} \qquad tanh: \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} \qquad RELU: \mathrm{RELU}(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}
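In code, the three activations are one-liners (a minimal sketch; `tanh` is written out to mirror the formula above rather than calling `math.tanh`):

```python
import math

def sigmoid(x):
    # 1 / (1 + e^-x), squashes to (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # (1 - e^-2x) / (1 + e^-2x), squashes to (-1, 1)
    return (1 - math.exp(-2 * x)) / (1 + math.exp(-2 * x))

def relu(x):
    # 0 for negative inputs, identity for non-negative inputs
    return x if x >= 0 else 0.0

print(sigmoid(0), tanh(0), relu(-3), relu(3))  # → 0.5 0.0 0.0 3
```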

SLIDE 7

Derivatives of Activation Functions

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))

\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}, \qquad \frac{d\tanh(x)}{dx} = 1 - \tanh^2(x)

\mathrm{RELU}(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}, \qquad \frac{d\,\mathrm{RELU}(x)}{dx} = \begin{cases} 0, & x \le 0 \\ 1, & x > 0 \end{cases}
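The sigmoid and tanh identities can be checked numerically with central differences (a sketch; the test point x = 0.3 is arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numeric_deriv(f, x, h=1e-6):
    # Central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.3
s = sigmoid(x)
# d(sigma)/dx = sigma(x) * (1 - sigma(x))
assert abs(numeric_deriv(sigmoid, x) - s * (1 - s)) < 1e-8
# d(tanh)/dx = 1 - tanh(x)^2
assert abs(numeric_deriv(math.tanh, x) - (1 - math.tanh(x) ** 2)) < 1e-8
print("derivative identities check out")
```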

SLIDE 8

Error Gradient

  • Define the training error as half the squared difference between a node’s output and the target:

E(\vec{w}, \vec{x}) = \tfrac{1}{2}(t - o)^2

  • Compute the gradient of the error with respect to each weight. For a sigmoid node,

o = \frac{1}{1 + e^{-\vec{w} \cdot \vec{x}}}, \qquad \vec{w} \cdot \vec{x} = \sum_i w_i x_i

… algebra ensues …

\frac{\partial E}{\partial w_i} = -o(1 - o)(t - o)\,x_i
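The analytic gradient can be verified against a numerical one (a sketch; the weights, inputs, and target below are illustrative assumptions, and the error uses E = ½(t − o)², the convention under which the gradient formula is exact):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(w, x, t):
    # E = 1/2 (t - o)^2 for a single sigmoid node
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (t - o) ** 2

w, x, t = [1.0, -1.0], [2.0, 1.2], 0.9   # assumed example values
o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# Analytic gradient from the slide: dE/dwi = -o(1 - o)(t - o) xi
analytic = [-o * (1 - o) * (t - o) * xi for xi in x]

# Central-difference numerical gradient, one weight at a time
h = 1e-6
for i in range(len(w)):
    w_plus  = [wj + (h if j == i else 0) for j, wj in enumerate(w)]
    w_minus = [wj - (h if j == i else 0) for j, wj in enumerate(w)]
    numeric = (error(w_plus, x, t) - error(w_minus, x, t)) / (2 * h)
    assert abs(numeric - analytic[i]) < 1e-8
print("analytic gradient matches numerical gradient")
```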

SLIDE 9

Output Node Gradient Descent Step

w_i \mathrel{+}= -\alpha \frac{\partial E}{\partial w_i}

For a sigmoid output node:

w_i \mathrel{+}= \alpha\,o(1 - o)(t - o)\,x_i

Example with \alpha = .5, t = .9, o = .7, and inputs x_0 = 2, x_1 = 1.2:

w_0 \mathrel{+}= .5 \cdot .7(1 - .7)(.9 - .7) \cdot 2 \;\rightarrow\; w_0 = 1.04
w_1 \mathrel{+}= .5 \cdot .7(1 - .7)(.9 - .7) \cdot 1.2 \;\rightarrow\; w_1 = -.97
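The arithmetic in this example can be reproduced directly (the starting weights w0 = 1.0 and w1 = −1.0 are assumptions inferred from the resulting values, not stated on the slide):

```python
alpha, t, o = 0.5, 0.9, 0.7
x = [2.0, 1.2]      # inputs to the output node
w = [1.0, -1.0]     # assumed starting weights, consistent with the slide's results

# Gradient descent step for a sigmoid output node: wi += alpha * o(1-o)(t-o) * xi
for i in range(2):
    w[i] += alpha * o * (1 - o) * (t - o) * x[i]

print([round(wi, 2) for wi in w])  # → [1.04, -0.97]
```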

SLIDE 10

What about hidden layers?

  • Use the chain rule to compute error derivatives for previous layers.
  • This turns out to be much easier than it sounds.

Let \delta_k be the error term we computed for output node k. The error for hidden node h comes from the sum of its contributions to the errors of each output node. For sigmoid nodes:

\delta_k = o_k(1 - o_k)(t_k - o_k), \qquad \text{hidden contribution: } \sum_{k \in \text{output}} w_{hk}\,\delta_k

SLIDE 11

Hidden Node Gradient Descent Step

  • Compute the node’s contribution to next-layer errors:

\delta_h = o_h(1 - o_h) \sum_{k \in \text{next layer}} w_{hk}\,\delta_k

  • Update incoming weights using \delta_h as the error:

w_i \mathrel{+}= \alpha\,\delta_h\,x_i
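A worked instance of the two steps (every number here is hypothetical, chosen only to illustrate the formulas):

```python
# Hypothetical numbers: one hidden node h feeding two output nodes.
o_h = 0.6                    # hidden node's output
w_hk = [0.5, -0.3]           # weights from h to each output node k
delta_k = [0.042, -0.021]    # output-node error terms delta_k = ok(1-ok)(tk-ok)
x = [1.0, 0.8]               # inputs feeding node h
alpha = 0.5

# Step 1: delta_h = o_h (1 - o_h) * sum over next layer of w_hk * delta_k
delta_h = o_h * (1 - o_h) * sum(w * d for w, d in zip(w_hk, delta_k))

# Step 2: update each incoming weight, w_i += alpha * delta_h * x_i
updates = [alpha * delta_h * xi for xi in x]
print(delta_h, updates)
```

Note that a hidden node's error term blends the downstream errors, weighted by how strongly the node feeds each one.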

SLIDE 12

Backpropagation Algorithm

for 1..training_runs:
    for example in shuffled training data:
        run example through network
        compute error for each output node
        for each layer (starting from output):
            for each node in layer:
                gradient descent update on incoming weights
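The pseudocode above can be fleshed out into a minimal runnable sketch that trains a tiny network on XOR; the architecture (2 inputs, 3 sigmoid hidden nodes, 1 output), learning rate, and epoch count are assumptions, not from the slides:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(1)
n_hidden = 3
# w_hidden[i] = weights into hidden node i: [w_x0, w_x1, bias]
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]
# w_out = weights into the output node: [w_h0, .., w_h(n-1), bias]
w_out = [random.uniform(-1, 1) for _ in range(n_hidden + 1)]
alpha = 0.5

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    o = sigmoid(sum(w_out[i] * h[i] for i in range(n_hidden)) + w_out[-1])
    return h, o

def total_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in data)

err_before = total_error()
for epoch in range(5000):
    random.shuffle(data)                 # shuffled training data
    for x, t in data:
        h, o = forward(x)                # run example through network
        delta_o = o * (1 - o) * (t - o)  # output-node error term
        delta_h = [h[i] * (1 - h[i]) * w_out[i] * delta_o  # hidden error terms
                   for i in range(n_hidden)]
        for i in range(n_hidden):        # update output-node incoming weights
            w_out[i] += alpha * delta_o * h[i]
        w_out[-1] += alpha * delta_o     # bias input is fixed at 1
        for i in range(n_hidden):        # update hidden-node incoming weights
            for j in range(2):
                w_hidden[i][j] += alpha * delta_h[i] * x[j]
            w_hidden[i][2] += alpha * delta_h[i]

print(round(err_before, 3), "->", round(total_error(), 3))
```

Running it should show the total squared error dropping sharply from its initial value as the network learns XOR.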

SLIDE 13

Example Backpropagation Update

Output-node update (sigmoid): w_i \mathrel{+}= \alpha\,o(1 - o)(t - o)\,x_i

Hidden-node error: \delta_h = o_h(1 - o_h) \sum_{k \in \text{next layer}} w_{hk}\,\delta_k

\sigma(x) = \frac{1}{1 + e^{-x}}