

  1. Neural Net Backpropagation 3/20/17

  2. Recall: Limitations of Perceptrons • AND and OR are linearly separable; XOR isn't.

  3. What is the output of the network? The candidate activation functions:
  $$f(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases} \qquad f(x) = \frac{1}{1 + e^{-x}} \qquad f(x) = \begin{cases} 0 & x < 0 \\ x & x \ge 0 \end{cases}$$

  4. How can we train these networks? Two reasons the perceptron algorithm won’t work: 1. Non-threshold activation functions. 2. Multiple layers (what’s the correction for hidden nodes?). Key idea: stochastic gradient descent (SGD). • Compute the error on a random training example. • Compute the derivative of the error with respect to each weight. • Update weights in the direction that reduces error.
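A minimal sketch of the SGD loop described above, on a single linear node. The quadratic error $E = (t - o)^2$ matches slide 8; the toy data and learning rate are illustrative assumptions, not from the slides.

    import random

    # Minimal SGD sketch: pick a random training example, compute the gradient
    # of the error on that example, and move each weight opposite the gradient.
    data = [([1.0, 2.0], 1.0), ([2.0, 1.0], 0.0)]   # illustrative (inputs, target) pairs
    w = [0.0, 0.0]
    alpha = 0.05

    for _ in range(1000):
        x, t = random.choice(data)                   # error on a random example
        o = sum(wi * xi for wi, xi in zip(w, x))     # linear node, for simplicity
        grad = [-2.0 * (t - o) * xi for xi in x]     # dE/dw_i for E = (t - o)^2
        w = [wi - alpha * gi for wi, gi in zip(w, grad)]   # step to reduce the error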

  5. Problem: SGD on threshold functions • The derivative of the threshold function is 0 everywhere it is defined. • We can’t “move in the direction of the gradient”.

  6. Better Activation Functions
  sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
  tanh: $\tanh(x) = \dfrac{1 - e^{-2x}}{1 + e^{-2x}}$
  ReLU: $\mathrm{ReLU}(x) = \begin{cases} 0 & x < 0 \\ x & x \ge 0 \end{cases}$

  7. Derivatives of Activation Functions
  sigmoid: $\dfrac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$
  tanh: $\dfrac{d\tanh(x)}{dx} = 1 - \tanh^2(x)$
  ReLU: $\dfrac{d\,\mathrm{ReLU}(x)}{dx} = \begin{cases} 0 & x \le 0 \\ 1 & x > 0 \end{cases}$
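A sketch of these activations and their derivatives in Python (the function names are my own):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def d_sigmoid(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    def tanh(x):
        return (1.0 - math.exp(-2.0 * x)) / (1.0 + math.exp(-2.0 * x))

    def d_tanh(x):
        return 1.0 - tanh(x) ** 2

    def relu(x):
        return x if x >= 0 else 0.0

    def d_relu(x):
        # the slides take the derivative to be 0 at x = 0
        return 1.0 if x > 0 else 0.0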

  8. Error Gradient
  • Define training error as the squared difference between a node’s output and the target: $E(\vec{w}, \vec{x}) = (t - o)^2$
  • Compute the gradient of the error with respect to each weight, for a sigmoid node $o = \dfrac{1}{1 + e^{-\vec{w}\cdot\vec{x}}}$, where $\vec{w}\cdot\vec{x} = \sum_i w_i x_i$:
  … algebra ensues (filled in below) …
  $\dfrac{\partial E}{\partial w_i} = -\,o\,(1 - o)\,(t - o)\,x_i$
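Filling in the elided algebra: a sketch using the chain rule and the sigmoid derivative from slide 7.

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}(t - o)^2 = -2\,(t - o)\,\frac{\partial o}{\partial w_i}, \qquad \frac{\partial o}{\partial w_i} = \sigma'(\vec{w}\cdot\vec{x})\,x_i = o\,(1 - o)\,x_i,$$
$$\Rightarrow\quad \frac{\partial E}{\partial w_i} = -2\,o\,(1 - o)\,(t - o)\,x_i.$$

The constant factor of 2 is conventionally dropped (equivalently, absorbed into the learning rate $\alpha$), which gives the form on the slide.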

  9. Output Node Gradient Descent Step
  For a sigmoid output node with learning rate $\alpha = 0.5$:
  $w_i \mathrel{+}= -\,\alpha\,\dfrac{\partial E}{\partial w_i} = \alpha\,o\,(1 - o)\,(t - o)\,x_i$
  Example with output $o = 0.7$, target $t = 0.9$, and inputs $x_0 = 2$, $x_1 = 1.2$:
  $w_0 \mathrel{+}= 0.5 \cdot 0.7\,(1 - 0.7)\,(0.9 - 0.7) \cdot 2 \;\rightarrow\; w_0 = 1.04$
  $w_1 \mathrel{+}= 0.5 \cdot 0.7\,(1 - 0.7)\,(0.9 - 0.7) \cdot 1.2 \;\rightarrow\; w_1 = -0.97$
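A quick check of the arithmetic above in Python. The starting weights are not shown on the slide; $w_0 = 1.0$ and $w_1 = -1.0$ are assumed here because they round to the slide's results after the update.

    alpha = 0.5
    o, t = 0.7, 0.9                # node output and target (from the slide)
    x = [2.0, 1.2]                 # inputs into the output node (from the slide)
    w = [1.0, -1.0]                # assumed starting weights (not on the slide)

    delta = o * (1 - o) * (t - o)  # output-node error term
    w = [w_i + alpha * delta * x_i for w_i, x_i in zip(w, x)]
    print(w)                       # [1.042, -0.9748] -> rounds to 1.04 and -0.97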

  10. What about hidden layers?
  • Use the chain rule to compute error derivatives for previous layers.
  • This turns out to be much easier than it sounds.
  Let $\delta_k$ be the error term we computed for (sigmoid) output node $k$: $\delta_k = o_k\,(1 - o_k)\,(t_k - o_k)$
  The error for hidden node $h$ comes from the sum of its contributions to the errors of the output nodes: $\sum_{k \in \text{output}} w_{hk}\,\delta_k$

  11. Hidden Node Gradient Descent Step
  • Compute the hidden node’s error term from its contributions to the next-layer errors: $\delta_h = o_h\,(1 - o_h) \sum_{k \in \text{next layer}} w_{hk}\,\delta_k$
  • Update the incoming weights using $\delta_h$ as the error: $w_i \mathrel{+}= \alpha\,\delta_h\,x_i$
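A minimal sketch of the hidden-node step in Python. The hidden output, weights, and output-node error terms below are illustrative values, not taken from the slides.

    alpha = 0.5
    o_h = 0.6                 # hidden node's output (illustrative)
    x = [1.0, 0.5]            # inputs feeding the hidden node (illustrative)
    w_hk = [0.4, -0.3]        # weights from hidden node h to the output nodes
    delta_k = [0.042, -0.01]  # error terms already computed for the output nodes

    # delta_h = o_h (1 - o_h) * sum_k w_hk * delta_k
    delta_h = o_h * (1 - o_h) * sum(w * d for w, d in zip(w_hk, delta_k))

    # update the hidden node's incoming weights using delta_h as the error
    w_in = [0.2, -0.1]        # incoming weights (illustrative)
    w_in = [w_i + alpha * delta_h * x_i for w_i, x_i in zip(w_in, x)]
    print(delta_h, w_in)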

  12. Backpropagation Algorithm
  for each training run:
      for each example in shuffled training data:
          run the example through the network
          compute the error term for each output node
          for each layer (starting from the output layer):
              for each node in the layer:
                  gradient descent update on the node's incoming weights
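A runnable sketch of this loop for a single hidden layer, using the sigmoid error terms from slides 8-11. The network size, learning rate, bias handling, and toy XOR data are illustrative choices, not from the slides.

    import random
    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    class Net:
        def __init__(self, n_in, n_hidden, n_out, alpha=0.5):
            self.alpha = alpha
            # weights[j][i] is the weight into node j from input i; the last
            # entry of each row is a bias weight (an added convenience)
            self.w_hidden = [[random.uniform(-1, 1) for _ in range(n_in + 1)]
                             for _ in range(n_hidden)]
            self.w_out = [[random.uniform(-1, 1) for _ in range(n_hidden + 1)]
                          for _ in range(n_out)]

        def forward(self, x):
            x_b = x + [1.0]                              # append bias input
            h = [sigmoid(sum(wi * xi for wi, xi in zip(w, x_b)))
                 for w in self.w_hidden]
            h_b = h + [1.0]
            o = [sigmoid(sum(wi * hi for wi, hi in zip(w, h_b)))
                 for w in self.w_out]
            return h, o

        def train_example(self, x, t):
            h, o = self.forward(x)
            # output-node error terms: delta_k = o_k (1 - o_k)(t_k - o_k)
            delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # hidden-node error terms: delta_h = o_h (1 - o_h) sum_k w_hk delta_k
            delta_h = [h[j] * (1 - h[j]) *
                       sum(self.w_out[k][j] * delta_o[k] for k in range(len(o)))
                       for j in range(len(h))]
            # gradient descent update on incoming weights, output layer first
            h_b = h + [1.0]
            for k in range(len(o)):
                for j in range(len(h_b)):
                    self.w_out[k][j] += self.alpha * delta_o[k] * h_b[j]
            x_b = x + [1.0]
            for j in range(len(h)):
                for i in range(len(x_b)):
                    self.w_hidden[j][i] += self.alpha * delta_h[j] * x_b[i]

    # Toy usage: XOR, which a single perceptron cannot learn (slide 2). With
    # this random initialization it usually converges, though not always.
    data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
            ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
    net = Net(2, 3, 1)
    for _ in range(5000):                 # training runs
        random.shuffle(data)
        for x, t in data:
            net.train_example(x, t)
    for x, t in data:
        print(x, round(net.forward(x)[1][0], 2))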

  13. Example Backpropagation Update
  $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
  Output node: $w_i \mathrel{+}= \alpha\,o\,(1 - o)\,(t - o)\,x_i$
  Hidden node: $\delta_h = o_h\,(1 - o_h) \sum_{k \in \text{next layer}} w_{hk}\,\delta_k$
