

  1. Neural Networks II Chen Gao Virginia Tech Spring 2019 ECE-5424G / CS-5824

  2. Neural Networks • Origins: Algorithms that try to mimic the brain. What is this?

  3. A single neuron in the brain [figure: Input → Output] Slide credit: Andrew Ng

  4. An artificial neuron: Logistic unit
     • "Input" $x = [x_0, x_1, x_2, x_3]^\top$ with "bias unit" $x_0 = 1$
     • "Weights" / "Parameters" $\theta = [\theta_0, \theta_1, \theta_2, \theta_3]^\top$
     • "Output" $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
     • Sigmoid (logistic) activation function
     Slide credit: Andrew Ng
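
A minimal NumPy sketch of this logistic unit, with made-up weights and inputs just to show the computation $h_\theta(x) = 1/(1 + e^{-\theta^\top x})$:

```python
import numpy as np

def logistic_unit(x, theta):
    """Single artificial neuron: h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    x = np.concatenate(([1.0], x))          # prepend the bias unit x_0 = 1
    return 1.0 / (1.0 + np.exp(-theta @ x))

# Hypothetical parameters and input; theta[0] is the weight on the bias unit.
theta = np.array([-1.0, 0.5, 0.5, 0.5])     # [theta_0, theta_1, theta_2, theta_3]
x = np.array([1.0, 2.0, 3.0])               # [x_1, x_2, x_3]
print(logistic_unit(x, theta))              # a value in (0, 1)
```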

  5. Visualization of weights, bias, activation function
     • the range of the activation is determined by $g(\cdot)$
     • the bias $b$ only changes the position of the hyperplane
     Slide credit: Hugo Larochelle

  6. Activation - sigmoid
     • Squashes the neuron's pre-activation between 0 and 1
     • Always positive
     • Bounded
     • Strictly increasing
     $g(x) = \dfrac{1}{1 + e^{-x}}$
     Slide credit: Hugo Larochelle

  7. Activation - hyperbolic tangent (tanh)
     • Squashes the neuron's pre-activation between -1 and 1
     • Can be positive or negative
     • Bounded
     • Strictly increasing
     $g(x) = \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
     Slide credit: Hugo Larochelle

  8. Activation - rectified linear (ReLU)
     • Bounded below by 0 (always non-negative)
     • Not upper bounded
     • Tends to give neurons with sparse activities
     $g(x) = \mathrm{relu}(x) = \max(0, x)$
     Slide credit: Hugo Larochelle

  9. Activation - softmax
     • For multi-class classification:
       • we need multiple outputs (1 output per class)
       • we would like to estimate the conditional probability $p(y = c \mid x)$
     • We use the softmax activation function at the output:
     $g(x) = \mathrm{softmax}(x) = \left[ \dfrac{e^{x_1}}{\sum_c e^{x_c}}, \ldots, \dfrac{e^{x_C}}{\sum_c e^{x_c}} \right]^\top$
     Slide credit: Hugo Larochelle
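
The four activations from slides 6-9 as a small NumPy sketch; subtracting the max inside the softmax is a standard numerical-stability trick that is not on the slides:

```python
import numpy as np

def sigmoid(x):
    # squashes pre-activations into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes pre-activations into (-1, 1)
    return np.tanh(x)

def relu(x):
    # non-negative, not upper bounded, tends to give sparse activities
    return np.maximum(0.0, x)

def softmax(x):
    # turns a score vector into class probabilities p(y = c | x)
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), sep="\n")
```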

  10. Universal approximation theorem
     "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units" (Hornik, 1991)
     Slide credit: Hugo Larochelle

  11. Neural network – Multilayer
     [Network diagram] Layer 1 (input): $x_0, x_1, x_2, x_3$ → Layer 2 (hidden): $a_0^{(2)}, a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$ → Layer 3 ("Output"): $h_\Theta(x)$
     Slide credit: Andrew Ng

  12. Neural network
     $a_i^{(j)}$ = "activation" of unit $i$ in layer $j$
     $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$
     $a_1^{(2)} = g\!\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right)$
     $a_2^{(2)} = g\!\left(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\right)$
     $a_3^{(2)} = g\!\left(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\right)$
     $h_\Theta(x) = g\!\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right)$
     Size of $\Theta^{(j)}$? If the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ has dimension $s_{j+1} \times (s_j + 1)$.
     Slide credit: Andrew Ng

  13. Neural network: "Pre-activation"
     $z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$,  $x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}$
     Why do we need $g(\cdot)$?
     $a_1^{(2)} = g\!\left(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3\right) = g(z_1^{(2)})$
     $a_2^{(2)} = g\!\left(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3\right) = g(z_2^{(2)})$
     $a_3^{(2)} = g\!\left(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3\right) = g(z_3^{(2)})$
     $h_\Theta(x) = g\!\left(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}\right) = g(z^{(3)})$
     Slide credit: Andrew Ng

  14. Neural network: "Pre-activation"
     $a_1^{(2)} = g(z_1^{(2)})$, $a_2^{(2)} = g(z_2^{(2)})$, $a_3^{(2)} = g(z_3^{(2)})$
     Vectorized:
     $z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$
     $a^{(2)} = g(z^{(2)})$
     Add $a_0^{(2)} = 1$
     $z^{(3)} = \Theta^{(2)} a^{(2)}$
     $h_\Theta(x) = a^{(3)} = g(z^{(3)})$
     Slide credit: Andrew Ng

  15. Flow graph – Forward propagation
     $x \;\xrightarrow{\,W^{(1)},\,b^{(1)}\,}\; z^{(2)} \to a^{(2)} \;\xrightarrow{\,W^{(2)},\,b^{(2)}\,}\; z^{(3)} \to a^{(3)} = h_\Theta(x)$
     How do we evaluate our prediction?
     $z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$
     $a^{(2)} = g(z^{(2)})$; add $a_0^{(2)} = 1$
     $z^{(3)} = \Theta^{(2)} a^{(2)}$
     $h_\Theta(x) = a^{(3)} = g(z^{(3)})$
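
A vectorized forward pass matching slides 14 and 15, written as a NumPy sketch; the layer sizes (3 inputs, 3 hidden units, 1 output) and random weights are assumptions made only to keep it runnable:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward propagation for a 3-layer network (input, one hidden layer, output)."""
    a1 = np.concatenate(([1.0], x))             # a^(1) = x with bias unit added
    z2 = Theta1 @ a1                            # z^(2) = Theta^(1) a^(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # a^(2) = g(z^(2)), add a_0^(2) = 1
    z3 = Theta2 @ a2                            # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                          # h_Theta(x) = a^(3) = g(z^(3))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))                # s_2 x (s_1 + 1) = 3 x 4
Theta2 = rng.normal(size=(1, 4))                # s_3 x (s_2 + 1) = 1 x 4
x = np.array([1.0, 2.0, 3.0])
print(forward(x, Theta1, Theta2))
```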

  16. Cost function
     Logistic regression:
     $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
     Neural network ($K$ output units):
     $J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\!\left(h_\Theta(x^{(i)})\right)_k + \left(1 - y_k^{(i)}\right) \log\!\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2$
     Slide credit: Andrew Ng
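
A sketch of evaluating this neural-network cost in NumPy, assuming the cross-entropy form above; the predictions, labels, and the small epsilon guard against log(0) are illustrative choices, not from the slide:

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam=0.0):
    """Cross-entropy cost J(Theta) over m examples plus L2 penalty on non-bias weights.

    H: (m, K) predicted outputs h_Theta(x^(i)); Y: (m, K) 0/1 labels;
    Thetas: list of weight matrices whose first column is the bias column.
    """
    m = Y.shape[0]
    eps = 1e-12  # avoid log(0)
    data_term = -np.sum(Y * np.log(H + eps) + (1 - Y) * np.log(1 - H + eps)) / m
    reg_term = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)  # skip bias column
    return data_term + reg_term

# Hypothetical predictions/labels for m = 2 examples and K = 2 output units.
H = np.array([[0.9, 0.1], [0.2, 0.7]])
Y = np.array([[1, 0], [0, 1]])
print(nn_cost(H, Y, Thetas=[], lam=0.0))
```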

  17. Gradient computation
     Need to compute: $J(\Theta)$ and $\dfrac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$
     Slide credit: Andrew Ng

  18. Gradient computation
     Given one training example $(x, y)$:
     $a^{(1)} = x$
     $z^{(2)} = \Theta^{(1)} a^{(1)}$
     $a^{(2)} = g(z^{(2)})$ (add $a_0^{(2)}$)
     $z^{(3)} = \Theta^{(2)} a^{(2)}$
     $a^{(3)} = g(z^{(3)})$ (add $a_0^{(3)}$)
     $z^{(4)} = \Theta^{(3)} a^{(3)}$
     $a^{(4)} = g(z^{(4)}) = h_\Theta(x)$
     Slide credit: Andrew Ng

  19. Gradient computation: Backpropagation
     Intuition: $\delta_j^{(l)}$ = "error" of node $j$ in layer $l$
     For each output unit (layer $L = 4$): $\delta^{(4)} = a^{(4)} - y$
     $\delta^{(3)} = \delta^{(4)} \, \dfrac{\partial \delta^{(4)}}{\partial z^{(3)}} = \delta^{(4)} \, \dfrac{\partial \delta^{(4)}}{\partial a^{(4)}} \dfrac{\partial a^{(4)}}{\partial z^{(4)}} \dfrac{\partial z^{(4)}}{\partial a^{(3)}} \dfrac{\partial a^{(3)}}{\partial z^{(3)}} = 1 \cdot \left(\Theta^{(3)}\right)^{\!\top} \delta^{(4)} \,\text{.*}\, g'(z^{(4)}) \,\text{.*}\, g'(z^{(3)})$
     (using $z^{(3)} = \Theta^{(2)} a^{(2)}$, $a^{(3)} = g(z^{(3)})$, $z^{(4)} = \Theta^{(3)} a^{(3)}$, $a^{(4)} = g(z^{(4)})$)
     Slide credit: Andrew Ng

  20. Backpropagation algorithm
     Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
     Set $\Delta^{(l)} = 0$ for all $l$
     For $i = 1$ to $m$:
       Set $a^{(1)} = x^{(i)}$
       Perform forward propagation to compute $a^{(l)}$ for $l = 2, \ldots, L$
       Use $y^{(i)}$ to compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
       Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$
       $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \left(a^{(l)}\right)^{\!\top}$
     Slide credit: Andrew Ng
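
A runnable NumPy sketch of one backpropagation update for a 3-layer sigmoid network (rather than the 4-layer example above); the learning rate, layer sizes, and random weights are assumptions, and the update is applied per example instead of being accumulated over the whole training set:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, Theta1, Theta2, alpha=0.1):
    """One stochastic gradient step for a 3-layer network with sigmoid units."""
    # Forward propagation
    a1 = np.concatenate(([1.0], x))               # a^(1) with bias unit
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))     # a^(2) with bias unit
    z3 = Theta2 @ a2
    a3 = sigmoid(z3)                              # a^(3) = h_Theta(x)

    # Backpropagation of the "errors"
    delta3 = a3 - y                                                        # delta^(3) = a^(3) - y
    delta2 = (Theta2[:, 1:].T @ delta3) * sigmoid(z2) * (1 - sigmoid(z2))  # delta^(2)

    # Gradients and parameter update
    Theta2 -= alpha * np.outer(delta3, a2)        # dJ/dTheta^(2) = delta^(3) (a^(2))^T
    Theta1 -= alpha * np.outer(delta2, a1)        # dJ/dTheta^(1) = delta^(2) (a^(1))^T
    return Theta1, Theta2

rng = np.random.default_rng(0)
Theta1, Theta2 = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))
Theta1, Theta2 = backprop_step(np.array([1.0, 2.0, 3.0]), np.array([1.0]), Theta1, Theta2)
```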

  21. Activation - sigmoid
     • Partial derivative: $g'(x) = g(x)\,(1 - g(x))$, where $g(x) = \dfrac{1}{1 + e^{-x}}$
     Slide credit: Hugo Larochelle

  22. Activation - hyperbolic tangent (tanh)
     • Partial derivative: $g'(x) = 1 - g(x)^2$, where $g(x) = \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
     Slide credit: Hugo Larochelle

  23. Activation - rectified linear (ReLU)
     • Partial derivative: $g'(x) = 1_{x > 0}$, where $g(x) = \mathrm{relu}(x) = \max(0, x)$
     Slide credit: Hugo Larochelle
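
The three derivatives above as NumPy code, checked against a centered finite difference; the test point and step size are arbitrary, and the finite-difference check itself is not from the slides:

```python
import numpy as np

def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x): return sigmoid(x) * (1.0 - sigmoid(x))  # g'(x) = g(x)(1 - g(x))

def tanh(x):      return np.tanh(x)
def d_tanh(x):    return 1.0 - np.tanh(x) ** 2            # g'(x) = 1 - g(x)^2

def relu(x):      return np.maximum(0.0, x)
def d_relu(x):    return np.where(x > 0, 1.0, 0.0)        # g'(x) = 1 if x > 0 else 0

def finite_diff(g, x, h=1e-6):
    # centered numerical derivative, used only as a sanity check
    return (g(x + h) - g(x - h)) / (2.0 * h)

x = 0.3
for g, dg in [(sigmoid, d_sigmoid), (tanh, d_tanh), (relu, d_relu)]:
    print(g.__name__, dg(x), finite_diff(g, x))
```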

  24. Initialization
     • For biases: initialize all to 0
     • For weights:
       • Can't initialize all weights to the same value
         • we can show that all hidden units in a layer will always behave the same
         • need to break symmetry
       • Recipe: sample from $U[-b, b]$
         • the idea is to sample around 0 but break symmetry
     Slide credit: Hugo Larochelle
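
A sketch of this recipe: zero biases and weights drawn uniformly from $[-b, b]$. The particular bound $b = \sqrt{6/(n_{\text{in}} + n_{\text{out}})}$ is a common Glorot-style assumption, since the slide does not fix $b$:

```python
import numpy as np

def init_layer(n_in, n_out, rng=None):
    """Biases at 0; weights ~ U[-b, b] to break symmetry between hidden units."""
    if rng is None:
        rng = np.random.default_rng(0)
    b = np.sqrt(6.0 / (n_in + n_out))           # assumed bound; the slide only says U[-b, b]
    W = rng.uniform(-b, b, size=(n_out, n_in))  # weights sampled around 0
    bias = np.zeros(n_out)                      # biases initialized to 0
    return W, bias

W, bias = init_layer(n_in=4, n_out=3)
print(W.shape, bias)
```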

  25. Putting it together: pick a network architecture
     • No. of input units: dimension of features
     • No. of output units: number of classes
     • Reasonable default: 1 hidden layer; if >1 hidden layer, use the same no. of hidden units in every layer (usually the more the better)
     • Grid search over these choices
     Slide credit: Hugo Larochelle

  26. Putting it together: early stopping
     • Use validation-set performance to select the best configuration
     • To select the number of epochs, stop training when the validation-set error increases
     Slide credit: Hugo Larochelle
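
A schematic early-stopping loop following this rule (stop as soon as validation error goes up); `train_epoch` and `validation_error` are hypothetical callbacks standing in for your training and evaluation code:

```python
def train_with_early_stopping(model, train_epoch, validation_error, max_epochs=100):
    """Stop training when the validation-set error starts increasing."""
    best_err = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(model)                 # one pass over the training set
        err = validation_error(model)      # evaluate on the held-out validation set
        if err < best_err:
            best_err, best_epoch = err, epoch
        else:
            break                          # validation error went up: stop
    return best_epoch, best_err

# Toy usage with a fake validation curve that bottoms out at epoch 4.
errs = iter([0.9, 0.7, 0.5, 0.4, 0.45, 0.6])
print(train_with_early_stopping(model=None,
                                train_epoch=lambda m: None,
                                validation_error=lambda m: next(errs)))
```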

  27. Other tricks of the trade
     • Normalizing your (real-valued) data
     • Decaying the learning rate
       • as we get closer to the optimum, it makes sense to take smaller update steps
     • Mini-batch updates
       • can give a more accurate estimate of the risk gradient
     • Momentum
       • can use an exponential average of previous gradients
     Slide credit: Hugo Larochelle
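
A sketch combining three of these tricks: mini-batches, a decaying learning rate, and momentum as an exponential average of past gradients. The callback `grad_fn`, the decay schedule, and the toy objective are assumptions for illustration:

```python
import numpy as np

def sgd_momentum(theta, grad_fn, data, lr0=0.1, decay=0.01, beta=0.9,
                 batch_size=32, epochs=20, rng=None):
    """Mini-batch SGD with a decaying learning rate and momentum.

    grad_fn(theta, batch) is a hypothetical callback returning the mini-batch gradient.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    velocity = np.zeros_like(theta)
    step = 0
    for _ in range(epochs):
        idx = rng.permutation(len(data))                     # shuffle into mini-batches
        for start in range(0, len(data), batch_size):
            batch = data[idx[start:start + batch_size]]
            lr = lr0 / (1.0 + decay * step)                  # decaying learning rate
            velocity = beta * velocity + (1 - beta) * grad_fn(theta, batch)  # momentum (EMA of gradients)
            theta = theta - lr * velocity
            step += 1
    return theta

# Toy usage: minimize the mean squared distance of theta to the data points.
data = np.random.default_rng(1).normal(loc=3.0, size=(200, 2))
grad_fn = lambda theta, batch: theta - batch.mean(axis=0)    # gradient of 0.5*||theta - mean||^2
theta = sgd_momentum(np.zeros(2), grad_fn, data)
print(theta)   # ends up close to the data mean, roughly [3, 3]
```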

  28. Dropout
     • Idea: "cripple" the neural network by removing hidden units
       • each hidden unit is set to 0 with probability 0.5
       • hidden units cannot co-adapt to other units
       • hidden units must be more generally useful
     Slide credit: Hugo Larochelle
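
A sketch of applying a dropout mask to a layer's activations during training, with p = 0.5 as on the slide; the "inverted dropout" rescaling by the keep probability is a common convention that the slide does not mention:

```python
import numpy as np

def dropout(activations, p_drop=0.5, train=True, rng=None):
    """Set each hidden unit to 0 with probability p_drop during training.

    Uses inverted dropout: surviving units are scaled by 1/(1 - p_drop) so no
    rescaling is needed at test time (an assumed convention, not from the slide).
    """
    if not train:
        return activations                            # no dropout at test time
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p_drop    # 1 = keep, 0 = drop
    return activations * mask / (1.0 - p_drop)

a = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout(a))                # roughly half the units zeroed out
print(dropout(a, train=False))   # unchanged at test time
```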
