
Optimization and Backpropagation: I2DL Lecture 3 (Prof. Niessner, Prof. Leal-Taixé)



  1. Optimization and Backpropagation

  2. Lecture 3 Recap

  3. Neural Network • Linear score function f = Wx • On CIFAR-10 / on ImageNet (Credit: Li/Karpathy/Johnson)

  4. Neural Network • Linear score function f = Wx • A neural network is a nesting of 'functions':
     – 2 layers: f = W₂ max(0, W₁x)
     – 3 layers: f = W₃ max(0, W₂ max(0, W₁x))
     – 4 layers: f = W₄ tanh(W₃ max(0, W₂ max(0, W₁x)))
     – 5 layers: f = W₅ σ(W₄ tanh(W₃ max(0, W₂ max(0, W₁x))))
     – … up to hundreds of layers
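As an aside (not part of the slides), the 2-layer case f = W₂ max(0, W₁x) transcribes directly into numpy. The array sizes below (CIFAR-10-shaped input, 100 hidden units, 10 classes) are illustrative assumptions, not values fixed by the slide:

```python
import numpy as np

# Minimal sketch of the 2-layer score function f = W2 max(0, W1 x).
rng = np.random.default_rng(0)
x = rng.standard_normal(3072)           # flattened 32x32x3 input image
W1 = rng.standard_normal((100, 3072))   # layer-1 weights
W2 = rng.standard_normal((10, 100))     # layer-2 weights

h = np.maximum(0, W1 @ x)               # hidden activations (ReLU nesting)
f = W2 @ h                              # one score per class
print(f.shape)                          # (10,)
```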

  5. Neural Network [diagram: input layer, hidden layer, output layer] (Credit: Li/Karpathy/Johnson)

  6. Neural Network [diagram: input layer, hidden layers 1–3, output layer; width = number of neurons per layer, depth = number of layers]

  7. Activation Functions
     – Sigmoid: σ(x) = 1 / (1 + e⁻ˣ)
     – tanh: tanh(x)
     – ReLU: max(0, x)
     – Leaky ReLU: max(0.1x, x)
     – Parametric ReLU: max(αx, x)
     – Maxout: max(w₁ᵀx + b₁, w₂ᵀx + b₂)
     – ELU: f(x) = x if x > 0, α(eˣ − 1) if x ≤ 0
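For reference, a hedged numpy sketch of the listed activations; the function names and the two-branch maxout signature are choices made here, not fixed by the slide:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)

def prelu(x, alpha):
    # alpha is a learned parameter in Parametric ReLU
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # elementwise max over two learned affine maps of the same input
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

# tanh is available directly as np.tanh
```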

  8. Loss Functions • Measure the goodness of the predictions (or equivalently, the network's performance)
     • Regression losses:
       – L1 loss: L(y, ŷ; θ) = (1/n) Σᵢ |yᵢ − ŷᵢ|
       – MSE loss: L(y, ŷ; θ) = (1/n) Σᵢ |yᵢ − ŷᵢ|²
     • Classification loss (for multi-class classification):
       – Cross-entropy loss: E(y, ŷ; θ) = − Σᵢ Σₖ yᵢₖ · log ŷᵢₖ
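These three formulas transcribe directly into numpy. In the sketch below, y holds the targets and y_hat the predictions; for cross-entropy, y is assumed one-hot and y_hat holds predicted class probabilities, and the eps guard against log(0) is an implementation detail added here, not from the slide:

```python
import numpy as np

def l1_loss(y, y_hat):
    # L = (1/n) * sum_i |y_i - y_hat_i|
    return np.mean(np.abs(y - y_hat))

def mse_loss(y, y_hat):
    # L = (1/n) * sum_i |y_i - y_hat_i|^2
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    # E = - sum_i sum_k y_ik * log(y_hat_ik)
    return -np.sum(y * np.log(y_hat + eps))
```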

  9. Computational Graphs • A neural network is a computational graph:
     – it has compute nodes
     – it has edges that connect nodes
     – it is directional
     – it is organized in 'layers'

  10. Backprop

  11. The Importance of Gradients • Our optimization schemes are based on computing gradients ∇θ L(θ) • One can compute gradients analytically, but what if our function is too complex? • Break down the gradient computation: Backpropagation (Rumelhart 1986)

  12. Backprop: Forward Pass • f(x, y, z) = (x + y) ⋅ z, initialized with x = 1, y = −3, z = 4 • Sum node: d = x + y = −2 • Mult node: f = d ⋅ z = −8
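The forward pass is two lines of code; this snippet just mirrors the slide's numbers:

```python
# Forward pass of f(x, y, z) = (x + y) * z with the slide's initialization.
x, y, z = 1.0, -3.0, 4.0
d = x + y      # sum node:  d = -2
f = d * z      # mult node: f = -8
print(d, f)    # -2.0 -8.0
```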

  13. Backprop: Backward Pass • f(x, y, z) = (x + y) ⋅ z with x = 1, y = −3, z = 4; d = x + y = −2, f = d ⋅ z = −8 • Local derivatives: d = x + y ⇒ ∂d/∂x = 1, ∂d/∂y = 1; f = d ⋅ z ⇒ ∂f/∂d = z, ∂f/∂z = d • What are ∂f/∂x, ∂f/∂y, ∂f/∂z?

  14. Backprop: Backward Pass • Start at the output node: ∂f/∂f = 1

  15. Backprop: Backward Pass • Mult node: ∂f/∂z = d = −2

  16. Backprop: Backward Pass • Mult node: ∂f/∂d = z = 4

  17. Backprop: Backward Pass • Chain rule: ∂f/∂y = (∂f/∂d) ⋅ (∂d/∂y) → ∂f/∂y = 4 ⋅ 1 = 4

  18. Backprop: Backward Pass • Chain rule: ∂f/∂x = (∂f/∂d) ⋅ (∂d/∂x) → ∂f/∂x = 4 ⋅ 1 = 4
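Putting slides 13–18 together, the whole backward pass can be written out explicitly; the variable names below are chosen here for readability:

```python
# Backward pass for f(x, y, z) = (x + y) * z, propagating gradients
# from the output back to the inputs as in slides 13-18.
x, y, z = 1.0, -3.0, 4.0
d = x + y                    # d = -2
f = d * z                    # f = -8

df_df = 1.0                  # start at the output: df/df = 1
df_dz = df_df * d            # mult node: df/dz = d        -> -2
df_dd = df_df * z            # mult node: df/dd = z        ->  4
df_dx = df_dd * 1.0          # chain rule via dd/dx = 1    ->  4
df_dy = df_dd * 1.0          # chain rule via dd/dy = 1    ->  4
print(df_dx, df_dy, df_dz)   # 4.0 4.0 -2.0
```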

  19. Compute Graphs -> Neural Networks
      • xₖ: input variables
      • w_{l,m,n}: network weights (note the 3 indices: l = which layer, m = which neuron in the layer, n = which weight in the neuron)
      • ŷᵢ: computed output (i = output dimension; n_out of them)
      • yᵢ: ground-truth targets
      • L: loss function

  20. Compute Graphs -> Neural Networks [diagram: input layer x₀, x₁; weights w₀, w₁ (the unknowns); a sum node producing the output ŷ₀, compared against the target y₀ (e.g., class label / regression target) with an L2 loss]

  21. Compute Graphs -> Neural Networks [diagram: same graph with a ReLU activation max(0, x) inserted before the output (btw. I'm not arguing this is the right choice here)] • We want to compute gradients w.r.t. all weights W

  22. Compute Graphs -> Neural Networks [diagram: inputs x₀, x₁ connected to three output neurons via weights w_{0,0}, w_{0,1}, w_{1,0}, w_{1,1}, w_{2,0}, w_{2,1}; each output ŷ₀, ŷ₁, ŷ₂ has its own loss/cost term] • We want to compute gradients w.r.t. all weights W

  23. Compute Graphs -> Neural Networks • Goal: we want to compute gradients of the loss function L w.r.t. all weights W AND all biases b
      • L = Σᵢ Lᵢ: sum over the loss per sample; for an L2 loss, simply sum up the squares: Lᵢ = (ŷᵢ − yᵢ)²
      • ŷᵢ = A(bᵢ + Σₖ xₖ w_{i,k}), with activation function A and bias bᵢ
      • Use the chain rule to compute the partials: ∂Lᵢ/∂w_{i,k} = (∂Lᵢ/∂ŷᵢ) ⋅ (∂ŷᵢ/∂w_{i,k})
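A hedged sketch of this per-sample chain rule: the slide leaves the activation A generic, so a sigmoid is assumed here, and all numeric values are illustrative rather than taken from the slides:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

x   = np.array([1.0, -2.0])   # inputs x_k (illustrative)
w_i = np.array([0.5, 0.3])    # weights w_{i,k} of output neuron i
b_i = 0.1                     # bias b_i
y_i = 1.0                     # ground-truth target y_i

y_hat = sigmoid(b_i + w_i @ x)          # y_hat_i = A(b_i + sum_k x_k w_ik)
L_i   = (y_hat - y_i) ** 2              # per-sample L2 loss

dL_dyhat = 2.0 * (y_hat - y_i)          # dL_i / dy_hat_i
dyhat_dw = y_hat * (1.0 - y_hat) * x    # dy_hat_i / dw_ik (sigmoid derivative)
dL_dw    = dL_dyhat * dyhat_dw          # chain rule, as on the slide
dL_db    = dL_dyhat * y_hat * (1.0 - y_hat)  # same pattern for the bias
```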

  24. NNs as Computational Graphs • We can express any kind of function as a computational graph, e.g. f(w, x) = 1 / (1 + e^{−(w₀x₀ + w₁x₁ + b)}) • Sigmoid function: σ(x) = 1 / (1 + e⁻ˣ) [diagram: ∗ nodes for w₀x₀ and w₁x₁, a + node summing with b, then ⋅(−1), exp(⋅), +1, and 1/(⋅)]

  25. NNs as Computational Graphs • f(w, x) = 1 / (1 + e^{−(w₀x₀ + w₁x₁ + b)}) with w₀ = 2, x₀ = −1, w₁ = −3, x₁ = −2, b = −3 • Forward pass: w₀x₀ = −2, w₁x₁ = 6, sum with b → 1, ⋅(−1) → −1, exp → 0.37, +1 → 1.37, 1/(⋅) → 0.73

  26. NNs as Computational Graphs • f(w, x) = 1 / (1 + e^{−(w₀x₀ + w₁x₁ + b)}) • Local derivatives of the node types:
      – f(x) = 1/x ⇒ ∂f/∂x = −1/x²
      – f_c(x) = c + x ⇒ ∂f/∂x = 1
      – f(x) = eˣ ⇒ ∂f/∂x = eˣ
      – f_a(x) = ax ⇒ ∂f/∂x = a
      • Backward pass starts at the output with gradient 1; at the 1/x node: 1 ⋅ (−1/1.37²) = −0.53
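The following sketch reproduces the forward values (0.37, 1.37, 0.73) and the first backward step (−0.53) node by node, and then, going one step beyond the slide, carries the chain rule through to the inputs:

```python
import numpy as np

# Forward pass of f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + b))),
# with the numbers from slide 25.
w0, x0, w1, x1, b = 2.0, -1.0, -3.0, -2.0, -3.0

s = w0 * x0 + w1 * x1 + b    # sum node:   -2 + 6 - 3 = 1
n = -s                       # (* -1) node: -1
e = np.exp(n)                # exp node:   ~0.37
t = 1.0 + e                  # (+1) node:  ~1.37
f = 1.0 / t                  # (1/x) node: ~0.73  (forward output)

# Backward pass, multiplying the local derivatives along the graph:
df_dt = -1.0 / t**2          # 1/x node:  ~ -0.53 (the slide's value)
df_de = df_dt * 1.0          # +1 node
df_dn = df_de * e            # exp node:  ~ -0.20
df_ds = df_dn * -1.0         # *-1 node:  ~  0.20
df_dw0 = df_ds * x0          # ~ -0.20
df_dx0 = df_ds * w0          # ~  0.39
df_dw1 = df_ds * x1          # ~ -0.39
df_db  = df_ds * 1.0         # ~  0.20
```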
