Scaling Optimization

Scaling Optimization I2DL: Prof. Niessner, Prof. Leal-Taixé - PowerPoint PPT Presentation



  1. Scaling Optimization

  2. Lecture 4 Recap

  3. Neural Network. Source: http://cs231n.github.io/neural-networks-1/

  4. Neural Network: Input Layer, Hidden Layers 1-3, Output Layer; the diagram is annotated with the network Width and Depth.

  5. Compute Graphs → Neural Networks. Compute graph for a single output: inputs x_0, x_1 are multiplied by the weights w_0, w_1 (the unknowns), summed, and passed through a ReLU activation max(0, x) to give the prediction ŷ_0. The prediction is compared to the target y_0 (e.g., a class label or regression target) with an L2 loss to give the loss/cost (btw., I'm not arguing this is the right choice here). We want to compute gradients w.r.t. all weights W.

  6. Compute Graphs → Neural Networks. The same compute graph extended to several outputs: inputs x_0, x_1 are combined with weights w_{0,0}, w_{0,1}, ..., w_{2,1} to produce predictions ŷ_0, ŷ_1, ŷ_2, each compared to its target y_0, y_1, y_2 with an L2 loss/cost term. We want to compute gradients w.r.t. all weights W.

  7. Compute Graphs → Neural Networks. Goal: compute the gradients of the loss function L w.r.t. all weights w. L = Σ_i L_i is the sum of the per-sample losses; for an L2 loss we simply sum up squares: L_i = (ŷ_i − y_i)². The prediction is ŷ_i = A(b_i + Σ_k x_k w_{i,k}), with activation function A and bias b_i. Use the chain rule to compute the partials: ∂L/∂w_{i,k} = (∂L/∂ŷ_i) · (∂ŷ_i/∂w_{i,k}). We want to compute gradients w.r.t. all weights W AND all biases b.
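
To make the chain rule on slide 7 concrete, here is a minimal NumPy sketch (the function and variable names are my own, not the lecture's code) for one linear layer with ReLU activation and an L2 loss; it evaluates ŷ = A(b + Wx) and applies ∂L/∂w_{i,k} = (∂L/∂ŷ_i)(∂ŷ_i/∂w_{i,k}):

```python
# Minimal sketch (assumed names): one linear layer + ReLU + L2 loss,
# with gradients obtained via the chain rule from slide 7.
import numpy as np

def forward_backward(W, b, x, y):
    z = b + W @ x                        # pre-activation, shape (out,)
    y_hat = np.maximum(0.0, z)           # A(z) = ReLU
    loss = np.sum((y_hat - y) ** 2)      # L = sum_i (y_hat_i - y_i)^2
    dL_dyhat = 2.0 * (y_hat - y)         # dL/dy_hat_i
    dyhat_dz = (z > 0).astype(float)     # ReLU derivative
    delta = dL_dyhat * dyhat_dz          # dL/dz_i
    dL_dW = np.outer(delta, x)           # dL/dw_{i,k} = delta_i * x_k
    dL_db = delta                        # dL/db_i
    return loss, dL_dW, dL_db
```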

  8. Summary. • We have: a (directional) compute graph; structure the graph into layers; compute the partial derivatives w.r.t. the weights (unknowns): ∇_W f_{x,y}(W) = (∂f/∂w_{0,0,0}, ..., ∂f/∂w_{l,m,n}, ..., ∂f/∂b_{l,m}). • Next step: find the weights based on the gradients. Gradient step: W' = W − α ∇_W f_{x,y}(W).

  9. Optimization

  10. Gradient Descent. x* = arg min_x f(x). Figure: a loss curve with an initialization point and the optimum.

  11. Gradient Descent. x* = arg min_x f(x). Follow the slope of the DERIVATIVE: df(x)/dx = lim_{h→0} [f(x + h) − f(x)] / h. Figure: from the initialization, move downhill toward the optimum.
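
The limit definition of the derivative can be checked numerically with a small but finite h. A tiny sketch (the example function is an assumption, not from the slides):

```python
# Forward-difference approximation of df/dx from the limit definition.
def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2
print(numerical_derivative(f, 3.0))   # ~6.0, matching df/dx = 2x at x = 3
```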

  12. Gradient Descent. • From derivative to gradient: the gradient ∇_x f(x) collects the derivatives df(x)/dx and points in the direction of greatest increase of the function. • Gradient step in the direction of the negative gradient: x' = x − α ∇_x f(x), where α is the learning rate.
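
A minimal sketch of the update x' = x − α ∇_x f(x) on a simple 1D quadratic (the test function, starting point, and step count are illustrative assumptions):

```python
# Gradient descent on f(x) = (x - 2)^2, whose gradient is f'(x) = 2 (x - 2).
x = 5.0          # initialization
alpha = 0.1      # learning rate: too small -> slow progress, too large -> overshooting
for _ in range(50):
    grad = 2.0 * (x - 2.0)
    x = x - alpha * grad     # step in the direction of the negative gradient
print(x)                     # close to the optimum x* = 2
```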

  13. Gradient Descent. Same update x' = x − α ∇_x f(x), illustrated with a SMALL learning rate α.

  14. Gradient Descent. Same update x' = x − α ∇_x f(x), illustrated with a LARGE learning rate α.

  15. Gradient Descent. x* = arg min_x f(x). Starting from the initialization: what is the gradient when we reach this point (a local minimum)? Gradient descent is not guaranteed to reach the global optimum.

  16. Convergence of Gradient Descent. • Convex function: all local minima are global minima. A function is convex if the line/plane segment between any two points lies above or on the graph. Source: https://en.wikipedia.org/wiki/Convex_function#/media/File:ConvexFunction.svg

  17. Convergence of Gradient Descent. • Neural networks are non-convex: there are many (different) local minima, and no (practical) way to say which one is globally optimal. Source: Li, Qi. (2006). Challenging Registration of Geologic Image Data.

  18. Convergence of Gradient Descent (illustration). Source: https://builtin.com/data-science/gradient-descent

  19. Convergence of Gradient Descent (illustration). Source: A. Geron

  20. Gradient Descent: Multiple Dimensions. Various ways to visualize… Source: builtin.com/data-science/gradient-descent

  21. Gradient Descent: Multiple Dimensions. Source: http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png

  22. Gradient Descent for Neural Networks. Two-layer example with inputs x_0, x_1, x_2, hidden units h_0, ..., h_3, and outputs ŷ_0, ŷ_1. Loss function: L_i = (ŷ_i − y_i)². Forward pass: h_j = A(b_{0,j} + Σ_k x_k w_{0,j,k}) and ŷ_i = A(b_{1,i} + Σ_j h_j w_{1,i,j}), with just a simple A(x) = max(0, x). Gradient w.r.t. all parameters: ∇_{W,b} f_{x,y}(W) = (∂f/∂w_{0,0,0}, ..., ∂f/∂w_{l,m,n}, ..., ∂f/∂b_{l,m}).
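
A minimal NumPy sketch of the two-layer forward pass from slide 22 (shapes, names, and the random values are assumptions): h = A(b_0 + W_0 x), ŷ = A(b_1 + W_1 h), with A = ReLU and an L2 loss per output.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # A(x) = max(0, x)

def forward(x, W0, b0, W1, b1, y):
    h = relu(b0 + W0 @ x)                # h_j = A(b_{0,j} + sum_k x_k w_{0,j,k})
    y_hat = relu(b1 + W1 @ h)            # y_hat_i = A(b_{1,i} + sum_j h_j w_{1,i,j})
    loss = np.sum((y_hat - y) ** 2)      # L = sum_i (y_hat_i - y_i)^2
    return y_hat, loss

# Shapes matching the slide's sketch: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(2)
W0, b0 = rng.standard_normal((4, 3)), np.zeros(4)
W1, b1 = rng.standard_normal((2, 4)), np.zeros(2)
print(forward(x, W0, b0, W1, b1, y))
```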

  23. Gradient Descent: Single Training Sample. • Given a loss function L and a single training sample {x_i, y_i}. • Find the best model parameters θ = {W, b}. • Cost: L_i(θ, x_i, y_i); θ = arg min_θ L_i(x_i, y_i). • Gradient Descent: initialize θ^1 with 'random' values (more on that later); θ^{k+1} = θ^k − α ∇_θ L_i(θ^k, x_i, y_i); iterate until convergence: ‖θ^{k+1} − θ^k‖ < ϵ.
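
A hedged sketch of this loop, treating θ as a flat parameter vector and assuming a helper grad_loss(theta, x_i, y_i) that returns ∇_θ L_i via backpropagation (the helper and all names are placeholders, not the lecture's code):

```python
import numpy as np

def gradient_descent_single_sample(theta, x_i, y_i, grad_loss,
                                   alpha=1e-2, eps=1e-6, max_steps=10_000):
    # theta^{k+1} = theta^k - alpha * grad_theta L_i(theta^k, x_i, y_i),
    # iterated until ||theta^{k+1} - theta^k|| < eps.
    for _ in range(max_steps):
        theta_next = theta - alpha * grad_loss(theta, x_i, y_i)
        if np.linalg.norm(theta_next - theta) < eps:
            return theta_next
        theta = theta_next
    return theta
```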

  24. Gradient Descent: Single Training Sample. θ^{k+1} = θ^k − α ∇_θ L_i(θ^k, x_i, y_i), where {x_i, y_i} is the training sample, L_i the loss function, θ^{k+1} the weights and biases after the update step, ∇_θ the gradient w.r.t. θ, α the learning rate, and θ^k the weights and biases at step k (the current model). ∇_θ L_i(θ^k, x_i, y_i) is computed via backpropagation. Typically: dim ∇_θ L_i(θ^k, x_i, y_i) = dim θ ≫ 1 million.

  25. Gradient Descent: Multiple Training Samples. • Given a loss function L and multiple (n) training samples {x_i, y_i}. • Find the best model parameters θ = {W, b}. • Cost: L = (1/n) Σ_{i=1}^n L_i(θ, x_i, y_i); θ = arg min_θ L.

  26. Gradient Descent: Multiple Training Samples. • Update step for multiple samples: θ^{k+1} = θ^k − α ∇_θ L(θ^k, x_{1..n}, y_{1..n}). • The gradient is the average / sum over the per-sample residuals: ∇_θ L(θ^k, x_{1..n}, y_{1..n}) = (1/n) Σ_{i=1}^n ∇_θ L_i(θ^k, x_i, y_i). Reminder: this comes from backprop. • Often people are lazy and just write ∇L = Σ_{i=1}^n ∇_θ L_i; omitting the 1/n is not 'wrong', it just means rescaling the learning rate.
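
A sketch of one full-batch update with the 1/n averaging made explicit (θ as a flat vector and grad_loss as an assumed per-sample gradient helper, as above):

```python
import numpy as np

def full_batch_step(theta, xs, ys, grad_loss, alpha=1e-2):
    # Gradient of the total loss = average of the per-sample gradients.
    grad = np.mean([grad_loss(theta, x_i, y_i) for x_i, y_i in zip(xs, ys)], axis=0)
    return theta - alpha * grad   # dropping the 1/n would only rescale alpha
```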

  27. Side Note: Optimal Learning Rate. We can compute the optimal learning rate α using Line Search (optimal for a given set): 1. Compute the gradient: ∇_θ L = (1/n) Σ_{i=1}^n ∇_θ L_i. 2. Optimize for the optimal step α: arg min_α L(θ^k − α ∇_θ L). 3. Update: θ^{k+1} = θ^k − α ∇_θ L. Not that practical for DL since we would need to solve a huge system at every step…
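
A crude sketch of the three steps, with a grid over candidate step sizes standing in for a proper line search (full_loss and full_grad are assumed helpers that evaluate the total loss and its gradient over the set):

```python
import numpy as np

def line_search_step(theta, full_loss, full_grad, candidates=np.logspace(-4, 0, 20)):
    grad = full_grad(theta)                                  # 1. gradient over the set
    alpha = min(candidates,                                  # 2. best step size along -grad
                key=lambda a: full_loss(theta - a * grad))
    return theta - alpha * grad                              # 3. apply the update
```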

  28. Gradient Descent on Train Set. • Given a large train set with n training samples {x_i, y_i}: let's say 1 million labeled images, and let's say our network has 500k parameters. • The gradient has 500k dimensions. • n = 1 million → extremely expensive to compute.

  29. Stochastic Gradient Descent (SGD). • If we have n training samples, we need to compute the gradient for all of them, which is O(n). • If we consider the problem as empirical risk minimization, we can express the total loss over the training data as an expectation over the samples: (1/n) Σ_{i=1}^n L_i(θ, x_i, y_i) = 𝔼_{i∼{1,…,n}}[L_i(θ, x_i, y_i)].
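
Since the total loss is an expectation over the samples, it can be estimated from a random subset instead of all n samples; a minimal mini-batch SGD sketch under the same assumed grad_loss helper (batch size and names are placeholders):

```python
import numpy as np

def sgd_step(theta, xs, ys, grad_loss, alpha=1e-2, batch_size=32, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Estimate E_i[grad L_i] from a random mini-batch instead of all n samples.
    idx = rng.choice(len(xs), size=batch_size, replace=False)
    grad = np.mean([grad_loss(theta, xs[i], ys[i]) for i in idx], axis=0)
    return theta - alpha * grad
```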
