Basics of DL

  1. Basics of DL (Prof. Leal-Taixé and Prof. Niessner)

  2. What we assume you know • Linear Algebra & Programming! • Basics from the Introduction to Deep Learning lecture • PyTorch (can use TensorFlow …) • You have already trained several models and know how to debug problems, observe training curves, and prepare training/validation/test data.

  3. What is a Neural Network?

  4. Neural Network • Linear score function f = Wx (illustrated on CIFAR-10 and on ImageNet; credit: Li/Karpathy/Johnson)

  5. Neural Network • Linear score function f = Wx • A neural network is a nesting of 'functions':
     – 2 layers: f = W_2 max(0, W_1 x)
     – 3 layers: f = W_3 max(0, W_2 max(0, W_1 x))
     – 4 layers: f = W_4 tanh(W_3 max(0, W_2 max(0, W_1 x)))
     – 5 layers: f = W_5 σ(W_4 tanh(W_3 max(0, W_2 max(0, W_1 x))))
     – … up to hundreds of layers

  6. Neural Network. 1-layer network: f = Wx, with input x of size 128 × 128 = 16384 and output f of size 10. 2-layer network: f = W_2 max(0, W_1 x), with input x of size 16384, hidden layer h of size 1000, and output f of size 10.
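
A minimal PyTorch sketch of the 2-layer score function above, using the layer sizes from the slide (the random weight initialization is purely illustrative):

```python
import torch

x = torch.randn(16384)                   # flattened 128 x 128 input image
W1 = 0.01 * torch.randn(1000, 16384)     # first-layer weights (hidden size 1000)
W2 = 0.01 * torch.randn(10, 1000)        # second-layer weights (10 class scores)

h = torch.clamp(W1 @ x, min=0)           # max(0, W1 x): ReLU hidden activations
f = W2 @ h                               # class scores f = W2 max(0, W1 x)
print(f.shape)                           # torch.Size([10])
```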

  7. Neural Network (figure; credit: Li/Karpathy/Johnson)

  8. Loss functions

  9. Neural networks. What is the shape of this function? A prediction followed by a loss (Softmax, Hinge).

  10. Loss functions • Softmax loss function • Hinge loss (derived from the Multiclass SVM loss). Both evaluate the ground-truth score for the image.

  11. Loss functions • Softmax loss function – optimizes until the loss is zero • Hinge loss (derived from the Multiclass SVM loss) – saturates whenever it has learned a class "well enough"
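
Both losses are available in PyTorch; a small sketch with made-up scores (cross-entropy corresponds to the softmax loss, multi-margin to the multiclass SVM / hinge loss):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 0.5, -1.0]])         # hypothetical class scores for one image
target = torch.tensor([0])                        # ground-truth class index

softmax_loss = F.cross_entropy(scores, target)    # softmax (cross-entropy) loss: never exactly zero
hinge_loss = F.multi_margin_loss(scores, target)  # multiclass SVM (hinge) loss: zero once margins are met
print(softmax_loss.item(), hinge_loss.item())
```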

  12. Activation functions

  13. Sigmoid (forward pass). Saturated neurons kill the gradient flow.

  14. Problem of positive-only output. More on zero-mean data later.

  15. tanh. Zero-centered, but still saturates. LeCun 1991

  16. Rectified Linear Units (ReLU). Does not saturate, gives large and consistent gradients, fast convergence. Dead ReLU: what happens if a ReLU outputs zero?

  17. Maxout units. A generalization of ReLUs with linear regimes: does not die, does not saturate. Drawback: increases the number of parameters.
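
A short sketch of the activation functions discussed above; the maxout parameters are illustrative (a maxout unit takes the maximum over k linear pieces, here k = 2):

```python
import torch

x = torch.linspace(-3, 3, 7)

sig = torch.sigmoid(x)        # saturates for large |x|, output in (0, 1), not zero-centered
tnh = torch.tanh(x)           # zero-centered, but still saturates
rel = torch.relu(x)           # no saturation for x > 0; exactly zero for x <= 0 (dead ReLU risk)

# Maxout with two linear pieces: max(w1*x + b1, w2*x + b2) -- twice the parameters of a ReLU
w1, b1, w2, b2 = 1.0, 0.0, -1.0, 0.5
maxout = torch.maximum(w1 * x + b1, w2 * x + b2)
print(maxout)
```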

  18. Optimization

  19. Gradient Descent for Neural Networks. Two-layer example: hidden units h_j = A(b_{0,j} + Σ_k x_k w_{0,j,k}), outputs ŷ_i = A(b_{1,i} + Σ_j h_j w_{1,i,j}), per-output loss L_i = (ŷ_i − t_i)², with a simple activation A(x) = max(0, x). The gradient collects the partial derivatives with respect to every weight and bias: ∇_{w,b} f_{x,t} = (∂f/∂w_{0,0,0}, …, ∂f/∂w_{l,m,n}, …, ∂f/∂b_{l,m}).
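
The same two-layer computation can be written with PyTorch autograd, which fills in all of these partial derivatives by backpropagation (sizes and data here are arbitrary toy values):

```python
import torch

x = torch.randn(3)                            # input
t = torch.randn(2)                            # targets
W0 = torch.randn(4, 3, requires_grad=True)    # first-layer weights
b0 = torch.zeros(4, requires_grad=True)
W1 = torch.randn(2, 4, requires_grad=True)    # second-layer weights
b1 = torch.zeros(2, requires_grad=True)

h = torch.relu(W0 @ x + b0)                   # h_j = max(0, b0_j + sum_k x_k w0_jk)
y_hat = torch.relu(W1 @ h + b1)               # predictions
loss = ((y_hat - t) ** 2).sum()               # L = sum_i (y_hat_i - t_i)^2

loss.backward()                               # backprop fills in gradients for every weight and bias
print(W0.grad.shape, b0.grad.shape, W1.grad.shape, b1.grad.shape)
```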

  20. Stochastic Gradient Descent (SGD): θ^{k+1} = θ^k − α ∇_θ L(θ^k, x^{1..n}, y^{1..n}), where ∇_θ L = (1/n) Σ_{i=1}^n ∇_θ L_i is the gradient over the n training samples in the current batch, and k now refers to the k-th iteration. Note the terminology: iteration vs. epoch.
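
A self-contained sketch of this update rule on a toy linear regression problem (data and hyperparameters are made up):

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 5)                                 # 100 samples, 5 features
y = X @ torch.tensor([1., -2., 0.5, 3., 0.]) + 0.1 * torch.randn(100)

theta = torch.zeros(5, requires_grad=True)
lr, batch_size = 0.1, 10

for k in range(200):                                    # iterations, not epochs
    idx = torch.randint(0, 100, (batch_size,))          # draw a random minibatch
    loss = ((X[idx] @ theta - y[idx]) ** 2).mean()      # minibatch loss
    loss.backward()                                     # gradient for this batch
    with torch.no_grad():
        theta -= lr * theta.grad                        # theta_{k+1} = theta_k - alpha * grad
        theta.grad.zero_()
```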

  21. Gradient Descent with Momentum: v^{k+1} = β · v^k + ∇_θ L(θ^k), θ^{k+1} = θ^k − α · v^{k+1}. The velocity v accumulates an exponentially-weighted average of the gradients of the current minibatches; β is the accumulation rate ('friction', momentum) and α is the learning rate. Important: the velocity v^k is vector-valued!

  22. Gradient Descent with Momentum. The step is largest when a sequence of gradients all point in the same direction. Hyperparameters are α and β; β is often set to 0.9. θ^{k+1} = θ^k − α · v^{k+1} (Fig. credit: I. Goodfellow)
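
A sketch of a single momentum step on raw tensors, using β = 0.9 as on the slide (the gradient values are illustrative):

```python
import torch

def momentum_step(theta, grad, v, lr=0.01, beta=0.9):
    v_new = beta * v + grad                  # v_{k+1} = beta * v_k + grad(theta_k)
    theta_new = theta - lr * v_new           # theta_{k+1} = theta_k - alpha * v_{k+1}
    return theta_new, v_new

theta = torch.zeros(3)
v = torch.zeros_like(theta)                  # velocity is vector-valued, same shape as theta
grad = torch.tensor([0.5, -1.0, 0.2])        # gradient of the current minibatch (made up)
theta, v = momentum_step(theta, grad, v)
```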

  23. RMSProp: s^{k+1} = β · s^k + (1 − β) [∇_θ L ∘ ∇_θ L] (element-wise multiplication), θ^{k+1} = θ^k − α · ∇_θ L / (√(s^{k+1}) + ε). Hyperparameters: α, β, ε. The learning rate α needs tuning; β is often 0.9, ε typically 10^{−8}.

  24. RMSProp. s^{k+1} = β · s^k + (1 − β)[∇_θ L ∘ ∇_θ L] accumulates the (uncentered) variance of the gradients, i.e., a second momentum. In the update θ^{k+1} = θ^k − α · ∇_θ L / (√(s^{k+1}) + ε) we divide by the root of the accumulated squared gradients: the division is large along directions with large gradients (Y-direction in the figure) and small along directions with small gradients (X-direction), so we can increase the learning rate. (Fig. credit: A. Ng)
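
A corresponding sketch of one RMSProp step (β = 0.9 and ε = 1e-8 as on the slide; the gradient is made up):

```python
import torch

def rmsprop_step(theta, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    s_new = beta * s + (1 - beta) * grad * grad            # running average of squared gradients
    theta_new = theta - lr * grad / (torch.sqrt(s_new) + eps)
    return theta_new, s_new

theta = torch.zeros(3)
s = torch.zeros_like(theta)
grad = torch.tensor([2.0, 0.01, -0.5])                     # large gradient in dim 0, small in dim 1
theta, s = rmsprop_step(theta, grad, s)                    # large-gradient dims get relatively smaller steps
```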

  25. Adaptive Moment Estimation (Adam). Combines Momentum and RMSProp. First moment (mean of the gradients): m^{k+1} = β_1 · m^k + (1 − β_1) ∇_θ L(θ^k). Second moment (variance of the gradients): v^{k+1} = β_2 · v^k + (1 − β_2)[∇_θ L(θ^k) ∘ ∇_θ L(θ^k)]. Update: θ^{k+1} = θ^k − α · m^{k+1} / (√(v^{k+1}) + ε).

  26. Adam. Combines Momentum and RMSProp. m^{k+1} and v^{k+1} are initialized with zero, which biases them towards zero. Typically, bias-corrected moments are used in the update: m̂^{k+1} = m^{k+1} / (1 − β_1^{k+1}), v̂^{k+1} = v^{k+1} / (1 − β_2^{k+1}), and θ^{k+1} = θ^k − α · m̂^{k+1} / (√(v̂^{k+1}) + ε).
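
A sketch of one bias-corrected Adam step combining the two moments (β_1 = 0.9, β_2 = 0.999, ε = 1e-8 are the commonly used defaults; the gradient is made up):

```python
import torch

def adam_step(theta, grad, m, v, k, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad      # second moment: uncentered variance
    m_hat = m / (1 - beta1 ** (k + 1))             # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** (k + 1))
    theta = theta - lr * m_hat / (torch.sqrt(v_hat) + eps)
    return theta, m, v

theta = torch.zeros(3)
m, v = torch.zeros_like(theta), torch.zeros_like(theta)
grad = torch.tensor([0.3, -0.7, 1.2])
theta, m, v = adam_step(theta, grad, m, v, k=0)
```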

  27. Convergence

  28. Training NNs

  29. Importance of Learning Rate

  30. Over- and Underfitting. Figure extracted from "Deep Learning" by Adam Gibson and Josh Patterson, O'Reilly Media Inc., 2017

  31. Over- and Underfitting. Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html

  32. Basic recipe for machine learning • Split your data: 60% training, 20% validation, 20% test. Use the validation set to find your hyperparameters.
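
A minimal sketch of such a 60/20/20 split with PyTorch utilities (the dataset here is a random toy dataset):

```python
import torch
from torch.utils.data import TensorDataset, random_split

data = TensorDataset(torch.randn(1000, 5), torch.randint(0, 10, (1000,)))   # toy dataset
train_set, val_set, test_set = random_split(data, [600, 200, 200])          # 60% / 20% / 20%
# Tune hyperparameters on val_set; touch test_set only once, for the final evaluation.
```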

  33. Basic recipe for machine learning

  34. Regularization

  35. Regularization • Any strategy that aims to lower the validation error while increasing the training error.

  36. Data augmentation (Krizhevsky 2012)
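
A sketch of typical image augmentations with torchvision, in the spirit of the crops and flips used by Krizhevsky 2012 (the exact transforms and parameters here are illustrative, not the ones from the paper):

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),                        # random crop, rescaled to the input size
    T.RandomHorizontalFlip(),                        # mirror the image with probability 0.5
    T.ColorJitter(brightness=0.2, contrast=0.2),     # mild photometric perturbation
    T.ToTensor(),
])
# augmented = augment(pil_image)                     # applied on the fly to every training image
```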

  37. Early stopping. Training time is also a hyperparameter; stop before overfitting sets in.
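
A sketch of the idea on a simulated validation curve (the numbers are made up): keep the parameters from the epoch with the lowest validation error and stop once it has not improved for a few epochs.

```python
val_errors = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.61, 0.65]   # starts overfitting after epoch 3
best_val, best_epoch, patience = float("inf"), 0, 2

for epoch, val_err in enumerate(val_errors):
    if val_err < best_val:
        best_val, best_epoch = val_err, epoch        # keep (a checkpoint of) the best model so far
    elif epoch - best_epoch >= patience:             # no improvement for `patience` epochs
        print(f"stop at epoch {epoch}, restore epoch {best_epoch}")
        break
```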

  38. Bagging and ensemble methods • Bagging: uses k different datasets (Training Set 1, Training Set 2, Training Set 3, …)

  39. Dropout • Disable a random set of neurons (typically 50%) in the forward pass. Srivastava 2014
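
In PyTorch, dropout is a layer that behaves differently in training and evaluation mode; a quick sketch:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)      # disable 50% of the activations at random (modules start in train mode)
h = torch.ones(8)
print(drop(h))                # training: random entries are zeroed, survivors scaled by 1/(1-p) = 2
drop.eval()
print(drop(h))                # evaluation: dropout is disabled, activations pass through unchanged
```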

  40. How to deal with images?

  41. Using CNNs in Computer Vision (Credit: Li/Karpathy/Johnson)

  42. Image filters • Each kernel gives us a different image filter:
     – Box mean: 1/9 · [1 1 1; 1 1 1; 1 1 1]
     – Edge detection: [−1 −1 −1; −1 8 −1; −1 −1 −1]
     – Sharpen: [0 −1 0; −1 5 −1; 0 −1 0]
     – Gaussian blur: 1/16 · [1 2 1; 2 4 2; 1 2 1]
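
Any of these kernels can be applied with a single conv2d call; a sketch using the sharpen kernel on a dummy grayscale image:

```python
import torch
import torch.nn.functional as F

kernel = torch.tensor([[ 0., -1.,  0.],
                       [-1.,  5., -1.],
                       [ 0., -1.,  0.]]).view(1, 1, 3, 3)   # (out_ch, in_ch, H, W)
image = torch.rand(1, 1, 64, 64)                            # dummy 64 x 64 grayscale image
sharpened = F.conv2d(image, kernel, padding=1)              # padding keeps the spatial size
print(sharpened.shape)                                      # torch.Size([1, 1, 64, 64])
```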

  43. Convolutions on RGB Images. Convolve a 32 × 32 × 3 image (pixels x) with a 5 × 5 × 3 filter (weights w): slide the filter over all spatial locations x_i and compute an output z_i at each. Without padding there are 28 × 28 locations, so the result is a 28 × 28 × 1 activation map (also called feature map).
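
The 28 × 28 output size can be verified directly with a single unpadded convolution layer:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)   # one 5x5x3 filter, no padding
image = torch.rand(1, 3, 32, 32)                                  # batch of one 32x32 RGB image
print(conv(image).shape)                # torch.Size([1, 1, 28, 28]): (32 - 5) + 1 = 28 per side
```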

  44. Convolution Layer. Let's apply five filters, each with different weights! Convolving each 5 × 5 × 3 filter over the 32 × 32 × 3 image yields five 28 × 28 activation maps.

  45. CNN Prototype. A ConvNet is a concatenation of conv layers and activations: input image 32 × 32 × 3 → Conv + ReLU (5 filters of 5 × 5 × 3) → 28 × 28 × 5 → Conv + ReLU (8 filters of 5 × 5 × 5) → 24 × 24 × 8 → Conv + ReLU (12 filters of 5 × 5 × 8) → 20 × 20 × 12.
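
A sketch of this prototype in PyTorch (unpadded convolutions, so each layer shrinks the spatial size by 4):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 5, kernel_size=5), nn.ReLU(),    # 32x32x3  -> 28x28x5
    nn.Conv2d(5, 8, kernel_size=5), nn.ReLU(),    # 28x28x5  -> 24x24x8
    nn.Conv2d(8, 12, kernel_size=5), nn.ReLU(),   # 24x24x8  -> 20x20x12
)
print(net(torch.rand(1, 3, 32, 32)).shape)        # torch.Size([1, 12, 20, 20])
```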

  46. CNN learned filters
