Basics of DL
Prof. Leal-Taixé and Prof. Niessner
What we assume you know
• Linear Algebra & Programming!
• Basics from the Introduction to Deep Learning lecture
• PyTorch (you can also use TensorFlow …)
• You have already trained several models and know how to debug problems, observe training curves, and prepare training/validation/test data.
What is a Neural Network?
Neural Network
• Linear score function f = Wx (on CIFAR-10, on ImageNet)
Credit: Li/Karpathy/Johnson
Neural Network
• Linear score function f = Wx
• A neural network is a nesting of 'functions'
– 2 layers: f = W₂ max(0, W₁x)
– 3 layers: f = W₃ max(0, W₂ max(0, W₁x))
– 4 layers: f = W₄ tanh(W₃ max(0, W₂ max(0, W₁x)))
– 5 layers: f = W₅ σ(W₄ tanh(W₃ max(0, W₂ max(0, W₁x))))
– … up to hundreds of layers
Neural Network
1-layer network: f = Wx (input x: 128 × 128 = 16384, output f: 10)
2-layer network: f = W₂ max(0, W₁x) (input x: 128 × 128 = 16384, hidden h: 1000, output f: 10)
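The nesting above can be sketched in NumPy. This is a minimal forward pass only (no training); the dimensions follow the slide's 128 × 128 input and 10 classes, but the hidden size and random weights are demo values, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input dimension from the slide (flattened 128x128 image), 10 classes.
# Hidden size H is a small demo value to keep the sketch light.
D, H, C = 128 * 128, 100, 10

x = rng.standard_normal(D)           # flattened input image
W = rng.standard_normal((C, D))      # 1-layer weights
W1 = rng.standard_normal((H, D))     # 2-layer weights
W2 = rng.standard_normal((C, H))

f_linear = W @ x                     # 1-layer: f = W x
h = np.maximum(0, W1 @ x)            # hidden ReLU activations
f_2layer = W2 @ h                    # 2-layer: f = W2 max(0, W1 x)

print(f_linear.shape, h.shape, f_2layer.shape)  # (10,) (100,) (10,)
```

Deeper networks just repeat the pattern: another weight matrix applied to another nonlinearity of the previous output.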
Neural Network
Credit: Li/Karpathy/Johnson
Loss functions
Neural networks
What is the shape of this function? Prediction, followed by a loss (Softmax, Hinge).
Loss functions
• Softmax loss function: evaluates the ground-truth class score for the image
• Hinge loss (derived from the multiclass SVM loss)
Loss functions
• Softmax loss function
– Keeps optimizing: the loss only approaches zero in the limit
• Hinge loss (derived from the multiclass SVM loss)
– Saturates whenever it has learned a class "well enough"
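The contrast between the two losses can be made concrete with a small sketch. The helper functions and score values below are illustrative, not from the lecture; note how the hinge loss hits exactly zero once the true class wins by the margin, while the softmax loss stays positive.

```python
import numpy as np

def softmax_loss(scores, y):
    # Numerically stable softmax cross-entropy for one example.
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[y]

def hinge_loss(scores, y, margin=1.0):
    # Multiclass SVM loss: zero once the true class wins by the margin.
    margins = np.maximum(0, scores - scores[y] + margin)
    margins[y] = 0
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])
print(softmax_loss(scores, y=0))              # positive: class 1 outscores class 0
print(hinge_loss(scores, y=0))                # 2.9: class 1 violates the margin
print(hinge_loss(np.array([10.0, 1.0, -2.0]), y=0))  # 0.0: margin satisfied, loss saturates
```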
Activation functions
Sigmoid
• Forward: σ(x) = 1 / (1 + e⁻ˣ)
• Saturated neurons kill the gradient flow
Problem of all-positive outputs
• More on zero-mean data later
tanh
• Zero-centered
• Still saturates
[LeCun 1991]
Rectified Linear Units (ReLU)
• Large and consistent gradients
• Fast convergence
• Does not saturate (for x > 0)
• Dead ReLU: what happens if a ReLU always outputs zero?
Maxout units
• Generalization of ReLUs: piecewise-linear regimes
• Does not die
• Does not saturate
• Increases the number of parameters
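The activations above can be compared side by side in a few lines of NumPy. The maxout weights are arbitrary demo values (a maxout unit takes the max over several learned linear functions; here two fixed ones stand in for them).

```python
import numpy as np

x = np.linspace(-4, 4, 9)

sigmoid = 1 / (1 + np.exp(-x))     # range (0, 1): all-positive, saturates at both ends
tanh = np.tanh(x)                  # range (-1, 1): zero-centered, still saturates
relu = np.maximum(0, x)            # no saturation for x > 0, but zero gradient for x < 0

# Maxout with two linear pieces: max(w1*x + b1, w2*x + b2).
# The slopes/offsets here are arbitrary illustration values.
maxout = np.maximum(0.5 * x + 1.0, -1.0 * x)

print(sigmoid.round(3))
print(relu)
```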
Optimization
Gradient Descent for Neural Networks

∇_{W,b} f_{W,b}(x) = [∂f/∂w_{0,0,0} … ∂f/∂w_{l,m,n} … ∂f/∂b_{l,m} …]

L_i = (y_i − t_i)²
y_i = A(b_{1,i} + Σ_k h_k w_{1,i,k})
h_k = A(b_{0,k} + Σ_l x_l w_{0,k,l})

Just simple: A(x) = max(0, x)
Stochastic Gradient Descent (SGD)

θ^{k+1} = θ^k − α ∇_θ L(θ^k, x^{1..m}, y^{1..m})

∇_θ L = (1/m) Σ_{i=1}^{m} ∇_θ L_i

k now refers to the k-th iteration; m is the number of training samples in the current minibatch, and ∇_θ L is the gradient for the k-th minibatch.
Note the terminology: iteration vs. epoch
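The SGD update can be sketched on a toy least-squares problem. The problem, dimensions, learning rate, and batch size below are demo values, not from the lecture; the loop is exactly θ^{k+1} = θ^k − α · (mean gradient over the minibatch).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: find theta minimizing ||X theta - y||^2.
X = rng.standard_normal((256, 5))
true_theta = np.arange(1.0, 6.0)
y = X @ true_theta

theta = np.zeros(5)
lr, batch_size = 0.05, 32

for k in range(500):                                    # k-th iteration
    idx = rng.choice(len(X), batch_size, replace=False)  # sample current minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 / batch_size * Xb.T @ (Xb @ theta - yb)     # mean gradient over the batch
    theta -= lr * grad                                   # theta_{k+1} = theta_k - alpha * grad

print(np.round(theta, 2))   # close to [1. 2. 3. 4. 5.]
```

Here 500 iterations with batch size 32 over 256 samples correspond to 62.5 epochs, which illustrates the iteration-vs-epoch distinction on the slide.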
Gradient Descent with Momentum

v^{k+1} = β · v^k + ∇_θ L(θ^k)
(velocity = accumulation rate ('friction', momentum) · velocity + gradient of the current minibatch)

θ^{k+1} = θ^k − α · v^{k+1}
(model update along the velocity, with learning rate α)

Exponentially-weighted average of the gradients. Important: the velocity v^k is vector-valued!
Gradient Descent with Momentum

The step is largest when a sequence of gradients all point in the same direction.

Hyperparameters are α and β; β is often set to 0.9.

θ^{k+1} = θ^k − α · v^{k+1}
Fig. credit: I. Goodfellow
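A minimal sketch of the momentum update on an ill-conditioned quadratic (steep in one direction, shallow in the other, as in the figure). The quadratic and the step sizes are demo values, not from the lecture.

```python
import numpy as np

# Ill-conditioned quadratic: gradient is steep in theta[1], shallow in theta[0].
def grad(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta = np.array([5.0, 5.0])
v = np.zeros(2)            # velocity is vector-valued, initialized at zero
lr, beta = 0.1, 0.9        # beta ('friction'/momentum) often set to 0.9

for _ in range(200):
    v = beta * v + grad(theta)   # exponentially-weighted accumulation of gradients
    theta = theta - lr * v       # step along the velocity

print(np.round(theta, 4))  # approaches the minimum at [0, 0]
```

Because consecutive gradients in the shallow direction keep pointing the same way, the velocity accumulates there and the step grows, exactly the effect described on the slide.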
RMSProp

s^{k+1} = β · s^k + (1 − β)[∇_θ L ∘ ∇_θ L]   (∘ is element-wise multiplication)

θ^{k+1} = θ^k − α · ∇_θ L / (√(s^{k+1}) + ε)

Hyperparameters: α (needs tuning!), β (often 0.9), ε (typically 10⁻⁸)
RMSProp

Large gradients in the y-direction, small gradients in the x-direction.

s^{k+1} = β · s^k + (1 − β)[∇_θ L ∘ ∇_θ L]   → second moment: (uncentered) variance of the gradients

θ^{k+1} = θ^k − α · ∇_θ L / (√(s^{k+1}) + ε)

We divide by the root of the squared gradients:
– the divisor in the y-direction will be large,
– the divisor in the x-direction will be small,
so steep directions are damped and we can increase the learning rate!
Fig. credit: A. Ng
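The same ill-conditioned quadratic makes the RMSProp scaling visible: both coordinates end up taking steps of roughly size α, regardless of the raw gradient magnitude. All numeric values are demo choices, not from the lecture.

```python
import numpy as np

def grad(theta):
    # Large gradients in the second (y) direction, small in the first (x).
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta = np.array([5.0, 5.0])
s = np.zeros(2)
lr, beta, eps = 0.1, 0.9, 1e-8

for _ in range(300):
    g = grad(theta)
    s = beta * s + (1 - beta) * g * g            # running (uncentered) second moment
    theta = theta - lr * g / (np.sqrt(s) + eps)  # steep directions get damped

print(np.round(theta, 3))
```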
Adaptive Moment Estimation (Adam)

Combines Momentum and RMSProp.

First moment (mean of the gradients):
m^{k+1} = β₁ · m^k + (1 − β₁) ∇_θ L(θ^k)

Second moment (uncentered variance of the gradients):
v^{k+1} = β₂ · v^k + (1 − β₂)[∇_θ L(θ^k) ∘ ∇_θ L(θ^k)]

θ^{k+1} = θ^k − α · m^{k+1} / (√(v^{k+1}) + ε)
Adam

m^{k+1} and v^{k+1} are initialized with zero → biased towards zero.

m^{k+1} = β₁ · m^k + (1 − β₁) ∇_θ L(θ^k)
v^{k+1} = β₂ · v^k + (1 − β₂)[∇_θ L(θ^k) ∘ ∇_θ L(θ^k)]

Typically, bias-corrected moment updates:
m̂^{k+1} = m^{k+1} / (1 − β₁^{k+1})
v̂^{k+1} = v^{k+1} / (1 − β₂^{k+1})

θ^{k+1} = θ^k − α · m̂^{k+1} / (√(v̂^{k+1}) + ε)
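The full Adam update, including the bias correction, fits in a short loop. The quadratic test problem and the learning rate are demo values; β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ are the commonly used defaults.

```python
import numpy as np

def grad(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta = np.array([5.0, 5.0])
m = np.zeros(2)      # first moment (mean of gradients), initialized at zero
v = np.zeros(2)      # second moment (uncentered variance), initialized at zero
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for k in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1**k)      # bias correction: both moments start at zero,
    v_hat = v / (1 - beta2**k)      # so early estimates are pulled towards zero
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

print(np.round(theta, 3))
```

Without the division by (1 − β^k), the first steps would be much smaller than intended, since m and v are still close to their zero initialization.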
Convergence
Training NNs
Importance of the Learning Rate
Over- and Underfitting
Figure extracted from Deep Learning by Adam Gibson and Josh Patterson, O'Reilly Media Inc., 2017
Over- and Underfitting
Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html
Basic recipe for machine learning
• Split your data: 60% train, 20% validation, 20% test
• Use the validation set to find your hyperparameters
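The 60/20/20 split can be sketched with a shuffled index permutation; the dataset size and seed are demo values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = rng.permutation(n)   # shuffle before splitting

# 60 / 20 / 20 split as on the slide.
n_train, n_val = int(0.6 * n), int(0.2 * n)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

Hyperparameters are tuned against the validation indices only; the test indices are touched once, at the very end.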
Basic recipe for machine learning
Regularization
Regularization
• Any strategy that aims to lower the validation error, possibly at the cost of an increased training error
Data augmentation
[Krizhevsky 2012]
Early stopping
• Training time is also a hyperparameter
• Stop training before the model starts overfitting
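A minimal early-stopping sketch: stop once the validation error has not improved for `patience` evaluations. The validation curve below is synthetic demo data shaped like a typical overfitting run (improves, then rises again).

```python
# Synthetic validation errors per epoch: improvement, then overfitting.
val_errors = [0.9, 0.7, 0.55, 0.5, 0.48, 0.49, 0.52, 0.55, 0.6]

patience = 2                      # evaluations to wait without improvement
best, best_epoch, stalled = float("inf"), -1, 0

for epoch, err in enumerate(val_errors):
    if err < best:
        best, best_epoch, stalled = err, epoch, 0   # new best: reset the counter
    else:
        stalled += 1
        if stalled >= patience:                     # no improvement for too long
            break

print(best_epoch, best)  # epoch 4, error 0.48
```

In practice one would checkpoint the model weights at `best_epoch` and restore them after stopping.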
Bagging and ensemble methods
• Bagging: train k models on k different datasets (Training Set 1, Training Set 2, Training Set 3, …)
Dropout
• Disable a random set of neurons (typically 50%) in the forward pass
[Srivastava 2014]
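A short sketch of the forward pass with dropout. This uses the inverted-dropout variant (scaling at train time so the test-time forward pass is unchanged); the function name and sizes are demo choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p=0.5, train=True):
    # Inverted dropout: zero out units with probability p at train time,
    # and rescale the survivors by 1/(1-p) so the expected activation
    # matches the (unscaled) test-time forward pass.
    if not train:
        return h
    mask = (rng.random(h.shape) >= p) / (1 - p)
    return h * mask

h = np.ones(10000)
out = dropout_forward(h)
print(out.mean())   # close to 1.0 in expectation; roughly half the units are zero
```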
How to deal with images?
Using CNNs in Computer Vision
Credit: Li/Karpathy/Johnson
Image filters
• Each kernel gives us a different image filter

Box mean:
(1/9) · [1 1 1; 1 1 1; 1 1 1]

Edge detection:
[−1 −1 −1; −1 8 −1; −1 −1 −1]

Sharpen:
[0 −1 0; −1 5 −1; 0 −1 0]

Gaussian blur:
(1/16) · [1 2 1; 2 4 2; 1 2 1]
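The kernels above can be applied with a naive sliding-window loop; a minimal sketch on a constant grayscale image, where the expected outputs are easy to check by hand (the loop computes correlation, which matches convolution here since the kernels are symmetric).

```python
import numpy as np

def conv2d(img, kernel):
    # Valid-mode 2D correlation of a grayscale image with a small kernel.
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

box_mean = np.full((3, 3), 1 / 9)
edge = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])

flat = np.ones((5, 5))            # constant image
print(conv2d(flat, edge))         # all zeros: a constant image has no edges
print(conv2d(flat, box_mean))     # all ones: the mean of a constant image
```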
Convolutions on RGB Images
• 32 × 32 × 3 image (pixels x)
• 5 × 5 × 3 filter (weights w)
• Convolve: slide over all spatial locations xᵢ and compute all outputs zᵢ; without padding, there are 28 × 28 locations
• Result: a 28 × 28 × 1 activation map (also called a feature map)
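The 28 × 28 count follows from the standard output-size formula; a small helper makes the arithmetic explicit (the formula is general; the parameter names are illustrative).

```python
def conv_output_size(n, k, padding=0, stride=1):
    # Spatial output size of a convolution: (n - k + 2p) / s + 1.
    return (n - k + 2 * padding) // stride + 1

# 32x32x3 image, 5x5x3 filter, no padding -> 28x28 locations, as on the slide.
print(conv_output_size(32, 5))   # 28

# Chaining 5x5 convs without padding shrinks the map by 4 each time:
n = 32
for _ in range(3):
    n = conv_output_size(n, 5)
print(n)   # 32 -> 28 -> 24 -> 20
```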
Convolution Layer
• 32 × 32 × 3 image
• Let's apply five filters, each with different weights!
• Result: five 28 × 28 activation maps
CNN Prototype
• A ConvNet is a concatenation of conv layers and activations
• Input image 32 × 32 × 3 → Conv + ReLU (5 filters, 5 × 5 × 3) → 28 × 28 × 5 → Conv + ReLU (8 filters, 5 × 5 × 5) → 24 × 24 × 8 → Conv + ReLU (12 filters, 5 × 5 × 8) → 20 × 20 × 12
CNN learned filters