SLIDE 1

Training Neural Networks

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 1

SLIDE 2

Lecture 5 Recap

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 2

SLIDE 3

Gradient Descent for Neural Networks

Two-layer network: inputs $x_0, x_1, x_2$, hidden units $h_0, \dots, h_3$, outputs $\hat{y}_0, \hat{y}_1$, targets $y_0, y_1$.

$\hat{y}_i = A\Big(b_{1,i} + \sum_j h_j \, w_{1,i,j}\Big)$, where $h_j = A\Big(b_{0,j} + \sum_k x_k \, w_{0,j,k}\Big)$

Loss function: $L_i = (\hat{y}_i - y_i)^2$

Just simple: $A(x) = \max(0, x)$

$\nabla_{\boldsymbol{W},\boldsymbol{b}}\, f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W}) = \left[\frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{l,m,n}}, \dots, \frac{\partial f}{\partial b_{m,n}}\right]$

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 3
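To make the slide's formulas concrete, here is a minimal NumPy sketch of the forward pass and loss of such a two-layer ReLU network (layer sizes, random weights, and targets are made-up example values, not from the slide):

```python
import numpy as np

def relu(x):
    # A(x) = max(0, x)
    return np.maximum(0, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                            # inputs x_0..x_2
W0, b0 = rng.standard_normal((4, 3)), np.zeros(4)     # hidden layer weights/biases
W1, b1 = rng.standard_normal((2, 4)), np.zeros(2)     # output layer weights/biases
y = np.array([1.0, 0.0])                              # targets y_0, y_1

h = relu(b0 + W0 @ x)                                 # h_j = A(b_0j + sum_k x_k w_0jk)
y_hat = relu(b1 + W1 @ h)                             # y^_i = A(b_1i + sum_j h_j w_1ij)
loss = np.sum((y_hat - y) ** 2)                       # L = sum_i (y^_i - y_i)^2
print(loss)
```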

SLIDE 4

Stochastic Gradient Descent (SGD)

πœΎπ‘™+1 = πœΎπ‘™ βˆ’ π›½π›ΌπœΎπ‘€(πœΎπ‘™, π’š{1..𝑛}, 𝒛{1..𝑛}) π›ΌπœΎπ‘€ =

1 𝑛 σ𝑗=1 𝑛 π›ΌπœΎπ‘€π‘—

:

𝑙 now refers to 𝑙-th iteration 𝑛 training samples in the current minibatch Gradient for the 𝑙-th minibatch

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 4
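A minimal sketch of one SGD iteration as defined above, assuming a hypothetical helper `grad_fn(theta, x_i, y_i)` that returns the per-sample gradient of $L_i$:

```python
import numpy as np

def sgd_step(theta, minibatch, grad_fn, lr=1e-2):
    # grad_fn(theta, x_i, y_i) returns the per-sample gradient dL_i/dtheta (hypothetical helper)
    grads = [grad_fn(theta, x_i, y_i) for x_i, y_i in minibatch]
    grad = np.mean(grads, axis=0)   # (1/m) * sum_i grad(L_i), averaged over the minibatch
    return theta - lr * grad        # theta^{k+1} = theta^k - alpha * grad
```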

SLIDE 5

Gradient Descent with Momentum

π’˜π‘™+1 = 𝛾 β‹… π’˜π‘™ + π›ΌπœΎπ‘€(πœΎπ‘™) πœΎπ‘™+1 = πœΎπ‘™ βˆ’ 𝛽 β‹… π’˜π‘™+1 Exponentially-weighted average of gradient Important: velocity π’˜π‘™ is vector-valued!

Gradient of current minibatch velocity accumulation rate (β€˜friction’, momentum) learning rate velocity model

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 5
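A minimal sketch of the momentum update as written on the slide; `grad` stands for the current minibatch gradient, and beta = 0.9 is a common default rather than a value from the slide (note that some frameworks fold the learning rate into the velocity instead):

```python
def momentum_step(theta, velocity, grad, lr=1e-2, beta=0.9):
    # v^{k+1} = beta * v^k + grad
    velocity = beta * velocity + grad
    # theta^{k+1} = theta^k - lr * v^{k+1}
    theta = theta - lr * velocity
    return theta, velocity
```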

SLIDE 6

RMSProp

X-direction: small gradients. Y-direction: large gradients.

Source: A. Ng

$\boldsymbol{s}^{k+1} = \beta \cdot \boldsymbol{s}^{k} + (1 - \beta)\,[\nabla_{\boldsymbol{\theta}} L \circ \nabla_{\boldsymbol{\theta}} L]$
$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \cdot \frac{\nabla_{\boldsymbol{\theta}} L}{\sqrt{\boldsymbol{s}^{k+1}} + \epsilon}$

We are dividing by the squared gradients:

  • Division in the Y-direction will be large
  • Division in the X-direction will be small

(Uncentered) variance of the gradients → second momentum. Can increase the learning rate!

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 6
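The same update as a short sketch (the decay rate 0.9 and epsilon are common defaults, not values from the slide):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=1e-3, beta=0.9, eps=1e-8):
    # s^{k+1} = beta * s^k + (1 - beta) * (grad * grad): running average of squared gradients
    s = beta * s + (1.0 - beta) * grad * grad
    # theta^{k+1} = theta^k - lr * grad / (sqrt(s^{k+1}) + eps)
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s
```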

SLIDE 7

Adam

  • Combines Momentum and RMSProp

$\boldsymbol{m}^{k+1} = \beta_1 \cdot \boldsymbol{m}^{k} + (1 - \beta_1)\, \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k})$
$\boldsymbol{v}^{k+1} = \beta_2 \cdot \boldsymbol{v}^{k} + (1 - \beta_2)\, \big[\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k}) \circ \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}^{k})\big]$

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 7

  • $\boldsymbol{m}^{k+1}$ and $\boldsymbol{v}^{k+1}$ are initialized with zero

→ bias towards zero → typically, bias-corrected moment updates:

$\hat{\boldsymbol{m}}^{k+1} = \frac{\boldsymbol{m}^{k+1}}{1 - \beta_1^{\,k+1}}$, $\quad \hat{\boldsymbol{v}}^{k+1} = \frac{\boldsymbol{v}^{k+1}}{1 - \beta_2^{\,k+1}}$

$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} - \alpha \cdot \frac{\hat{\boldsymbol{m}}^{k+1}}{\sqrt{\hat{\boldsymbol{v}}^{k+1}} + \epsilon}$
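Putting the Adam equations together as a small sketch (the defaults beta1 = 0.9, beta2 = 0.999, eps = 1e-8 are the commonly used values, not taken from the slide); `k` is the 1-based iteration counter used for bias correction:

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad               # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad * grad        # second moment (RMSProp-style)
    m_hat = m / (1.0 - beta1 ** k)                     # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** k)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```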

SLIDE 8

Training Neural Nets

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 8

SLIDE 9

Learning Rate: Implications

  • What if too high?
  • What if too low?

Source: http://cs231n.github.io/neural-networks-3/

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 9

SLIDE 10

Learning Rate

Need high learning rate when far away. Need low learning rate when close.

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 10

SLIDE 11

Learning Rate Decay

  • $\alpha = \frac{1}{1 + \mathit{decay\_rate} \cdot \mathit{epoch}} \cdot \alpha_0$

– E.g., $\alpha_0 = 0.1$, $\mathit{decay\_rate} = 1.0$ → Epoch 0: 0.1 → Epoch 1: 0.05 → Epoch 2: 0.033 → Epoch 3: 0.025 ...

[Plot: learning rate over epochs]

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 11
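The numeric example above can be reproduced in a couple of lines (a sketch, not library code):

```python
def decayed_lr(alpha0, decay_rate, epoch):
    # alpha = alpha0 / (1 + decay_rate * epoch)
    return alpha0 / (1.0 + decay_rate * epoch)

for epoch in range(4):
    print(epoch, round(decayed_lr(0.1, 1.0, epoch), 3))   # 0.1, 0.05, 0.033, 0.025
```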

SLIDE 12

Learning Rate Decay

Many options:

  • Step decay: $\alpha = \alpha - t \cdot \alpha$ (only every n steps)

– $t$ is the decay rate (often 0.5)

  • Exponential decay: $\alpha = t^{\mathit{epoch}} \cdot \alpha_0$

– $t$ is the decay rate ($t < 1.0$)

  • $\alpha = \frac{t}{\sqrt{\mathit{epoch}}} \cdot \alpha_0$

– $t$ is the decay rate

  • Etc.

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 12
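In practice these schedules are usually not hand-rolled; a hedged PyTorch sketch using built-in schedulers (the model, optimizer, and hyperparameter values are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step decay: multiply the learning rate by gamma every step_size epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# Exponential decay instead: lr = lr0 * gamma^epoch
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(30):
    # ... train one epoch ...
    scheduler.step()                                 # update the learning rate once per epoch
```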

SLIDE 13

Training Schedule

Manually specify learning rate for entire training process

  • Manually set learning rate every n-epochs
  • How?

– Trial and error (the hard way)
– Some experience (only generalizes to some degree)

Consider: #epochs, training set size, network size, etc.

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 13

SLIDE 14

Basic Recipe for Training

  • Given a dataset with ground truth labels

– $\{x_i, y_i\}$

  • $x_i$ is the $i$-th training image, with label $y_i$
  • Often $\dim(x) \gg \dim(y)$ (e.g., for classification)
  • $i$ is often in the 100-thousands or millions

– Take network $f$ and its parameters $\boldsymbol{W}, \boldsymbol{b}$
– Use SGD (or a variation) to find the optimal parameters $\boldsymbol{W}, \boldsymbol{b}$

  • Gradients from backpropagation

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 14

SLIDE 15

Gradient Descent on Train Set

  • Given a large train set with $n$ training samples $\{\boldsymbol{x}_i, \boldsymbol{y}_i\}$

– Let's say 1 million labeled images
– Let's say our network has 500k parameters

  • Gradient has 500k dimensions
  • $n$ = 1 million
  • Extremely expensive to compute

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 15

SLIDE 16

Learning

  • Learning means generalization to unknown dataset

– (So far no 'real' learning)
– I.e., train on known dataset → test with optimized parameters on unknown dataset

  • Basically, we hope that, based on the train set, the optimized parameters will give similar results on different data (i.e., test data)

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 16

SLIDE 17

Learning

  • Training set ('train'):

– Use for training your neural network

  • Validation set ('val'):

– Hyperparameter optimization
– Check generalization progress

  • Test set ('test'):

– Only for the very end
– NEVER TOUCH DURING DEVELOPMENT OR TRAINING

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 17

SLIDE 18

Learning

  • Typical splits

– Train (60%), Val (20%), Test (20%)
– Train (80%), Val (10%), Test (10%)

  • During training:

– Train error comes from average minibatch error
– Typically take subset of validation every n iterations

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 18
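A minimal sketch of such a split over shuffled indices (the 60/20/20 fractions follow the typical split mentioned above; the helper name is made up):

```python
import numpy as np

def split_dataset(num_samples, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle once, then cut into train / val / test index sets.
    idx = np.random.default_rng(seed).permutation(num_samples)
    n_val, n_test = int(val_frac * num_samples), int(test_frac * num_samples)
    val_idx, test_idx = idx[:n_val], idx[n_val:n_val + n_test]
    train_idx = idx[n_val + n_test:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_dataset(1000)   # 600 / 200 / 200 samples
```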

SLIDE 19

Basic Recipe for Machine Learning

  • Split your data

Find your hyperparameters

[Figure: data split into train (60%), validation (20%), test (20%)]

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 19

SLIDE 20

Basic Recipe for Machine Learning

  • Split your data

[Figure: data split into train (60%), validation (20%), test (20%)]

Ground truth error ......... 1%
Training set error ......... 5%
Val/test set error ......... 8%

Bias (underfitting): gap between ground truth and training error. Variance (overfitting): gap between training and val/test error.

Example scenario

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 20

SLIDE 21

Basic Recipe for Machine Learning

Credits: A. Ng


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 21

SLIDE 22

Over- and Underfitting

Underfitted Appropriate Overfitted

Source: Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 22

SLIDE 23

Over- and Underfitting

Source: https://srdas.github.io/DLBook/ImprovingModelGeneralization.html

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 23

SLIDE 24

Learning Curves

  • Training graphs
  • Accuracy
  • Loss

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 24

SLIDE 25

Learning Curves


Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 25

SLIDE 26

Overfitting Curves

Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 26

SLIDE 27

Other Curves

Underfitting (loss still decreasing)
Validation set is easier than training set

Source: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 27

SLIDE 28

To Summarize

  • Underfitting

– Training and validation losses decrease even at the end of training

  • Overfitting

– Training loss decreases and validation loss increases

  • Ideal Training

– Small gap between training and validation loss, and both go down at same rate (stable without fluctuations).

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 28

SLIDE 29

To Summarize

  • Bad Signs

– Training error not going down
– Validation error not going down
– Performance on validation better than on training set
– Tests on train set different than during training

  • Bad Practice

– Training set contains test data
– Debug algorithm on test data

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 29

Never touch during development or training

SLIDE 30

Hyperparameters

  • Network architecture (e.g., num layers, #weights)
  • Number of iterations
  • Learning rate(s) (i.e., solver parameters, decay, etc.)
  • Regularization (more in the next lecture)
  • Batch size
  • …
  • Overall:

learning setup + optimization = hyperparameters

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 30

SLIDE 31

Hyperparameter Tuning

  • Methods:

– Manual search:

  • most common 

– Grid search (structured, for 'real' applications)

  • Define ranges for all parameter spaces and select points
  • Usually pseudo-uniformly distributed

→ Iterate over all possible configurations

– Random search:

Like grid search but one picks points at random in the predefined ranges

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 31

[Plots: grid search vs. random search over two hyperparameters (First Parameter × Second Parameter)]
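A minimal sketch of random search over two such hyperparameters (the ranges and the log-uniform sampling are illustrative assumptions, and the training/evaluation helpers are hypothetical):

```python
import random

def sample_config():
    # Pick points at random within predefined ranges (here: log-uniform).
    return {
        "lr": 10 ** random.uniform(-4, -1),             # learning rate in [1e-4, 1e-1]
        "weight_decay": 10 ** random.uniform(-5, -3),   # weight decay in [1e-5, 1e-3]
    }

configs = [sample_config() for _ in range(20)]          # fixed evaluation budget
# best = max(configs, key=lambda cfg: validate(train(cfg)))   # hypothetical helpers
```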

SLIDE 32

How to Start

  • Start with single training sample

– Check if output correct
– Overfit → train accuracy should be 100% because the input is just memorized

  • Increase to handful of samples (e.g., 4)

– Check if input is handled correctly

  • Move from overfitting to more samples

– 5, 10, 100, 1000, ...
– At some point, you should see generalization

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 32
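A hedged PyTorch sketch of this sanity check on a single sample (the tiny network, data, and optimizer settings are placeholders, not from the slide):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
x, y = torch.randn(1, 10), torch.tensor([1])            # a single training sample
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for step in range(200):                                  # overfit the one sample
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# Loss should approach 0 and train accuracy 100%: the sample is simply memorized.
print(loss.item(), (model(x).argmax(dim=1) == y).float().mean().item())
```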


SLIDE 33

Find a Good Learning Rate

  • Use all training data with small weight decay
  • Perform an initial loss sanity check, e.g., $\log(C)$ for softmax with $C$ classes
  • Find a learning rate that makes the loss drop significantly (exponentially) within 100 iterations
  • Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 33

[Plot: loss over training time]
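The log(C) sanity check in a few lines (a sketch; the batch of all-zero logits just simulates an uninformative, freshly initialized classifier):

```python
import math
import torch

num_classes = 10
print(math.log(num_classes))                              # expected initial loss ≈ 2.303

logits = torch.zeros(8, num_classes)                      # uniform softmax predictions
targets = torch.randint(0, num_classes, (8,))
print(torch.nn.functional.cross_entropy(logits, targets).item())   # ≈ 2.303
```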

SLIDE 34

Coarse Grid Search

  • Choose a few values of learning rate and weight decay around what worked from the previous step

  • Train a few models for a few epochs.
  • Good weight decay to try: 1e-4, 1e-5, 0

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 34

[Plot: grid search over two hyperparameters (First Parameter × Second Parameter)]
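A minimal sketch of such a coarse grid over learning rate and weight decay (the values reuse the ones suggested above; the training/evaluation helper is hypothetical):

```python
import itertools

learning_rates = [1e-1, 1e-2, 1e-3, 1e-4]
weight_decays = [1e-4, 1e-5, 0]

results = {}
for lr, wd in itertools.product(learning_rates, weight_decays):
    # results[(lr, wd)] = train_for_a_few_epochs(lr=lr, weight_decay=wd)  # hypothetical helper
    pass
# best_lr, best_wd = max(results, key=results.get)
```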

SLIDE 35

Refine Grid

  • Pick best models found with coarse grid.
  • Refine grid search around these models.
  • Train them for longer (10-20 epochs) without learning rate decay

  • Study loss curves <- most important debugging tool!

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 35

SLIDE 36

Timings

  • How long does each iteration take?

– Get precise timings!
– If an iteration exceeds 500ms, things get dicey

  • Look for bottlenecks

– Dataloading: smaller resolution, compression, train from SSD
– Backprop

  • Estimate total time

– How long until you see some pattern?
– How long till convergence?

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 36
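A minimal sketch of getting such per-iteration timings, split into data loading and compute (the two helpers are hypothetical placeholders for your dataloader and forward/backward step):

```python
import time

def timed_iteration(fetch_batch, train_step):
    t0 = time.perf_counter()
    batch = fetch_batch()                  # data loading
    t1 = time.perf_counter()
    train_step(batch)                      # forward + backward + optimizer step
    t2 = time.perf_counter()
    print(f"data: {(t1 - t0) * 1e3:.1f} ms, compute: {(t2 - t1) * 1e3:.1f} ms")
```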

SLIDE 37

Network Architecture

  • Frequent mistake: "Let's use this super big network, train for two weeks, and see where we stand."

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 37

  • Instead: start with the simplest network possible

– Rule of thumb: divide the #layers you started with by 5

  • Get debug cycles down

– Ideally, minutes

SLIDE 38

Debugging

  • Use train/validation/test curves

– Evaluation needs to be consistent
– Numbers need to be comparable

  • Only make one change at a time

– "I've added 5 more layers and doubled the training size, and now I also trained 5 days longer. Now it's better, but why?"

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 38

SLIDE 39

Common Mistakes in Practice

  • Did not overfit to single batch first
  • Forgot to toggle train/eval mode for network

– Check later when we talk about dropout…

  • Forgot to call .zero_grad() (in PyTorch) before calling .backward()
  • Passed softmaxed outputs to a loss function that expects raw logits

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 39
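A hedged PyTorch sketch of a training loop that avoids the mistakes listed above (the model, data, and hyperparameters are dummy placeholders):

```python
import torch

model = torch.nn.Linear(784, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = torch.nn.CrossEntropyLoss()      # expects raw logits (applies log-softmax internally)
train_loader = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(3)]  # dummy data

model.train()                                # toggle train mode (matters for dropout/batchnorm)
for x, y in train_loader:
    optimizer.zero_grad()                    # clear old gradients before backward()
    logits = model(x)                        # pass raw logits to the loss, not softmaxed outputs
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()

model.eval()                                 # switch to eval mode before validation/testing
```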

SLIDE 40

Tensorboard: Visualization in Practice

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 40
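A minimal sketch of logging scalars to TensorBoard with PyTorch's SummaryWriter (the run directory, tag names, and dummy loss values are made up); view the curves with `tensorboard --logdir runs`:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/experiment_1")
for epoch in range(10):
    train_loss, val_loss = 1.0 / (epoch + 1), 1.2 / (epoch + 1)   # dummy values
    writer.add_scalar("Loss/train", train_loss, epoch)
    writer.add_scalar("Loss/val", val_loss, epoch)
writer.close()
```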

SLIDE 41

Tensorboard: Compare Train/Val Curves


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 41

SLIDE 42

Tensorboard: Compare Different Runs


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 42

SLIDE 43

Tensorboard: Visualize Model Predictions


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 43

SLIDE 44

Tensorboard: Visualize Model Predictions


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 44

SLIDE 45

Tensorboard: Compare Hyperparameters


I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 45

SLIDE 46

Next Lecture

  • Next lecture

– More about training neural networks: output functions, loss functions, activation functions

  • Check the exercises 

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 46

SLIDE 47

See you next week 

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 47

SLIDE 48

References

  • Goodfellow et al. "Deep Learning" (2016),

– Chapter 6: Deep Feedforward Networks

  • Bishop "Pattern Recognition and Machine Learning" (2006),

– Chapter 5.5: Regularization in Neural Networks

  • http://cs231n.github.io/neural-networks-1/
  • http://cs231n.github.io/neural-networks-2/
  • http://cs231n.github.io/neural-networks-3/

I2DL: Prof. Niessner, Prof. Leal-TaixΓ© 48