SLIDE 1

Hyper-parameters/Tweaking

Yufeng Ma, Chris Dusold

Virginia Tech

November 17, 2015


SLIDE 2

Overview

1. Batch Normalization
   - Internal Covariate Shift
   - Mini-Batch Normalization
   - Key Points in Batch Normalization
   - Experiments and Results

2. Importance of Initialization and Momentum
   - Overview of first-order methods
   - Momentum & Nesterov's Accelerated Gradient (NAG)
   - Deep Autoencoders & RNN - Echo-State Networks

SLIDE 3

Challenges to be solved

Reference paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

When training a deep network with saturating nonlinearities, the usual remedies are:
- Use lower/smaller learning rates
- Initialize the weights from Gaussian distributions

figure credit: www.regentsprep.org

SLIDE 4

Challenges to be solved

Reasons behind the problem:
- Parameters change during training
- The input distribution of each layer therefore changes

[Figure: Sigmoid's output distribution before and after parameter updates]
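The shift is easy to reproduce numerically. A minimal sketch (not from the slides; the layer width, batch size, and update magnitude are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fixed input batch and one hidden unit's incoming weights.
x = rng.normal(size=(10000, 100))
W = rng.normal(scale=0.5, size=(100, 1))

# Output distribution before a parameter update.
before = sigmoid(x @ W)

# Simulate training changing the upstream parameters.
W_updated = W + rng.normal(scale=0.2, size=W.shape)
after = sigmoid(x @ W_updated)

# Any layer downstream of this sigmoid now sees a shifted input distribution.
print("mean/std before:", before.mean(), before.std())
print("mean/std after: ", after.mean(), after.std())
```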


SLIDE 6

Internal Covariate Shift

Covariate Shift
- Change of input distributions to a learning system
- Extension to a part of the network or a sub-network:

ℓ = F2(F1(u, Θ1), Θ2)
ℓ = F2(x, Θ2), where x = F1(u, Θ1)

Θ2 ← Θ2 − (α/m) Σ_{i=1}^{m} ∂F2(x_i, Θ2)/∂Θ2

If the distribution of x stays fixed, Θ2 does not need to readjust to compensate for changes in that distribution.

Internal Covariate Shift
- Change in the distributions of internal nodes of a deep network in the course of training

SLIDE 7

Reducing Internal Covariate Shift

Whitening (LeCun et al., 1998b; Wiesler & Ney, 2011)

Network training converges faster if its inputs are whitened, i.e., linearly transformed to have zero mean and unit variance, and decorrelated.

Goal: whiten the inputs of each layer so they have fixed distributions, in order to reduce the ill effects of internal covariate shift.

SLIDE 8

Reducing Internal Covariate Shift

Interspersing normalization with gradient descent steps causes trouble when the gradient step ignores the normalization. For a layer x = u + b, centered as x̂ = x − E[x], the update is

b ← b + Δb, where Δb ∝ −∂ℓ/∂x̂ (the contribution of ∂x̂/∂b through E[x] is ignored)

After the update:

x̂ = x − E[x] = u + (b + Δb) − E[u + (b + Δb)] = u + b − E[u + b]

The normalized output (and hence the loss) is unchanged while b keeps growing: normalizations are NOT taken into account in the gradient descent optimization.

SLIDE 9

Reducing Internal Covariate Shift

Introduce normalization over the whole training set X, x̂ = Norm(x, X), with the Jacobians

∂Norm(x, X)/∂x and ∂Norm(x, X)/∂X

required in backpropagation.

New challenge: it is expensive to compute the covariance matrix and its inverse square root.

Covariance matrix: Cov[x] = E_{x∈X}[x xᵀ] − E[x] E[x]ᵀ
Whitening: Cov[x]^{−1/2} (x − E[x])
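A numpy sketch makes the cost concrete (sizes are hypothetical; the inverse square root is computed here via eigendecomposition, one standard way): forming and inverting a d×d covariance is O(d³), and it would have to be redone as the activation distribution changes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 2048, 64
x = rng.normal(size=(m, d)) @ rng.normal(size=(d, d))  # correlated features

# Covariance: Cov[x] = E[x x^T] - E[x] E[x]^T
mu = x.mean(axis=0)
cov = (x - mu).T @ (x - mu) / m

# Inverse square root via eigendecomposition: O(d^3) work.
eigval, eigvec = np.linalg.eigh(cov + 1e-5 * np.eye(d))
cov_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# Whitened activations: zero mean, (approximately) identity covariance.
x_white = (x - mu) @ cov_inv_sqrt
print(np.allclose(np.cov(x_white, rowvar=False), np.eye(d), atol=1e-1))
```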


SLIDE 11

Mini-Batch Normalization

Two simplifications, plus an identity transform:
- Normalize each scalar feature independently
- Use the mini-batch to estimate the mean and variance instead of the whole population
- Ensure the identity transform can be represented:

y^(k) = γ^(k) x̂^(k) + β^(k)

Two new parameters per activation, γ^(k) and β^(k), are introduced and learned.

Batch Normalization Transform: see the reference paper for details.
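A minimal numpy sketch of the transform on one mini-batch in training mode (the ε inside the square root and the γ = 1, β = 0 initialization follow common convention and are my assumptions, not from the slides):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform over a mini-batch x of shape (m, d).

    Each scalar feature is normalized independently using mini-batch
    statistics, then scaled and shifted: y = gamma * x_hat + beta.
    """
    mu = x.mean(axis=0)                    # mini-batch mean, per feature
    var = x.var(axis=0)                    # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta               # identity transform stays representable
    return y, mu, var

rng = np.random.default_rng(0)
x = 3.0 + 2.0 * rng.normal(size=(64, 8))   # mini-batch of 64, 8 features
gamma, beta = np.ones(8), np.zeros(8)      # gamma=1, beta=0 at initialization
y, mu, var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```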


SLIDE 13

Key Points in Batch Normalization

The original parameters and the newly introduced γ and β are trained jointly.

At inference time, statistics over the whole training population replace the mini-batch estimates:

E[x] ← E_B[µ_B]
Var[x] ← (m / (m − 1)) · E_B[σ_B²]

In convolutional layers, different locations of a feature map are normalized in the same way. For feature maps of size p × q and mini-batches of size m:

m′ = |B| = m · pq, with one pair γ^(k), β^(k) per feature map
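A sketch of the inference-time statistics, assuming per-mini-batch means and (biased, divide-by-m) variances were recorded during training; the helper names are mine, and the m/(m − 1) unbiased correction matches the slide:

```python
import numpy as np

def population_stats(batch_means, batch_vars, m):
    """Estimate population statistics from per-mini-batch statistics.

    E[x]   <- E_B[mu_B]
    Var[x] <- m/(m-1) * E_B[sigma_B^2]   (unbiased variance estimate)
    """
    mean = np.mean(batch_means, axis=0)
    var = m / (m - 1) * np.mean(batch_vars, axis=0)
    return mean, var

def batch_norm_inference(x, gamma, beta, mean, var, eps=1e-5):
    # At inference, BN is a fixed linear transform: no batch statistics needed.
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
m, d = 32, 4
batches = [5.0 + 2.0 * rng.normal(size=(m, d)) for _ in range(100)]
means = [b.mean(axis=0) for b in batches]
vars_ = [b.var(axis=0) for b in batches]   # biased (divide by m) batch variances

mean, var = population_stats(means, vars_, m)
print(mean.round(2), var.round(2))         # ~5 and ~4 per feature
x_new = 5.0 + 2.0 * rng.normal(size=(8, d))
y = batch_norm_inference(x_new, np.ones(d), np.zeros(d), mean, var)
```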

SLIDE 14

Key Points in Batch Normalization

Higher learning rates are allowed: batch normalization makes the layer invariant to the scale of its weights,

BN(Wu) = BN((aW)u)

∂BN((aW)u)/∂u = ∂BN(Wu)/∂u
∂BN((aW)u)/∂(aW) = (1/a) · ∂BN(Wu)/∂W

so larger weights lead to smaller gradients, stabilizing parameter growth.

Batch Normalization also regularizes the model, reducing overfitting.
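The scale invariance is easy to verify numerically; a minimal sketch (layer sizes and the scale a are arbitrary choices, with γ = 1, β = 0):

```python
import numpy as np

def bn(z, eps=1e-12):
    # Normalize each output unit over the mini-batch (gamma=1, beta=0).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(128, 16))     # mini-batch of inputs
W = rng.normal(size=(16, 8))       # layer weights
a = 10.0                           # arbitrary scale on the weights

# BN(Wu) == BN((aW)u): the scale a cancels in the normalization.
print(np.allclose(bn(u @ W), bn(u @ (a * W))))   # True
```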


SLIDE 16

Activations over time

Batch Normalization helps the network train faster and achieve higher accuracy.

figure credit: reference paper

SLIDE 17

Activations over time

Batch Normalization makes the input distribution of each layer more stable.

figure credit: reference paper

SLIDE 18

Accelerating Batch Normalization Networks

Tricks to follow:
- Increase the learning rate
- Remove or reduce Dropout
- Reduce ℓ2 weight regularization
- Accelerate the learning rate decay
- Remove Local Response Normalization
- Shuffle training examples more thoroughly
- Reduce photometric distortions

SLIDE 19

Network Comparisons

Inception, BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid

figure credit: reference paper

SLIDE 20

Ensemble Classification

Top-5 validation error of 4.9% and test error of 4.82%, which exceeds the estimated accuracy of human raters.

figure credit: reference paper


SLIDE 22

Challenges to be solved

Reference paper: On the importance of initialization and momentum in deep learning

It is difficult for first-order methods to reach performance previously only achievable by second-order methods like Hessian-Free optimization. With:
- a well-designed random initialization, and
- a slowly increasing schedule for the momentum parameter,

there is no need for sophisticated second-order methods.

SLIDE 23

Overview of first-order methods

First-order Methods:
- Vanilla Stochastic Gradient Descent
- SGD + Momentum
- Nesterov's Accelerated Gradient (NAG)
- AdaGrad
- Adam
- Rprop
- RMSProp
- AdaDelta

slide credit: Ishan Misra


SLIDE 25

Several First-order Methods

Notation: θ - parameters of the network, f - objective function, ε - learning rate, ∇f - gradient of f, v - velocity vector, µ - momentum coefficient

Vanilla SGD:

v_{t+1} = ε ∇f(θ_t)
θ_{t+1} = θ_t − v_{t+1}

slide credit: Ishan Misra
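A sketch of the vanilla update on a toy quadratic (the objective, conditioning, and learning rate are illustrative choices, not from the slides):

```python
import numpy as np

# Toy objective f(theta) = 0.5 * theta^T A theta, with gradient A theta.
A = np.diag([1.0, 10.0])                 # mildly ill-conditioned quadratic
grad_f = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
eps = 0.05                               # learning rate
for t in range(500):
    v = eps * grad_f(theta)              # v_{t+1} = eps * grad f(theta_t)
    theta = theta - v                    # theta_{t+1} = theta_t - v_{t+1}
print(theta)                             # close to the minimum at the origin
```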

SLIDE 26

Several First-order Methods

Rprop update:

if ∇f_t · ∇f_{t−1} > 0:   v_t = η⁺ v_{t−1}
else if ∇f_t · ∇f_{t−1} < 0:   v_t = η⁻ v_{t−1}
else:   v_t = v_{t−1}
θ_{t+1} = θ_t − v_t,   where 0 < η⁻ < 1 < η⁺

slide credit: Ishan Misra
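A per-parameter sketch of the rule. Note that full Rprop also steps with the sign of the current gradient, a detail the slide's shorthand omits; the η⁺ = 1.2, η⁻ = 0.5, and initial step values are conventional choices, not from the slide:

```python
import numpy as np

def rprop_step(theta, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5):
    """One Rprop update: grow a parameter's step size while its gradient
    sign persists, shrink it when the sign flips, keep it otherwise.
    Only the sign of the gradient is used for the parameter move."""
    same = grad * prev_grad > 0
    flip = grad * prev_grad < 0
    step = np.where(same, eta_plus * step,
           np.where(flip, eta_minus * step, step))
    theta = theta - np.sign(grad) * step
    return theta, step

A = np.diag([1.0, 100.0])
grad_f = lambda th: A @ th

theta = np.array([2.0, 2.0])
step = np.full(2, 0.1)                   # initial per-parameter step sizes
prev_grad = np.zeros(2)
for t in range(100):
    g = grad_f(theta)
    theta, step = rprop_step(theta, g, prev_grad, step)
    prev_grad = g
print(theta.round(3))                    # settles near the origin
```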

SLIDE 27

Several First-order Methods

AdaGrad:

r_t = ∇f(θ_t)² + r_{t−1}
v_{t+1} = (α / √r_t) ∇f(θ_t)
θ_{t+1} = θ_t − v_{t+1}

RMSProp = Rprop + SGD:

r_t = (1 − γ) ∇f(θ_t)² + γ r_{t−1}
v_{t+1} = (α / √r_t) ∇f(θ_t)
θ_{t+1} = θ_t − v_{t+1}

slide credit: Ishan Misra
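A sketch of both accumulator updates side by side, on the same toy quadratic (α, γ, and the small δ added inside the square root for numerical safety are illustrative choices). The only difference is that AdaGrad's accumulator keeps growing, so its steps shrink over time, while RMSProp's leaky average forgets old gradients:

```python
import numpy as np

A = np.diag([1.0, 10.0])
grad_f = lambda th: A @ th
alpha, gamma, delta = 0.1, 0.9, 1e-8

def adagrad(theta, steps=500):
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        r = g**2 + r                        # r_t = grad^2 + r_{t-1}: keeps growing
        theta = theta - alpha / np.sqrt(r + delta) * g
    return theta

def rmsprop(theta, steps=500):
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        r = (1 - gamma) * g**2 + gamma * r  # leaky average instead of a sum
        theta = theta - alpha / np.sqrt(r + delta) * g
    return theta

# AdaGrad creeps toward the origin; RMSProp hovers near it at a scale set by alpha.
print(adagrad(np.array([1.0, 1.0])))
print(rmsprop(np.array([1.0, 1.0])))
```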

SLIDE 28

Several First-order Methods

AdaDelta (units argument): a Newton-style update has the right units for θ, unlike a plain gradient step:

v_{t+1} = H⁻¹∇f ∝ f′/f″ ∝ (1/units of θ) / (1/units of θ)² ∝ units of θ

Adam:

r_t = (1 − γ1) ∇f(θ_t) + γ1 r_{t−1}
p_t = (1 − γ2) ∇f(θ_t)² + γ2 p_{t−1}
r̂_t = r_t / (1 − γ1^t)
p̂_t = p_t / (1 − γ2^t)
v_t = α r̂_t / √p̂_t
θ_{t+1} = θ_t − v_t

slide credit: Ishan Misra
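A sketch of the Adam update as parameterized above (hyper-parameter values are illustrative; the small `delta` added for numerical safety is a standard addition, not on the slide):

```python
import numpy as np

A = np.diag([1.0, 10.0])
grad_f = lambda th: A @ th
alpha, g1, g2, delta = 0.1, 0.9, 0.999, 1e-8   # illustrative hyper-parameters

theta = np.array([1.0, 1.0])
r = np.zeros(2)   # first moment: moving average of gradients
p = np.zeros(2)   # second moment: moving average of squared gradients
for t in range(1, 1001):
    g = grad_f(theta)
    r = (1 - g1) * g + g1 * r
    p = (1 - g2) * g**2 + g2 * p
    r_hat = r / (1 - g1**t)          # bias corrections for the zero init
    p_hat = p / (1 - g2**t)
    theta = theta - alpha * r_hat / (np.sqrt(p_hat) + delta)
print(theta.round(4))                # approaches the minimum at the origin
```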


SLIDE 30

Momentum and NAG

Notation: θ - parameters of the network, f - objective function, ε - learning rate, ∇f - gradient of f, v - velocity vector, µ - momentum coefficient

Classical Momentum:

v_{t+1} = µ v_t − ε ∇f(θ_t)
θ_{t+1} = θ_t + v_{t+1}

Nesterov's Accelerated Gradient:

v_{t+1} = µ v_t − ε ∇f(θ_t + µ v_t)
θ_{t+1} = θ_t + v_{t+1}
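A side-by-side sketch of the two updates (the toy quadratic, ε, µ, and the `nag` flag are mine); the only difference is where the gradient is evaluated:

```python
import numpy as np

A = np.diag([1.0, 50.0])               # high curvature in one direction
grad_f = lambda th: A @ th
eps, mu = 0.02, 0.9

def run(nag, steps=100):
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        lookahead = theta + mu * v if nag else theta
        v = mu * v - eps * grad_f(lookahead)   # NAG: gradient at the partial update
        theta = theta + v
    return theta

print("CM :", run(nag=False).round(4))
print("NAG:", run(nag=True).round(4))
```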

SLIDE 31

Relationship between CM and NAG

NAG evaluates the gradient at θ_t + µv_t, the momentum step without the yet-unknown gradient correction. Thus, when adding µv_t results in an immediate undesirable increase in the objective f, ∇f(θ_t + µv_t) points back toward θ_t more strongly than ∇f(θ_t) does, so NAG corrects the velocity more quickly and reliably than CM.

figure credit: reference paper

SLIDE 32

Relationship between CM and NAG

Apply CM and NAG to a positive definite quadratic objective q(x) = xᵀAx/2 + bᵀx.

Difference in effective momentum coefficient, per eigendirection of A:
- Classical Momentum: µ
- NAG: µ(1 − λ_i ε), where λ_i is the eigenvalue of A for that direction

When ε is small, CM and NAG are nearly equivalent; when ε is large, NAG's smaller effective momentum µ(1 − λ_i ε) suppresses oscillations along high-curvature directions.


SLIDE 34

Deep Autoencoders

Structure of a Deep Autoencoder

figure credit: http://deeplearning4j.org/deepautoencoder.html

SLIDE 35

Deep Autoencoders

Sparse initialization: each unit is connected to 15 randomly chosen units in the previous layer, with weights drawn from a unit Gaussian.

Schedule for the momentum coefficient:

µ_t = min(1 − 2^(−1 − log₂(⌊t/250⌋ + 1)), µ_max)

Related schedules: µ_t = 1 − 3/(t + 5) for not strongly convex objectives - Nesterov (1983); constant µ_t for strongly convex objectives - Nesterov (2003).

table credit: reference paper
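The schedule is easy to read off in code; a minimal sketch (the helper name is mine, and µ_max = 0.99 is one of the values explored in the reference paper):

```python
import math

def momentum_schedule(t, mu_max=0.99):
    """mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max)."""
    return min(1.0 - 2.0 ** (-1 - math.log2(t // 250 + 1)), mu_max)

# The coefficient climbs toward 1 in jumps every 250 steps, capped at mu_max.
for t in [0, 250, 500, 1000, 2500, 10000]:
    print(t, round(momentum_schedule(t), 4))
```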

SLIDE 36

RNN - Echo-State Networks

Echo-State Networks (a family of RNNs):
- Hidden-to-output connections are learned from data
- Recurrent connections are fixed to a random draw from a specific distribution

figure credit: Mantas Lukoševičius

SLIDE 37

RNN - Echo-State Networks

ESN-based initialization:
- Spectral radius of the hidden-to-hidden matrix around 1.1
- The initial scale of the input-to-hidden connections plays an important role (a Gaussian draw with a standard deviation of 0.001 achieves a good balance, but this is task dependent)

Schedule of the momentum coefficient µ:
- µ = 0.9 for the first 1000 parameter updates;
- µ = µ0 ∈ {0, 0.9, 0.98, 0.995} afterwards

table credit: reference paper
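A sketch of this initialization recipe (matrix sizes are hypothetical, and using a dense Gaussian recurrent matrix is a simplification; ESNs often use sparse recurrent matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 200

# Input-to-hidden: small Gaussian scale; flagged on the slide as task dependent.
W_in = 0.001 * rng.normal(size=(n_hid, n_in))

# Hidden-to-hidden: random Gaussian, rescaled to spectral radius ~1.1.
W_hh = rng.normal(size=(n_hid, n_hid))
radius = max(abs(np.linalg.eigvals(W_hh)))
W_hh *= 1.1 / radius

print(max(abs(np.linalg.eigvals(W_hh))))   # ~1.1
```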


SLIDE 39

Questions?

SLIDE 40

Chris Dusold's Part

- Variance-SGD (No More Pesky Learning Rates)
- Adam (Adam: A Method for Stochastic Optimization)
- AdaGrad (Adaptive Subgradient Methods for Online Learning and Stochastic Optimization)