SLIDE 1

Hyper-parameters/Tweaking

Yufeng Ma, Chris Dusold

Virginia Tech

November 17, 2015


SLIDE 2

Overview

1. Batch Normalization
   - Internal Covariate Shift
   - Mini-Batch Normalization
   - Key Points in Batch Normalization
   - Experiments and Results

2. Importance of Initialization and Momentum
   - Overview of first-order methods
   - Momentum & Nesterov's Accelerated Gradient (NAG)
   - Deep Autoencoders & RNN - Echo-State Networks

SLIDE 3

Challenges to be solved

Reference paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

When training a deep network with saturating nonlinearities, the usual remedies are:
- Use lower/smaller learning rates
- Initialize the weights from Gaussian distributions

figure credit: www.regentsprep.org

SLIDE 4

Challenges to be solved

Reasons behind the problem:
- Parameters change during training
- The input distribution of each layer therefore changes

[Figure: Sigmoid's output distribution before and after parameter updates]
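The shift is easy to reproduce numerically. A minimal sketch (not from the slides; the layer width, batch size, and update magnitude are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fixed input batch and one hidden unit's incoming weights.
x = rng.normal(size=(10000, 100))
W = rng.normal(scale=0.5, size=(100, 1))

# Output distribution before a parameter update.
before = sigmoid(x @ W)

# Simulate training changing the upstream parameters.
W_updated = W + rng.normal(scale=0.2, size=W.shape)
after = sigmoid(x @ W_updated)

# Any layer downstream of this sigmoid now sees a shifted input distribution.
print("mean/std before:", before.mean(), before.std())
print("mean/std after: ", after.mean(), after.std())
```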


SLIDE 6

Internal Covariate Shift

Covariate Shift
- Change of input distributions to a learning system
- Extension to a part of the network or a sub-network:

ℓ = F2(F1(u, Θ1), Θ2)
ℓ = F2(x, Θ2), where x = F1(u, Θ1)

Θ2 ← Θ2 − (α/m) Σ_{i=1}^{m} ∂F2(x_i, Θ2)/∂Θ2

If the distribution of x stays fixed, Θ2 does not need to readjust to compensate for changes in that distribution.

Internal Covariate Shift
- Change in the distributions of internal nodes of a deep network in the course of training

SLIDE 7

Reducing Internal Covariate Shift

Whitening (LeCun et al., 1998b; Wiesler & Ney, 2011)

Network training converges faster if its inputs are whitened, i.e., linearly transformed to have zero mean and unit variance, and decorrelated.

Goal: whiten the inputs of each layer so they have fixed distributions, in order to reduce the ill effects of internal covariate shift.

SLIDE 8

Reducing Internal Covariate Shift

Interspersing normalization with gradient descent steps causes trouble when the gradient step ignores the normalization. For a layer x = u + b, centered as x̂ = x − E[x], the update is

b ← b + Δb, where Δb ∝ −∂ℓ/∂x̂ (the contribution of ∂x̂/∂b through E[x] is ignored)

After the update:

x̂ = x − E[x] = u + (b + Δb) − E[u + (b + Δb)] = u + b − E[u + b]

The normalized output (and hence the loss) is unchanged while b keeps growing: normalizations are NOT taken into account in the gradient descent optimization.

SLIDE 9

Reducing Internal Covariate Shift

Introduce normalization over the whole training set X, x̂ = Norm(x, X), with the Jacobians

∂Norm(x, X)/∂x and ∂Norm(x, X)/∂X

required in backpropagation.

New challenge: it is expensive to compute the covariance matrix and its inverse square root.

Covariance matrix: Cov[x] = E_{x∈X}[x xᵀ] − E[x] E[x]ᵀ
Whitening: Cov[x]^{−1/2} (x − E[x])
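A numpy sketch makes the cost concrete (sizes are hypothetical; the inverse square root is computed here via eigendecomposition, one standard way): forming and inverting a d×d covariance is O(d³), and it would have to be redone as the activation distribution changes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 2048, 64
x = rng.normal(size=(m, d)) @ rng.normal(size=(d, d))  # correlated features

# Covariance: Cov[x] = E[x x^T] - E[x] E[x]^T
mu = x.mean(axis=0)
cov = (x - mu).T @ (x - mu) / m

# Inverse square root via eigendecomposition: O(d^3) work.
eigval, eigvec = np.linalg.eigh(cov + 1e-5 * np.eye(d))
cov_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# Whitened activations: zero mean, (approximately) identity covariance.
x_white = (x - mu) @ cov_inv_sqrt
print(np.allclose(np.cov(x_white, rowvar=False), np.eye(d), atol=1e-1))
```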


SLIDE 11

Mini-Batch Normalization

Two simplifications, plus an identity transform:
- Normalize each scalar feature independently
- Use the mini-batch to estimate the mean and variance instead of the whole population
- Ensure the identity transform can be represented:

y^(k) = γ^(k) x̂^(k) + β^(k)

Two new parameters per activation, γ^(k) and β^(k), are introduced and learned.

Batch Normalization Transform: see the reference paper for details.
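A minimal numpy sketch of the transform on one mini-batch in training mode (the ε inside the square root and the γ = 1, β = 0 initialization follow common convention and are my assumptions, not from the slides):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform over a mini-batch x of shape (m, d).

    Each scalar feature is normalized independently using mini-batch
    statistics, then scaled and shifted: y = gamma * x_hat + beta.
    """
    mu = x.mean(axis=0)                    # mini-batch mean, per feature
    var = x.var(axis=0)                    # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta               # identity transform stays representable
    return y, mu, var

rng = np.random.default_rng(0)
x = 3.0 + 2.0 * rng.normal(size=(64, 8))   # mini-batch of 64, 8 features
gamma, beta = np.ones(8), np.zeros(8)      # gamma=1, beta=0 at initialization
y, mu, var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```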


SLIDE 13

Key Points in Batch Normalization

The original parameters and the newly introduced γ and β are trained jointly.

At inference time, statistics over the whole training population replace the mini-batch estimates:

E[x] ← E_B[µ_B]
Var[x] ← (m / (m − 1)) · E_B[σ_B²]

In convolutional layers, different locations of a feature map are normalized in the same way. For feature maps of size p × q and mini-batches of size m:

m′ = |B| = m · pq, with one pair γ^(k), β^(k) per feature map
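A sketch of the inference-time statistics, assuming per-mini-batch means and (biased, divide-by-m) variances were recorded during training; the helper names are mine, and the m/(m − 1) unbiased correction matches the slide:

```python
import numpy as np

def population_stats(batch_means, batch_vars, m):
    """Estimate population statistics from per-mini-batch statistics.

    E[x]   <- E_B[mu_B]
    Var[x] <- m/(m-1) * E_B[sigma_B^2]   (unbiased variance estimate)
    """
    mean = np.mean(batch_means, axis=0)
    var = m / (m - 1) * np.mean(batch_vars, axis=0)
    return mean, var

def batch_norm_inference(x, gamma, beta, mean, var, eps=1e-5):
    # At inference, BN is a fixed linear transform: no batch statistics needed.
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
m, d = 32, 4
batches = [5.0 + 2.0 * rng.normal(size=(m, d)) for _ in range(100)]
means = [b.mean(axis=0) for b in batches]
vars_ = [b.var(axis=0) for b in batches]   # biased (divide by m) batch variances

mean, var = population_stats(means, vars_, m)
print(mean.round(2), var.round(2))         # ~5 and ~4 per feature
x_new = 5.0 + 2.0 * rng.normal(size=(8, d))
y = batch_norm_inference(x_new, np.ones(d), np.zeros(d), mean, var)
```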

SLIDE 14

Key Points in Batch Normalization

Higher learning rates are allowed: batch normalization makes the layer invariant to the scale of its weights,

BN(Wu) = BN((aW)u)

∂BN((aW)u)/∂u = ∂BN(Wu)/∂u
∂BN((aW)u)/∂(aW) = (1/a) · ∂BN(Wu)/∂W

so larger weights lead to smaller gradients, stabilizing parameter growth.

Batch Normalization also regularizes the model, reducing overfitting.
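The scale invariance is easy to verify numerically; a minimal sketch (layer sizes and the scale a are arbitrary choices, with γ = 1, β = 0):

```python
import numpy as np

def bn(z, eps=1e-12):
    # Normalize each output unit over the mini-batch (gamma=1, beta=0).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(128, 16))     # mini-batch of inputs
W = rng.normal(size=(16, 8))       # layer weights
a = 10.0                           # arbitrary scale on the weights

# BN(Wu) == BN((aW)u): the scale a cancels in the normalization.
print(np.allclose(bn(u @ W), bn(u @ (a * W))))   # True
```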


SLIDE 16

Activations over time

Batch Normalization helps the network train faster and achieve higher accuracy.

figure credit: reference paper

SLIDE 17

Activations over time

Batch Normalization makes the input distribution of each layer more stable.

figure credit: reference paper

SLIDE 18

Accelerating Batch Normalization Networks

Tricks to follow:
- Increase the learning rate
- Remove or reduce Dropout
- Reduce ℓ2 weight regularization
- Accelerate the learning rate decay
- Remove Local Response Normalization
- Shuffle training examples more thoroughly
- Reduce photometric distortions

SLIDE 19

Network Comparisons

Inception, BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid

figure credit: reference paper

SLIDE 20

Ensemble Classification

Top-5 validation error of 4.9% and test error of 4.82%, which exceeds the estimated accuracy of human raters.

figure credit: reference paper


SLIDE 22

Challenges to be solved

Reference paper: On the importance of initialization and momentum in deep learning

It is difficult for first-order methods to reach performance previously only achievable by second-order methods like Hessian-Free optimization. With:
- a well-designed random initialization, and
- a slowly increasing schedule for the momentum parameter,

there is no need for sophisticated second-order methods.

SLIDE 23

Overview of first-order methods

First-order Methods:
- Vanilla Stochastic Gradient Descent
- SGD + Momentum
- Nesterov's Accelerated Gradient (NAG)
- AdaGrad
- Adam
- Rprop
- RMSProp
- AdaDelta

slide credit: Ishan Misra


SLIDE 25

Several First-order Methods

Notation: θ - parameters of the network, f - objective function, ε - learning rate, ∇f - gradient of f, v - velocity vector, µ - momentum coefficient

Vanilla SGD:

v_{t+1} = ε ∇f(θ_t)
θ_{t+1} = θ_t − v_{t+1}

slide credit: Ishan Misra
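A sketch of the vanilla update on a toy quadratic (the objective, conditioning, and learning rate are illustrative choices, not from the slides):

```python
import numpy as np

# Toy objective f(theta) = 0.5 * theta^T A theta, with gradient A theta.
A = np.diag([1.0, 10.0])                 # mildly ill-conditioned quadratic
grad_f = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
eps = 0.05                               # learning rate
for t in range(500):
    v = eps * grad_f(theta)              # v_{t+1} = eps * grad f(theta_t)
    theta = theta - v                    # theta_{t+1} = theta_t - v_{t+1}
print(theta)                             # close to the minimum at the origin
```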

SLIDE 26

Several First-order Methods

Rprop update:

if ∇f_t · ∇f_{t−1} > 0:   v_t = η⁺ v_{t−1}
else if ∇f_t · ∇f_{t−1} < 0:   v_t = η⁻ v_{t−1}
else:   v_t = v_{t−1}
θ_{t+1} = θ_t − v_t,   where 0 < η⁻ < 1 < η⁺

slide credit: Ishan Misra
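A per-parameter sketch of the rule. Note that full Rprop also steps with the sign of the current gradient, a detail the slide's shorthand omits; the η⁺ = 1.2, η⁻ = 0.5, and initial step values are conventional choices, not from the slide:

```python
import numpy as np

def rprop_step(theta, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5):
    """One Rprop update: grow a parameter's step size while its gradient
    sign persists, shrink it when the sign flips, keep it otherwise.
    Only the sign of the gradient is used for the parameter move."""
    same = grad * prev_grad > 0
    flip = grad * prev_grad < 0
    step = np.where(same, eta_plus * step,
           np.where(flip, eta_minus * step, step))
    theta = theta - np.sign(grad) * step
    return theta, step

A = np.diag([1.0, 100.0])
grad_f = lambda th: A @ th

theta = np.array([2.0, 2.0])
step = np.full(2, 0.1)                   # initial per-parameter step sizes
prev_grad = np.zeros(2)
for t in range(100):
    g = grad_f(theta)
    theta, step = rprop_step(theta, g, prev_grad, step)
    prev_grad = g
print(theta.round(3))                    # settles near the origin
```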

SLIDE 27

Several First-order Methods

AdaGrad:

r_t = ∇f(θ_t)² + r_{t−1}
v_{t+1} = (α / √r_t) ∇f(θ_t)
θ_{t+1} = θ_t − v_{t+1}

RMSProp = Rprop + SGD:

r_t = (1 − γ) ∇f(θ_t)² + γ r_{t−1}
v_{t+1} = (α / √r_t) ∇f(θ_t)
θ_{t+1} = θ_t − v_{t+1}

slide credit: Ishan Misra
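A sketch of both accumulator updates side by side, on the same toy quadratic (α, γ, and the small δ added inside the square root for numerical safety are illustrative choices). The only difference is that AdaGrad's accumulator keeps growing, so its steps shrink over time, while RMSProp's leaky average forgets old gradients:

```python
import numpy as np

A = np.diag([1.0, 10.0])
grad_f = lambda th: A @ th
alpha, gamma, delta = 0.1, 0.9, 1e-8

def adagrad(theta, steps=500):
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        r = g**2 + r                        # r_t = grad^2 + r_{t-1}: keeps growing
        theta = theta - alpha / np.sqrt(r + delta) * g
    return theta

def rmsprop(theta, steps=500):
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)
        r = (1 - gamma) * g**2 + gamma * r  # leaky average instead of a sum
        theta = theta - alpha / np.sqrt(r + delta) * g
    return theta

# AdaGrad creeps toward the origin; RMSProp hovers near it at a scale set by alpha.
print(adagrad(np.array([1.0, 1.0])))
print(rmsprop(np.array([1.0, 1.0])))
```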

SLIDE 28

Several First-order Methods

AdaDelta (units argument): a Newton-style update has the right units for θ, unlike a plain gradient step:

v_{t+1} = H⁻¹∇f ∝ f′/f″ ∝ (1/units of θ) / (1/units of θ)² ∝ units of θ

Adam:

r_t = (1 − γ1) ∇f(θ_t) + γ1 r_{t−1}
p_t = (1 − γ2) ∇f(θ_t)² + γ2 p_{t−1}
r̂_t = r_t / (1 − γ1^t)
p̂_t = p_t / (1 − γ2^t)
v_t = α r̂_t / √p̂_t
θ_{t+1} = θ_t − v_t

slide credit: Ishan Misra
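A sketch of the Adam update as parameterized above (hyper-parameter values are illustrative; the small `delta` added for numerical safety is a standard addition, not on the slide):

```python
import numpy as np

A = np.diag([1.0, 10.0])
grad_f = lambda th: A @ th
alpha, g1, g2, delta = 0.1, 0.9, 0.999, 1e-8   # illustrative hyper-parameters

theta = np.array([1.0, 1.0])
r = np.zeros(2)   # first moment: moving average of gradients
p = np.zeros(2)   # second moment: moving average of squared gradients
for t in range(1, 1001):
    g = grad_f(theta)
    r = (1 - g1) * g + g1 * r
    p = (1 - g2) * g**2 + g2 * p
    r_hat = r / (1 - g1**t)          # bias corrections for the zero init
    p_hat = p / (1 - g2**t)
    theta = theta - alpha * r_hat / (np.sqrt(p_hat) + delta)
print(theta.round(4))                # approaches the minimum at the origin
```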


SLIDE 30

Momentum and NAG

Notation: θ - parameters of the network, f - objective function, ε - learning rate, ∇f - gradient of f, v - velocity vector, µ - momentum coefficient

Classical Momentum:

v_{t+1} = µ v_t − ε ∇f(θ_t)
θ_{t+1} = θ_t + v_{t+1}

Nesterov's Accelerated Gradient:

v_{t+1} = µ v_t − ε ∇f(θ_t + µ v_t)
θ_{t+1} = θ_t + v_{t+1}
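A side-by-side sketch of the two updates (the toy quadratic, ε, µ, and the `nag` flag are mine); the only difference is where the gradient is evaluated:

```python
import numpy as np

A = np.diag([1.0, 50.0])               # high curvature in one direction
grad_f = lambda th: A @ th
eps, mu = 0.02, 0.9

def run(nag, steps=100):
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        lookahead = theta + mu * v if nag else theta
        v = mu * v - eps * grad_f(lookahead)   # NAG: gradient at the partial update
        theta = theta + v
    return theta

print("CM :", run(nag=False).round(4))
print("NAG:", run(nag=True).round(4))
```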

SLIDE 31

Relationship between CM and NAG

NAG evaluates the gradient at θ_t + µv_t, the momentum step without the yet-unknown gradient correction. Thus, when adding µv_t results in an immediate undesirable increase in the objective f, ∇f(θ_t + µv_t) points back toward θ_t more strongly than ∇f(θ_t) does, so NAG corrects the velocity more quickly and reliably than CM.

figure credit: reference paper

SLIDE 32

Relationship between CM and NAG

Apply CM and NAG to a positive definite quadratic objective q(x) = xᵀAx/2 + bᵀx.

Difference in effective momentum coefficient, per eigendirection of A:
- Classical Momentum: µ
- NAG: µ(1 − λ_i ε), where λ_i is the eigenvalue of A for that direction

When ε is small, CM and NAG are nearly equivalent; when ε is large, NAG's smaller effective momentum µ(1 − λ_i ε) suppresses oscillations along high-curvature directions.


SLIDE 34

Deep Autoencoders

Structure of a Deep Autoencoder

figure credit: http://deeplearning4j.org/deepautoencoder.html

SLIDE 35

Deep Autoencoders

Sparse initialization: each unit is connected to 15 randomly chosen units in the previous layer, with weights drawn from a unit Gaussian.

Schedule for the momentum coefficient:

µ_t = min(1 − 2^(−1 − log₂(⌊t/250⌋ + 1)), µ_max)

Related schedules: µ_t = 1 − 3/(t + 5) for not strongly convex objectives - Nesterov (1983); constant µ_t for strongly convex objectives - Nesterov (2003).

table credit: reference paper
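The schedule is easy to read off in code; a minimal sketch (the helper name is mine, and µ_max = 0.99 is one of the values explored in the reference paper):

```python
import math

def momentum_schedule(t, mu_max=0.99):
    """mu_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), mu_max)."""
    return min(1.0 - 2.0 ** (-1 - math.log2(t // 250 + 1)), mu_max)

# The coefficient climbs toward 1 in jumps every 250 steps, capped at mu_max.
for t in [0, 250, 500, 1000, 2500, 10000]:
    print(t, round(momentum_schedule(t), 4))
```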

SLIDE 36

RNN - Echo-State Networks

Echo-State Networks (a family of RNNs):
- Hidden-to-output connections are learned from data
- Recurrent connections are fixed to a random draw from a specific distribution

figure credit: Mantas Lukoševičius

SLIDE 37

RNN - Echo-State Networks

ESN-based initialization:
- Spectral radius of the hidden-to-hidden matrix around 1.1
- The initial scale of the input-to-hidden connections plays an important role (a Gaussian draw with a standard deviation of 0.001 achieves a good balance, but this is task dependent)

Schedule of the momentum coefficient µ:
- µ = 0.9 for the first 1000 parameter updates;
- µ = µ0 ∈ {0, 0.9, 0.98, 0.995} afterwards

table credit: reference paper
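A sketch of this initialization recipe (matrix sizes are hypothetical, and using a dense Gaussian recurrent matrix is a simplification; ESNs often use sparse recurrent matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 200

# Input-to-hidden: small Gaussian scale; flagged on the slide as task dependent.
W_in = 0.001 * rng.normal(size=(n_hid, n_in))

# Hidden-to-hidden: random Gaussian, rescaled to spectral radius ~1.1.
W_hh = rng.normal(size=(n_hid, n_hid))
radius = max(abs(np.linalg.eigvals(W_hh)))
W_hh *= 1.1 / radius

print(max(abs(np.linalg.eigvals(W_hh))))   # ~1.1
```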


SLIDE 39

Questions?

SLIDE 40

Chris Dusold's Part

- Variance-SGD (No More Pesky Learning Rates)
- Adam (Adam: A Method for Stochastic Optimization)
- AdaGrad (Adaptive Subgradient Methods for Online Learning and Stochastic Optimization)