SLIDE 1

GCT634: Musical Applications of Machine Learning

Deep Learning: Part 2

Graduate School of Culture Technology, KAIST
Juhan Nam

SLIDE 2

Outlines

  • Convolutional Neural Networks (CNN)
  • Introduction
  • Mechanics
  • CNN for music classification
  • Training Neural Network Models
  • Preprocessing data
  • Building a model
  • Training
SLIDE 3

Convolutional Neural Network (CNN)

  • A neural network that contains convolutional layers and subsampling (or pooling) layers
  • Local filters (or weights) are convolved with the input or hidden layers and return feature maps
  • The feature maps are sub-sampled (or pooled) to reduce their dimensionality

[Figure: LeNet-5 architecture (LeCun, 1998) — INPUT 32x32 → convolutions → C1: feature maps 6@28x28 → subsampling → S2: feature maps 6@14x14 → convolutions → C3: feature maps 16@10x10 → subsampling → S4: feature maps 16@5x5 → C5: layer 120 → F6: layer 84 (full connections) → OUTPUT 10 (Gaussian connections)]

SLIDE 4

History

  • Highly related to human visual recognition
  • Receptive field, Simple/Complex cells (Hubel and Wiesel, 1962)
  • Neocognitron (Fukushima, 1980): an early computational model
  • LeNet (LeCun, 1998): the first CNN model, applied to hand-written zip code recognition

(Manassi, 2013)

SLIDE 5

History

  • The breakthrough in image classification (2012)
  • CNN trained with 2 GPUs and 1.2M images for one week
  • ReLU (fast and non-saturating), dropout (regularization)
  • ImageNet challenge: top-5 error of 15.3% (more than 10 percentage points lower than the runner-up)
  • Opened the era of deep learning

(Krizhevsky et al., 2012)

SLIDE 6

ImageNet Challenge

  • 2010-11: hand-crafted features + classifiers
  • 2012-2016: ConvNets
  • 2012: AlexNet
  • 2013: ZFNet
  • 2014: VGGNet, InceptionNet
  • 2015: ResNet
  • 2016: Ensemble networks
  • 2017: Squeeze and Excitation Net

(NIPS 2017 Tutorial – Deep Learning: Practice and Trend)

SLIDE 7

Hierarchical Representation Learning

  • Learned features are similar to those in the human visual system

[Figure: Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier (Zeiler and Fergus, 2013)]

Borrowed from LeCun’s slides

SLIDE 8

Convolutional Neural Networks

  • ConvNet exploits these two properties
  • Locality: objects tend to have a local spatial support
  • Translation invariance: object appearance is independent of location

(NIPS 2017 Tutorial – Deep Learning: Practice and Trend)

The bird occupies a local area and looks the same in different parts of an image

SLIDE 9

Convolutional Neural Networks

  • ConvNet exploits these two properties
  • Locality: objects tend to have a local spatial support
  • Translation invariance: object appearance is independent of location
  • Counter-examples: face images (especially passport photos)

(from the MS-Celeb-1M dataset)

SLIDE 10

Convolutional Neural Networks

  • Locality and translation invariance appear in audio and text, too
SLIDE 11

Incorporating Assumptions: Locality

  • Make the fully-connected layer locally connected
  • Each neuron is connected to a local area (receptive field)
  • Different neurons are connected to different locations (forming a feature map)

(NIPS 2017 Tutorial – Deep Learning: Practice and Trend)

SLIDE 12

Incorporating Assumptions: Translation Invariance

  • Weight sharing: units connected to different locations have the same weights (filters)
  • Convolutional layer: a locally-connected layer with weight sharing
  • The weights are invariant across locations; the output is equivariant to translation

(NIPS 2017 Tutorial – Deep Learning: Practice and Trend)

(e.g. face recognition)

SLIDE 13

Convolution Mechanics

  • Image and Feature Map
  • 3D tensor: width, height, and depth (channel)
  • Channels (3): R, G, B
  • 2D convolution: the filter must have the same depth as the input
  • The channel (depth) dimension of the output feature map corresponds to the number of filters
  • Note that the tensors become 4D when a batch or mini-batch is used (see the sketch below)

[Figure: input tensor (width × height × channel) convolved with a filter → hidden unit of the output feature map]
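The tensor shapes above can be checked with a few lines of code. A minimal sketch, assuming PyTorch; the sizes (8 RGB images of 32×32, six 5×5 filters) are only illustrative.

```python
import torch
import torch.nn as nn

# Input is a 4D tensor (batch, channel, height, width); each filter spans all
# input channels, and the number of filters sets the output channel (depth).
x = torch.randn(8, 3, 32, 32)        # mini-batch of 8 RGB images
conv = nn.Conv2d(in_channels=3,      # must match the input depth
                 out_channels=6,     # number of filters = output channels
                 kernel_size=5)
feature_map = conv(x)
print(feature_map.shape)             # torch.Size([8, 6, 28, 28])
```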

SLIDE 14

Convolution Mechanics

  • Stride
  • Sliding with hopping (analogous to the hop size in STFT)
  • Padding
  • Zero-padding at the border to adjust the feature map size or to handle striding (analogous to zero-padding in STFT)

  • Convolution Animation
  • https://github.com/vdumoulin/conv_arithmetic

The output size is (N − F)/S + 1 (N: input size, F: filter size, S: stride). Do zero-padding if (N − F)/S is not an integer (a small helper is sketched below).
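As a quick check of this rule, here is a tiny helper; the function name and example values are illustrative, not from the slides.

```python
# Output size of a convolution: (N - F + 2*P) // S + 1; with P = 0 this is
# exactly the (N - F)/S + 1 rule above.
def conv_output_size(n, f, s=1, p=0):
    assert (n - f + 2 * p) % s == 0, "zero-pad so the filter fits evenly"
    return (n - f + 2 * p) // s + 1

print(conv_output_size(32, 5))           # 28 (e.g. LeNet's C1 feature maps)
print(conv_output_size(7, 3, s=1, p=1))  # 7  (the F=3, S=1, P=1 setting on the next slide)
```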

SLIDE 15

Convolution Mechanics


Common settings: K (number of filters) = powers of 2, e.g. 32, 64, 128, 512

  • F = 3, S = 1, P = 1
  • F = 5, S = 1, P = 2
  • F = 5, S = 2, P = ? (whatever fits)
  • F = 1, S = 1, P = 0

(Stanford CS231n Slides)

SLIDE 16

Sub-Sampling (or Pooling)

  • Summarizes the feature map into a lower-dimensional feature map
  • This is the core mechanism that makes the features translation-invariant
  • Types
  • Max-pooling: most popular choice
  • Average pooling
  • Standard deviation pooling

[Figure: 2 × 2 max pooling with stride 2 applied to a small feature map (sketched in code below)]
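A small sketch of 2×2 max pooling with stride 2, assuming PyTorch; the input values are only illustrative.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 5., 2., 4.],
                  [2., 3., 9., 1.],
                  [5., 3., 3., 4.],
                  [7., 8., 2., 2.]]).reshape(1, 1, 4, 4)  # (batch, channel, H, W)

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keep the max of each 2x2 block
print(pool(x).squeeze())
# tensor([[5., 9.],
#         [8., 4.]])
```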

SLIDE 17

ConvNet Demo: Image Classification

  • https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html

SLIDE 18

Designing CNN for Music Classification

  • Input data
  • Spectrogram
  • Log-Spectrogram: Mel or Constant-Q
  • Raw Waveforms
  • CNN structure
  • 1D-CNN
  • 2D-CNN
  • Sample-CNN
SLIDE 19

1D CNN

  • Assumes locality and translation invariance only along the time axis
  • The receptive filter covers the whole frequency range (1D feature map)
  • The first fully-connected layer takes globally pooled features

[Figure: spectrogram (frequency × time) → convolution and pooling layers along time (channel dimension) → fully-connected layers → output]

SLIDE 20

1D CNN

  • Assumes locality and translation invariance only along the time axis
  • The receptive filter covers the whole frequency range (1D feature map)
  • The first fully-connected layer takes globally pooled features
  • Another view: the frequency bins are treated as the input channels (see the sketch below)

[Figure: input of shape frequency (as channels) × time → 1D convolution along time → channel dimension]
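A minimal 1D-CNN sketch of this idea, assuming PyTorch; the layer sizes (128 mel bins, 64 filters, kernel size 4) are illustrative and not the exact model from the slides.

```python
import torch
import torch.nn as nn

class OneDCNN(nn.Module):
    def __init__(self, n_mels=128, n_classes=10):
        super().__init__()
        # mel bins are the channel dimension, so filters convolve along time only
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(64, 64, kernel_size=4), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):               # x: (batch, n_mels, time)
        h = self.conv(x)                # (batch, 64, time')
        h = h.max(dim=2).values         # global max pooling over time
        return self.fc(h)               # fed to the fully-connected layer

logits = OneDCNN()(torch.randn(2, 128, 1024))
print(logits.shape)                     # torch.Size([2, 10])
```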

SLIDE 21

1D CNN: Example

  • Dieleman (2014)
  • http://benanne.github.io/2014/08/05/spotify-cnns.html
SLIDE 22

1D CNN

  • Advantage
  • The 1D feature map significantly reduces the number of parameters (compared to a 2D feature map)
  • Fast to train
  • Works well for small datasets
  • Disadvantage
  • Not invariant to pitch shifting
  • Key transposition changes the feature activations and therefore the results
SLIDE 23

2D CNN

  • Assumes translation invariance along both time and frequency
  • The receptive filter covers a time-frequency patch (typically 3x3)
  • A log-frequency spectrogram (e.g. mel or constant-Q) is required as input
SLIDE 24

2D CNN: Example

  • Choi et al. (2016)
  • VGGNet style

FCN-4:
  • Mel-spectrogram input: 96 × 1366 × 1
  • Conv 3×3×128, MP (2, 4) → output 48 × 341 × 128
  • Conv 3×3×384, MP (4, 5) → output 24 × 85 × 384
  • Conv 3×3×768, MP (3, 8) → output 12 × 21 × 768
  • Conv 3×3×2048, MP (4, 8) → output 1 × 1 × 2048
  • Output: 50 × 1 (sigmoid)
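A rough sketch of one such VGG-style block, assuming PyTorch; it is not a faithful reproduction of FCN-4, but it reproduces the first block's output shape from the list above.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(1, 128, kernel_size=3, padding=1),  # 3x3 filters over the mel-spectrogram
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 4)),             # pool (2, 4) over (frequency, time)
)

x = torch.randn(1, 1, 96, 1366)                   # mel-spectrogram: 96 bins x 1366 frames
print(block(x).shape)                             # torch.Size([1, 128, 48, 341])
```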

SLIDE 25

2D CNN

  • Advantage
  • Relatively invariant to pitch shifting
  • Learns more general features in the bottom layers
  • Can exploit advanced techniques from image classification
  • Disadvantage
  • The 2D feature map significantly increases the number of parameters (compared to the 1D feature map)
  • Requires a large-scale dataset and accordingly more computational resources (e.g. GPU and memory)

SLIDE 26

Sample-CNN

  • End-to-end model that takes raw waveforms directly as input
  • The receptive field can vary from frame-level (e.g. 256 samples) to sample-level (e.g. 2 or 3 samples)
  • The CNN must be sufficiently deep to learn the variations within a frame

[Figure: Sample-CNN — raw waveform → strided 1D convolution layer with a large filter size and stride → stacked 1D convolutional blocks (Conv1D → BatchNorm → ReLU → MaxPool) → multi-level global max pooling → fully-connected layers → sigmoid output; a code sketch follows]
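A minimal Sample-CNN sketch, assuming PyTorch; the settings follow the spirit of the figure and the 3^9 example on the next slide, but this is not the exact published model.

```python
import torch
import torch.nn as nn

def sample_block(c_in, c_out):
    # sample-level block: very short filters followed by 3x max pooling
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
        nn.MaxPool1d(3),
    )

net = nn.Sequential(
    nn.Conv1d(1, 128, kernel_size=3, stride=3),  # strided front end: 59049 -> 19683
    *[sample_block(128, 128) for _ in range(9)], # nine blocks: 19683 -> 1
)

waveform = torch.randn(1, 1, 59049)              # ~2678 ms of audio at 22.05 kHz
print(net(waveform).shape)                       # torch.Size([1, 128, 1])
```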

SLIDE 27

Sample-CNN: Example

  • Lee et al. (2017)
  • Short filters work better than long ones

3^9 model: 19683 frames, 59049 samples (2678 ms) as input

layer                      stride   output        # of params
conv 3-128                 3        19683 × 128   512
conv 3-128, maxpool 3      1, 3     6561 × 128    49280
conv 3-128, maxpool 3      1, 3     2187 × 128    49280
conv 3-256, maxpool 3      1, 3     729 × 256     98560
conv 3-256, maxpool 3      1, 3     243 × 256     196864
conv 3-256, maxpool 3      1, 3     81 × 256      196864
conv 3-256, maxpool 3      1, 3     27 × 256      196864
conv 3-256, maxpool 3      1, 3     9 × 256       196864
conv 3-256, maxpool 3      1, 3     3 × 256       196864
conv 3-512, maxpool 3      1, 3     1 × 512       393728
conv 1-512, dropout 0.5    1        1 × 512       262656
sigmoid                    −        50            25650
Total params                                      1.9 × 10^6

[Figure: sample-level raw waveform model with a sample-level strided convolution layer]

SLIDE 28

Sample-CNN

  • Advantage
  • No need to tune the STFT and log-scale parameters
  • The (sub-)optimal parameters differ depending on the data and task
  • No need to store preprocessed spectrograms
  • Disadvantage
  • More parameters and memory
  • Slow to train
SLIDE 29

Training Neural Network Models

  • Preprocessing data
  • Data augmentation
  • Normalization
  • Training
  • Loss optimization and monitoring loss
  • Early Stopping
  • Hyper-parameter optimization
  • Building a model
  • CNN structure: 1D or 2D, filter size/number, pooling size, …
  • Loss function
  • Batch normalization
  • Dropout, weight decay
  • Weight Initialization
SLIDE 30

Training Neural Network Models

  • Preprocessing data
  • Data augmentation
  • Normalization
  • Training
  • Loss optimization and monitoring loss
  • Early Stopping
  • Hyper-parameter optimization
  • Building a model
  • CNN structure: 1D or 2D, filter size/number, pooling size, …
  • Loss function
  • Batch normalization
  • Dropout, weight decay
  • Weight Initialization
SLIDE 31

Training Neural Network Models

  • Preprocessing data
  • Data augmentation
  • Normalization
  • Training
  • Loss optimization and monitoring loss
  • Early Stopping
  • Hyper-parameter optimization
  • Building a model
  • CNN structure: 1D or 2D, filter size/number, pooling size, …
  • Loss function
  • Batch normalization
  • Dropout, weight decay
  • Weight Initialization

Regularization!

SLIDE 32

Bias-Variance Trade-Off

  • Assumption: the data given to us is not sufficient
  • Bias: how much the learned model differs from the true model
  • Variance: how much the learned models differ from each other when they are trained with different training sets
  • Under-fitting: high bias and low variance
  • Over-fitting: low bias and high variance

[Figure: “Jazz” vs. “Classical” decision boundaries in feature space — under-fitting, good fitting, and over-fitting]
SLIDE 33

Regularization

  • Overfitting is more likely in DNNs, so we need to regularize them
  • Regularization methods
  • Weight Decay
  • Dropout
  • Early stopping
  • Data augmentation
SLIDE 34

Weight Decay

  • L2 regularization (weight decay): S(W) = \sum_{i,j} W_{i,j}^2
  • L1 regularization: S(W) = \sum_{i,j} |W_{i,j}|
  • The regularization terms are computed for each layer and can be added to the loss separately
  • The bias terms are usually not included

Total loss: L(W) = \sum_{n=1}^{N} \ell(y_n, \hat{y}_n; W) + \lambda S(W)
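As a sketch (assuming PyTorch; the helper name and λ = 1e-4 are illustrative), the L2 penalty can be added to the loss by hand, although in practice the optimizer's weight_decay argument achieves the same effect.

```python
import torch

def loss_with_weight_decay(loss, model, lam=1e-4):
    # sum of squared weights over all layers, excluding the bias terms
    l2 = sum((w ** 2).sum()
             for name, w in model.named_parameters() if "bias" not in name)
    return loss + lam * l2

# equivalent shortcut built into most optimizers:
# torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```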

SLIDE 35

Dropout

  • In each feedforward pass, the hidden layer units are randomly turned off
  • The binary mask is implemented by multiplying the hidden units with binary random variables
  • The dropout probability is a hyper-parameter

(Srivastava et al., 2014)

SLIDE 36

Dropout: why does this help?

  • Prevents co-adaptation of hidden units
  • Co-adaptation: two or more hidden units detect the same features
  • This waste of resources can be prevented by dropout
  • Dropout enables a large ensemble of models
  • An FC layer with 1024 units has 2^1024 possible mask combinations
  • All of these sub-networks share the same parameters
SLIDE 37

Dropout: Implementation

  • Inverted dropout

(Stanford CS231n Slides)
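A NumPy sketch of inverted dropout in the spirit of the referenced CS231n example (W, x, and p = 0.5 are illustrative): scaling by 1/p at training time means nothing has to change at test time.

```python
import numpy as np

p = 0.5  # keep probability (hyper-parameter)

def train_forward(x, W):
    h = np.maximum(0, x @ W)                   # hidden activations (ReLU)
    mask = (np.random.rand(*h.shape) < p) / p  # binary mask, pre-scaled by 1/p
    return h * mask                            # randomly turn units off

def test_forward(x, W):
    return np.maximum(0, x @ W)                # no dropout, no extra scaling
```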

SLIDE 38

Early Stopping

  • Stop training (or reduce the learning rate) if the loss on the validation set does not decrease
  • Track the minimum validation loss and wait to see whether the minimum is sustained (a loop is sketched below)

[Figure: training error and validation error over training iterations (from the PRML book)]
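A minimal early-stopping sketch; the validation losses here are made up. Keep the best validation loss and stop once it has not improved for a few epochs.

```python
val_losses = [0.45, 0.40, 0.37, 0.36, 0.36, 0.37, 0.38, 0.39]  # per-epoch validation loss
best, patience, wait = float("inf"), 3, 0
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, wait = loss, 0      # new minimum: remember it (and checkpoint the model)
    else:
        wait += 1                 # minimum not improved this epoch
        if wait >= patience:
            print(f"early stop at epoch {epoch}, best validation loss {best:.2f}")
            break
```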

SLIDE 39

Data Augmentation

  • Increasing the quantity of input data while preserving the corresponding output property
  • Digital audio effects usually preserve the categorical labels of music
  • Commonly used digital audio effects (sketched in code after this list)
  • Pitch shifting
  • Resampling
  • Time-stretching
  • Equalization
  • Adding noise
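A sketch of label-preserving augmentations using the effects listed above, assuming librosa is available; the example clip and parameter values are illustrative.

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))                       # any audio clip works here

pitch_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # shift up 2 semitones
stretched = librosa.effects.time_stretch(y, rate=1.1)             # play 10% faster
noisy = y + 0.005 * np.random.randn(len(y))                       # add a small amount of noise
```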
SLIDE 40

Training Neural Network Models

  • Preprocessing data
  • Data augmentation
  • Normalization
  • Training
  • Loss optimization and monitoring loss
  • Early Stopping
  • Hyper-parameter optimization
  • Building a model
  • CNN structure: 1D or 2D, filter size/number, pooling size, …
  • Loss function
  • Batch normalization
  • Dropout, weight decay
  • Weight Initialization
SLIDE 41

Training Neural Network Models

  • Preprocessing data
  • Data augmentation
  • Normalization
  • Training
  • Loss optimization and monitoring loss
  • Early Stopping
  • Hyper-parameter optimization
  • Building a model
  • CNN structure: 1D or 2D, filter size/number, pooling size, …
  • Loss function
  • Batch normalization
  • Dropout, weight decay
  • Weight Initialization

Normalization!

SLIDE 42

Normalization: Preprocessing

  • Standardization
  • PCA-whitening

[Figure: standardization = mean subtraction + division by the standard deviation; PCA-whitening = mean subtraction + rotation + division by the standard deviation. Be careful with dividing by zero]

SLIDE 43

Normalization: Preprocessing

  • In practice,
  • Mean subtraction is important
  • Standard deviation division is not very common, but it seems to help in music classification (e.g. on log-compressed spectrograms)
  • PCA-whitening is also not very common
  • Common pitfall
  • The mean and standard deviation must be computed only from the training data (not the entire dataset), as sketched below
  • They should then be used consistently for the validation and test data
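A NumPy sketch of this pitfall (the array shapes are made up): the statistics come from the training split only and are then reused for the other splits.

```python
import numpy as np

X_train = np.random.randn(100, 128, 431)   # e.g. 100 log-mel spectrograms
X_test = np.random.randn(20, 128, 431)

mean = X_train.mean(axis=0)                # computed from the training data only
std = X_train.std(axis=0) + 1e-8           # avoid dividing by zero

X_train = (X_train - mean) / std
X_test = (X_test - mean) / std             # reuse the training statistics
```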
SLIDE 44

Weight Initialization

  • Method 1: small random numbers
  • W = 0.01*np.random.randn(D,H) (D: input size, H: output size)
  • Set small Gaussian random numbers as the weights and zeros as the biases
  • Works well for small networks but not for deep networks
  • Method 2: “Xavier initialization”
  • W = np.random.randn(D,H)/np.sqrt(D)
  • Works well for tanh activations (not for ReLU)
  • Method 3: “He initialization”
  • W = np.random.randn(D,H)/np.sqrt(D/2)
  • Works well for ReLU (recommended)
SLIDE 45

Batch Normalization

  • Difficulty in training deep neural networks
  • Gradient vanishing / exploding
  • Internal covariate shift of the activations at each layer
  • Idea
  • Maintain a unit Gaussian distribution over the activations throughout the network!
  • Note that the weight initialization methods are also designed to keep this consistent distribution of activations

SLIDE 46

Batch Normalization

  • Additional layer between FC (or Conv) and the nonlinear function
  • Compute the normalization for each mini-batch (sketched below)
  • Scale and shift to exploit the nonlinearity: inputs with zero mean and unit variance fall mostly in the linear range of sigmoid or tanh
  • The scale and shift are learned via back-propagation

[Figure: FC or Conv → Batch Norm → Nonlinearity] (Ioffe and Szegedy, 2015)
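A NumPy sketch of what the batch-norm layer computes for one mini-batch; gamma and beta are the learned scale and shift, and the numbers are illustrative.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta            # scale and shift, learned via back-propagation

x = 3.0 * np.random.randn(32, 100) + 5.0   # a mini-batch of 32 activation vectors
out = batch_norm(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.std())               # approximately 0 and 1
```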

SLIDE 47

Batch Normalization

  • Advantages
  • Enable higher learning rates: improve gradient flow through the networks
  • Regularize the model: dropout or L2 weight regularization can be removed or reduced

Figure 2: Single crop validation accuracy of Inception and its batch-normalized variants, vs. the number of training steps.

Batch Normalization is a must-have layer!!!

ImageNet Classification (Ioffe and Szegedy, 2015)

SLIDE 48

Training Neural Network Models

  • Preprocessing data
  • Data augmentation
  • Normalization
  • Training
  • Loss optimization and monitoring loss
  • Early Stopping
  • Hyper-parameter optimization
  • Building a model
  • CNN structure: 1D or 2D, filter size/number, pooling size, …
  • Loss function
  • Batch normalization
  • Dropout, weight decay
  • Weight Initialization
SLIDE 49

Training Neural Network Models

  • Preprocessing data
  • Data augmentation
  • Normalization
  • Training
  • Loss optimization and monitoring loss
  • Early Stopping
  • Hyper-parameter optimization
  • Building a model
  • CNN structure: 1D or 2D, filter size/number, pooling size, …
  • Loss function
  • Batch normalization
  • Dropout, weight decay
  • Weight Initialization

Optimization!

SLIDE 50

Optimizers

  • Stochastic Gradient Descent (SGD)
  • Problems with SGD
  • Optimization can be very slow when the loss function has a high condition number

x_{t+1} = x_t - \alpha \nabla f(x_t)

  • If the loss changes quickly in one direction and slowly in another, gradient descent makes very slow progress along the shallow dimension and jitters along the steep direction
  • High condition number: the ratio of the largest to the smallest singular value of the Hessian matrix is large

(Stanford CS231n Slides)

SLIDE 51

Optimizers

  • Problems with SGD
  • The gradient becomes zero and the update gets stuck
  • Saddle points are common in high dimensions

[Figure: loss functions with a local minimum and a saddle point; at both, the gradient is zero and gradient descent gets stuck]

(Stanford CS231n Slides)

SLIDE 52

Momentum

  • Update the parameters using not only the current gradient but also the previous history
  • Equivalent to a leaky integrator (IIR filter); a code sketch follows

v_{t+1} = \rho v_t - \alpha \nabla f(x_t)   (accumulated gradients; \alpha \nabla f(x_t) is the current gradient scaled by the learning rate)
x_{t+1} = x_t + v_{t+1}
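A NumPy sketch of the update above on a toy problem; the learning rate and momentum coefficient are illustrative.

```python
import numpy as np

def sgd_momentum_step(x, v, grad_f, alpha=0.01, rho=0.9):
    v = rho * v - alpha * grad_f(x)  # accumulate gradients (leaky integrator)
    x = x + v                        # move by the accumulated velocity
    return x, v

x, v = np.array([5.0]), np.zeros(1)  # minimize f(x) = x^2, whose gradient is 2x
for _ in range(100):
    x, v = sgd_momentum_step(x, v, lambda x: 2 * x)
print(x)                             # close to 0
```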

SLIDE 53

Momentum

  • SGD + Momentum helps with local minima, saddle points, poor conditioning, and gradient noise: it is like rolling a ball

(Stanford CS231n Slides)

SLIDE 54

Nesterov Momentum

  • Compute the gradient at the tip of the velocity (i.e., look ahead by \rho v_t before stepping)

v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)
x_{t+1} = x_t + v_{t+1}

[Figure: momentum update (gradient evaluated at x_t) vs. Nesterov momentum (gradient evaluated at x_t + \rho v_t); the actual step combines the velocity and the gradient]

Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983; Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004; Sutskever et al., “On the importance of initialization and momentum in deep learning”, ICML 2013

(Stanford CS231n Slides)

SLIDE 55

Nesterov Momentum

[Figure: optimization trajectories of SGD, SGD+Momentum, and Nesterov momentum on a 2D loss surface]

(Stanford CS231n Slides)

SLIDE 56

Per-Parameter Update

  • Use an adaptive learning rate instead of a constant value
  • Every parameter has a different learning rate
  • Types
  • AdaGrad
  • RMSProp
  • Adam
  • AdaDelta

x_{t+1} = x_t - \alpha \nabla f(x_t)   (one global learning rate)
x_{t+1,i} = x_{t,i} - \alpha_i \nabla f(x_t)_i   (per-parameter learning rates)

SLIDE 57

AdaGrad

  • Divide the learning rate by the accumulated amount of past gradients per parameter
  • Decrease the learning rate for frequently updated parameters
  • Increase the learning rate for less updated parameters
  • The learning rate is automatically adjusted
  • However, h_i keeps growing, so the learning rate is likely to shrink toward zero

h_i = \sum_t (\nabla f(x_t)_i)^2
x_{t+1,i} = x_{t,i} - \frac{\alpha}{\sqrt{h_i} + \epsilon} \nabla f(x_t)_i

SLIDE 58

RMSProp

  • To prevent the problem in AdaGrad, RMSProp takes an exponentially decaying weighted sum of the squared gradients

h_{t+1,i} = \gamma h_{t,i} + (1 - \gamma)(\nabla f(x_t)_i)^2
x_{t+1,i} = x_{t,i} - \frac{\alpha}{\sqrt{h_{t+1,i}} + \epsilon} \nabla f(x_t)_i

[Figure: optimization trajectories of SGD, SGD+Momentum, and RMSProp on a 2D loss surface]

(Stanford CS231n Slides)

SLIDE 59

Adam

  • Adaptive moment estimation
  • RMSProp + momentum: a rolling ball (momentum) with friction (RMSProp)
  • One of the most widely used optimizers (a sketch follows)

m_{t+1} = \rho m_t + (1 - \rho) \nabla f(x_t)   (first moment)
h_{t+1,i} = \gamma h_{t,i} + (1 - \gamma)(\nabla f(x_t)_i)^2   (second moment)
x_{t+1,i} = x_{t,i} - \frac{\alpha}{\sqrt{h_{t+1,i}} + \epsilon} m_{t+1,i}
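A NumPy sketch of this simplified Adam update (without the bias-correction terms of the full algorithm); the hyper-parameter values are illustrative.

```python
import numpy as np

def adam_step(x, m, h, grad_f, alpha=0.01, rho=0.9, gamma=0.999, eps=1e-8):
    g = grad_f(x)
    m = rho * m + (1 - rho) * g             # first moment (momentum)
    h = gamma * h + (1 - gamma) * g ** 2    # second moment (RMSProp-style)
    x = x - alpha * m / (np.sqrt(h) + eps)  # per-parameter step size
    return x, m, h

x, m, h = np.array([5.0]), np.zeros(1), np.zeros(1)
for _ in range(2000):
    x, m, h = adam_step(x, m, h, lambda x: 2 * x)  # minimize f(x) = x^2
print(x)                                           # close to 0
```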

SLIDE 60

Optimizer Animation

Alec Radford: https://imgur.com/a/Hqolp

SLIDE 61

Hyper-Parameter Optimization

  • What you usually adjust during training for a given model
  • Initial learning rate
  • Learning rate decay schedule (Annealing)
  • Reduce the learning rate by some factor every few epochs, after a fixed number of epochs, or whenever early stopping occurs
  • Regularization strength: L2 or dropout probability


SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

[Figure: loss vs. epoch with learning rate decay] Learning rate decay is more critical with SGD+Momentum and less common with Adam (Stanford CS231n Slides)

SLIDE 62

In Summary: Recommended Settings

  • ReLU for the activation functions
  • Batch normalization
  • Dropout
  • Standardization (preprocessing)
  • He initialization
  • Adam or SGD with Nesterov momentum
  • Early stopping and Annealing
  • Data Augmentation
  • (Ensemble)