Training Neural Networks: Normalization, Regularization etc.
Intro to Deep Learning, Fall 2020
Quick Recap: Training a network
– Training minimizes a total loss: the average, over all training instances, of a divergence between the desired output and the actual output of the net for a given input
– The divergence quantifies the difference between the desired output d_t in response to input X_t and the actual output Y_t of the net in response to that input

    Loss(W) = (1/T) Σ_t div( Y_t, d_t ),    Y_t = f(X_t; W)
– The loss is minimized through gradient descent:

    W ← W − η ∇_W Loss(W)
– Waiting to process the entire training set before every update to the parameters can be terribly inefficient
– We get quicker updates by updating from subsets of the data (a minimal sketch follows below):
  » “Stochastic Gradient Descent” updates parameters after each instance
  » “Mini-batch descent” updates them after batches of instances
  » Both require shrinking learning rates to converge
  » Potentially leading to worse model estimates
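A minimal numpy sketch of mini-batch SGD, assuming a generic loss_and_grad(W, X, Y) helper (a hypothetical stand-in for the forward/backward pass) that returns the average divergence and its gradient over a batch:

    import numpy as np

    def minibatch_sgd(W, X, Y, loss_and_grad, lr=0.01, batch_size=32, epochs=10):
        # X: (N, d) inputs, Y: desired outputs, W: parameter vector
        N = X.shape[0]
        for epoch in range(epochs):
            order = np.random.permutation(N)                   # visit instances in a random order
            for start in range(0, N, batch_size):
                idx = order[start:start + batch_size]
                loss, grad = loss_and_grad(W, X[idx], Y[idx])  # average divergence and gradient on the batch
                W = W - lr * grad                              # update after every minibatch
            lr = lr * 0.95                                     # shrink the learning rate over time
        return W

With batch_size=1 this reduces to plain stochastic gradient descent.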
– Their convergence can be improved by methods that consider long-term trends in the gradients
  » Leading to faster and more assured convergence
– Momentum methods smooth the updates with the running mean (first moment) of the sequence of gradients
– RMSProp only considers the second moment of the derivatives
– ADAM and its siblings consider both the first and second moments
– All of them typically converge considerably faster than simple gradient descent (a sketch of the ADAM update follows below)
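A minimal numpy sketch of the ADAM update, which keeps running estimates of both the first and second moments of the gradients (standard default hyperparameters; the gradient itself is assumed to be computed elsewhere):

    import numpy as np

    def adam_step(W, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # m: running mean of the gradients (first moment); v: running mean of their squares (second moment)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)                     # bias correction for the first moment
        v_hat = v / (1 - beta2**t)                     # bias correction for the second moment
        W = W - lr * m_hat / (np.sqrt(v_hat) + eps)    # per-component step, scaled by the RMS of the gradients
        return W, m, v

Setting beta1=0 recovers an RMSProp-style update that uses only the second moment.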
– If the loss surface is very flat, gradient descent converges only slowly
– If it is very steep or irregular, gradient descent will not converge easily
– We want a surface with a usable slope, but not too shallow: ideally quadratic in nature
– For real-valued desired outputs, the L2 divergence is a natural choice
– For classification, where the network output is a softmax over classes, the KL divergence (cross-entropy) is used
– Setup: 2-dimensional input
– 100 training examples randomly generated
– Training implicitly assumes that all minibatches have similar distributions
– In practice, each minibatch may have a different distribution: a “covariate shift”
  » Which may occur in each layer of the network
– All covariate shifts can affect training badly
Batch normalization
– Eliminates the covariate shift between batches by normalizing the inputs to each unit
– Is applied after the weighted addition of inputs but before the application of the activation function
– Is done independently for each unit, to simplify computation
– For each unit, compute the minibatch mean and minibatch standard deviation, and normalize every instance in the batch by them; then shift the result to a new location in space using neuron-specific learned terms:

    μ_B = (1/B) Σ_i z_i                       (minibatch mean)
    σ²_B = (1/B) Σ_i (z_i − μ_B)²             (minibatch variance)
    u_i = (z_i − μ_B) / √(σ²_B + ε)           (covariate shift to zero mean, unit variance)
    ẑ_i = γ u_i + β                           (shift to the right position; γ, β are neuron-specific terms)
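A minimal numpy sketch of this forward computation for one layer, one column per unit and one row per instance in the minibatch (variable names are my own, not the slides'):

    import numpy as np

    def batchnorm_forward(Z, gamma, beta, eps=1e-5):
        # Z: (B, D) pre-activation values for a minibatch of B instances and D units
        mu = Z.mean(axis=0)                      # minibatch mean, per unit
        var = Z.var(axis=0)                      # minibatch variance, per unit
        U = (Z - mu) / np.sqrt(var + eps)        # zero-mean, unit-variance version
        Z_hat = gamma * U + beta                 # shift to the learned position (gamma, beta are per-unit)
        return Z_hat, (U, var, gamma, eps)       # cache intermediate values for the backward pass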
– For a minibatch, the total divergence is computed between the actual and desired outputs of the network for all inputs in the minibatch
– Its gradient is the average of the derivatives of the divergences for the individual training instances w.r.t. the parameters
– Without batch normalization, the output of the net for any input, and the derivative of the divergence for that input, are independent of the other inputs
– Shown pictorially in the following slide
– With batch normalization this is no longer true for a minibatch: every z_i affects every ẑ_j in the batch
  » Shown on the next slide
– Batch normalization invokes the mean and variance statistics across the minibatch to compute the corresponding normalized outputs
    u_i = (z_i − μ_B) / √(σ²_B + ε),    ẑ_i = γ u_i + β

– Every ẑ_i therefore depends, through μ_B and σ²_B, on every z_j in the minibatch
– For backpropagation we can first compute the derivative of the divergence w.r.t. ẑ_i separately for each instance, because the processing after the computation of ẑ_i is independent for each instance
– γ and β are parameters to be learned, along with the network's weights and biases
– The normalization itself is a vector operation over the minibatch
– Batch normalization can be drawn as a separate “Batchnorm” node, one for every unit
– Which computes the centered, normalized “u”s from the “z”s for the minibatch
– The diagram represents BN occurring at a single neuron
[Figure: Batch norm stage 1]
Backpropagation through batch norm (stage 1)
– The derivative of the divergence w.r.t. each z_i must include the paths through the minibatch mean and variance, since both depend on every z_i in the batch
– The rest of backprop continues from dDiv/dz_i in the usual manner (a sketch of the full backward computation through the BN node follows below)
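A minimal numpy sketch of the backward pass through the batch-norm node, matching the forward sketch above; these are the standard batch-norm backward equations, not code copied from the slides (dZ_hat is dDiv/dẑ for the whole minibatch):

    import numpy as np

    def batchnorm_backward(dZ_hat, cache):
        # cache = (U, var, gamma, eps) as returned by batchnorm_forward
        U, var, gamma, eps = cache
        B = dZ_hat.shape[0]
        inv_std = 1.0 / np.sqrt(var + eps)

        dgamma = np.sum(dZ_hat * U, axis=0)      # gradient for the learned scale
        dbeta = np.sum(dZ_hat, axis=0)           # gradient for the learned shift
        dU = dZ_hat * gamma                      # derivative w.r.t. the normalized values

        # z_i - mu_B equals U / inv_std; the mean and variance depend on every z_i in the batch
        dvar = np.sum(dU * U, axis=0) * (-0.5) * inv_std**2
        dmu = -np.sum(dU, axis=0) * inv_std      # the term through the variance vanishes (sum of z_i - mu_B is 0)
        dZ = dU * inv_std + dvar * 2.0 * (U / inv_std) / B + dmu / B
        return dZ, dgamma, dbeta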
Batch normalization: inference
– At inference time there is no minibatch to compute statistics over, so we use statistics gathered over the training minibatches:

    μ = (1/N_batches) Σ_batch μ(batch)
    σ² = (B / (B − 1)) · (1/N_batches) Σ_batch σ²(batch)

– μ(batch) and σ²(batch) here are obtained from the final converged network
– The B/(B − 1) term gives us an unbiased estimator for the variance
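A minimal sketch of collecting these inference-time statistics, assuming per-batch means and variances have already been gathered by running the converged network over the training minibatches:

    import numpy as np

    def inference_statistics(batch_means, batch_vars, batch_size):
        # batch_means, batch_vars: lists of per-unit minibatch statistics from the converged network
        mu = np.mean(batch_means, axis=0)
        var = (batch_size / (batch_size - 1.0)) * np.mean(batch_vars, axis=0)  # unbiased variance estimate
        return mu, var

    # at inference time a unit then uses these fixed statistics:
    # z_hat = gamma * (z - mu) / np.sqrt(var + eps) + beta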
– Batch normalization may be applied to only some layers, or even only selected neurons in a layer
– There is anecdotal evidence that BN eliminates the need for dropout
– To get maximum benefit from BN, learning rates must be increased and learning rate decay can be faster
– It also needs better randomization of the order of the training data
[Figure legend: blue lines show the error when the function is below the desired output; black lines show the error when it is above]
Find the function!
– These sharp changes in the learned function happen because the perceptrons in the network are individually capable of sharp changes in output
Regularizing the weights
– Sharp changes in the output require large weights, so smoothness can be encouraged by penalizing large weights
– The regularization parameter λ determines how important it is for us to minimize the weights
– We accept greater error on the training data in order to obtain a more acceptable (smoother) network

    L(W) = Loss(W) + (λ/2) Σ_k ||W_k||²
Regularized gradient descent:
– Initialize all weights W_1, …, W_K; then do:
  – Randomly permute the training instances
  – For every layer k, set ΔW_k = 0
  – For every training instance (X_t, d_t):
    » For every layer k: compute ∇_{W_k} Div(Y_t, d_t)
    » ΔW_k = ΔW_k + ∇_{W_k} Div(Y_t, d_t) / T
  – For every layer k:
    » W_k = W_k − η (ΔW_k + λ W_k)
– Until the loss has converged
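A minimal sketch of the regularized weight update at the end of each pass, assuming the per-layer average gradients have already been accumulated into delta_W (hypothetical variable names):

    def regularized_update(W, delta_W, eta, lam):
        # W, delta_W: lists of per-layer weight arrays and their accumulated average gradients
        # eta: learning rate, lam: regularization parameter
        return [Wk - eta * (dWk + lam * Wk) for Wk, dWk in zip(W, delta_W)]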
– The network's structure itself imposes smoothness: deeper networks impose more smoothness than shallow ones
– Each layer works on the already smooth surface output by the previous layer
– For the same number of total neurons, deeper networks impose greater implicit smoothness constraints
  » As opposed to the explicit constraints imposed by conventional regularization methods
[Figure: functions learned by networks with 3, 4, 6, and 11 layers, trained on 10000 training instances]
Bagging (ensemble classification):
– Sample the training data and train several different classifiers
– Classify a test instance with the entire ensemble of classifiers
– Vote across the classifiers for the final decision
– Empirically shown to improve significantly over training a single classifier from the combined data
(a minimal sketch follows below)
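A minimal sketch of this procedure, with a hypothetical train_classifier(X, y) helper that returns a model with a predict() method (neither is from the slides):

    import numpy as np

    def bagging_predict(X_train, y_train, x_test, train_classifier, n_models=10):
        N = X_train.shape[0]
        votes = []
        for _ in range(n_models):
            idx = np.random.choice(N, size=N, replace=True)        # bootstrap sample of the training data
            model = train_classifier(X_train[idx], y_train[idx])   # train one classifier per sample
            votes.append(model.predict(x_test))                    # each classifier labels the test instance
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]                           # majority vote for the final decision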
Dropout
– During training, for each input, each node in the network (along with its incident edges) is randomly “turned off”
– In practice, nodes are set to 0 according to the failure of a Bernoulli random number generator with success probability α
– The pattern of dropped nodes changes for each input, i.e. in every pass through the net
– Forward and backward computation are performed over the resulting reduced network
– The effective network is different for different inputs
– Gradients are obtained only for the weights and biases from “On” nodes to “On” nodes
– Dropout forces the network to learn “rich” and redundant patterns
– E.g. without dropout, a non-compressive layer could just “clone” its input to its output
  » Transferring the task of learning to the rest of the network upstream
– With dropout, the neurons must learn denser patterns
  » With redundancy
Dropout: training-time implementation
– For each layer l, draw a mask vector with one Bernoulli entry per unit and apply it to the unit outputs:

    mask^(l) ~ Bernoulli(α)          # mask takes value 1 with prob. α, 0 with prob. 1 − α
    z^(l) = W^(l) y^(l−1) + b^(l)
    y^(l) = mask^(l) · g(z^(l))

(a runnable sketch follows below)
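A minimal numpy sketch of this training-time forward step for one layer; alpha is the retention probability and g a generic activation function (assumed names, not the slides' code):

    import numpy as np

    def dropout_layer_forward(y_prev, W, b, g, alpha):
        # y_prev: outputs of the previous layer; W, b: this layer's weights and bias
        z = W @ y_prev + b                                            # weighted addition of inputs
        mask = (np.random.rand(z.shape[0]) < alpha).astype(z.dtype)  # 1 with prob. alpha, 0 otherwise
        return mask * g(z), mask                                      # dropped units output 0; keep mask for backprop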
Dropout: inference
– Dropout effectively trains a very large ensemble of networks with shared parameters; the desired test-time output is the statistical expectation of the output over all these networks
– Explicitly writing the network output as a function of the outputs of the individual neurons in the net, we approximate this by replacing each neuron's output with E[z_j^(k)]
  » Where E[z_j^(k)] is the expected output of the jth neuron in the kth layer over all networks in the ensemble
  » I.e. we approximate the expectation of a function as the function of the expectations
– During training the output of any neuron is multiplied by a mask that is a Bernoulli variable taking the value 1 with probability α
– Hence, over all the networks in the ensemble, the expected output of the neuron is α times its usual output
– Dropout at test time therefore consists of simply scaling the output of each neuron by α (see the sketch below)
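A minimal sketch of the corresponding test-time step, reusing the same (hypothetical) layer parameters as the training sketch above: no mask is drawn, and each neuron's output is simply scaled by alpha:

    def dropout_layer_inference(y_prev, W, b, g, alpha):
        # no random mask at test time; scale every neuron's output by its retention probability
        z = W @ y_prev + b
        return alpha * g(z)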
– We can either apply the α here (to the output of the neuron), or push the α onto all of the neuron's outgoing weights:

    z_j^(l) = Σ_i w_ij^(l) (α y_i^(l−1)) + b_j^(l) = Σ_i (α w_ij^(l)) y_i^(l−1) + b_j^(l)

– I.e. scaling each neuron's output by α is equivalent to multiplying every outgoing weight by α
Variants of dropout:
– Randomly chosen units remain unchanged across a time transition
– Drop individual connections, instead of nodes
– Scale up a randomly selected subset of the weights
– Fix the remaining weights to a negative constant
– Add or multiply weight-dependent Gaussian noise to the signal on each connection
In summary, several factors affect training results:
– Covariate shifts may be handled by batch normalization
– Overfitting may be handled by regularization and by more constrained (generally deeper) network architectures
– Dropout sometimes forces the network to learn more robust models
[Figure: training and validation error as a function of training epochs]
Gradient clipping
– When the divergence has a steep slope, the derivatives can become very large
  » This can result in instability
– Clip the derivatives to a maximum absolute value; a typical value is 5 (see sketch below)
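A minimal numpy sketch of this ceiling on the derivatives, using the typical threshold mentioned above:

    import numpy as np

    def clip_gradient(grad, threshold=5.0):
        # any component whose magnitude exceeds the threshold is replaced by sign(component) * threshold
        return np.clip(grad, -threshold, threshold)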
[Figure: loss as a function of a weight w]
– This helps ensure that a sensibly initialized network will never diverge during training
In closing: setting up and training the network
– Use an appropriate representation for the inputs and outputs
– Choose an appropriate network architecture
  » More neurons need more data
  » Deep is better, but harder to train
– Choose an appropriate regularization
– Choose a suitable optimization algorithm
  » E.g. ADAM
– Tune the hyperparameters (learning rate, regularization parameter, …) on held-out data
– Evaluate periodically on validation data, for early stopping if required (a minimal sketch follows below)
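A minimal sketch of that periodic validation check with early stopping; train_one_epoch and evaluate are hypothetical helpers standing in for the training pass and the validation-set evaluation:

    def train_with_early_stopping(model, train_one_epoch, evaluate, max_epochs=200, patience=5):
        best_error, epochs_without_improvement = float("inf"), 0
        for epoch in range(max_epochs):
            train_one_epoch(model)                        # one pass over the training data
            val_error = evaluate(model)                   # periodic check on held-out validation data
            if val_error < best_error:
                best_error, epochs_without_improvement = val_error, 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                                 # validation error stopped improving: stop early
        return model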