Lecture 7 Recap
- Prof. Leal-Taixé and Prof. Niessner
Beyond linear
1-layer network: f = Wx, with input x (a 128×128 image) and output f (10 class scores).
Neural Network: Width and Depth
Prediction → Loss (Softmax, Hinge)
A neuron combines the inputs x_0, x_1, x_2 with weights θ_0, θ_1, θ_2 and applies the sigmoid
σ(x) = 1 / (1 + e^(−x))
What is the shape of this function? Its output Π_i can be interpreted as a probability p(y_i = 1 | x_i, θ).
Minimization
Π_i = σ(x_i θ)
Per-sample loss: L(Π_i, y_i) = y_i log Π_i + (1 − y_i) log(1 − Π_i)
Cost over n training samples: C(θ) = −(1/n) Σ_{i=1}^n [ y_i log Π_i + (1 − y_i) log(1 − Π_i) ]
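A minimal NumPy sketch of this binary cross-entropy cost, assuming a design matrix X with one row per sample (the function names and toy data are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(theta, X, y, eps=1e-12):
    # Π_i = σ(x_i θ): predicted probability that y_i = 1
    p = sigmoid(X @ theta)
    # C(θ) = -(1/n) Σ [ y_i log Π_i + (1 - y_i) log(1 - Π_i) ]
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# toy example: 4 samples, 3 features (first column acts as a bias)
X = np.array([[1., 0.5, -1.], [1., -0.3, 2.], [1., 1.2, 0.1], [1., -2., -0.5]])
y = np.array([1., 0., 1., 0.])
theta = np.zeros(3)
print(binary_cross_entropy(theta, X, y))  # ≈ 0.693 = log(2) for θ = 0
```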
From binary to multi-class cross-entropy:
C(θ) = −(1/n) Σ_{i=1}^n [ y_i log Π_i + (1 − y_i) log(1 − Π_i) ]   (binary)
C(θ) = −(1/n) Σ_{i=1}^n Σ_{c=1}^M y_{i,c} log p_{i,c}   (multi-class)
y_{i,c}: binary indicator whether c is the label for image i
p_{i,c}: probability given by our sigmoid function
Softmax
For multiple classes the network produces outputs Π_1, Π_2, Π_3 from the inputs x_0, x_1, x_2. These outputs cannot be used as probabilities directly in the multi-class case, because all outputs need to sum to 1.
C(θ) = −(1/n) Σ_{i=1}^n Σ_{c=1}^M y_{i,c} log p_{i,c},   with p_{i,c} = Π_{i,c} / Σ_c Π_{i,c}
Normalizing turns the outputs into probabilities (M is the number of classes).
The normalized outputs Π_1, Π_2, Π_3 become p(dog|X_i), p(cat|X_i), p(bird|X_i), e.g.
p(cat|X_i) = e^(s_cat) / Σ_c e^(s_c)
where s_cat is the score for class cat given by all the layers of the network, and the denominator is the normalization.

Softmax loss:
L_i = −log( e^(s_{y_i}) / Σ_k e^(s_k) )
Evaluate the ground-truth score for the image.
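A small NumPy sketch of the softmax loss for a single image (the scores and class index are made up for illustration):

```python
import numpy as np

def softmax_loss(s, y):
    # numerically stable softmax: p_c = e^{s_c} / Σ_k e^{s_k}
    s = s - np.max(s)
    p = np.exp(s) / np.sum(np.exp(s))
    # L_i = -log p_{y_i}: evaluate the normalized ground-truth score
    return -np.log(p[y])

scores = np.array([2.0, 5.0, -1.0])  # e.g., scores for dog, cat, bird
print(softmax_loss(scores, y=1))     # small loss: the true class (cat) has the largest score
```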
Hinge loss:
L_i = Σ_{k≠y_i} max(0, s_k − s_{y_i} + 1)

Softmax vs. Hinge loss: the softmax loss comes from a Maximum Likelihood Estimate, whereas the hinge loss
– optimizes until the loss is zero
– saturates whenever it has learned a class “well enough”
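For comparison, a sketch of the multi-class hinge loss on the same kind of scores (again with illustrative variable names):

```python
import numpy as np

def hinge_loss(s, y, margin=1.0):
    # L_i = Σ_{k != y_i} max(0, s_k - s_{y_i} + 1)
    margins = np.maximum(0.0, s - s[y] + margin)
    margins[y] = 0.0  # do not sum over the ground-truth class
    return np.sum(margins)

scores = np.array([2.0, 5.0, -1.0])
print(hinge_loss(scores, y=1))  # 0.0: all margins satisfied, the loss saturates
```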
Activation functions: Sigmoid
σ(x) = 1 / (1 + e^(−x))
In the backward pass, ∂L/∂x = (∂σ/∂x) · (∂L/∂σ). For an input far from zero (e.g., x = 6), the local gradient ∂σ/∂x is almost zero.
Saturated neurons kill the gradient flow.
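A quick numeric check of the saturation effect (a sketch; the upstream factor ∂L/∂σ is left out to isolate the local gradient):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # dσ/dx = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 6.0, 10.0]:
    print(x, sigmoid_grad(x))
# at x = 6 the local gradient is ≈ 0.0025, so almost nothing flows back
```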
Sigmoid outputs are also not zero-centered (more on zero-mean data later), which leads to inefficient, zig-zagging updates of the weights w_1, w_2.

Activation functions: tanh (LeCun 1991)
– Zero-centered
– Still saturates
Activation functions: ReLU
– Large and consistent gradients
– Does not saturate
– Fast convergence
– What happens if a ReLU outputs zero? Dead ReLU: it stops receiving gradient updates.

Generalizations of ReLU (e.g., Leaky ReLU)
– Linear regimes
– Do not die
– Do not saturate
– Increase the number of parameters
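A short sketch contrasting ReLU and Leaky ReLU and their local gradients (the 0.1 slope matches the Leaky ReLU definition used later in this recap):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 0 for x < 0: a unit stuck there receives no updates ("dead ReLU")
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.1):
    # small but non-zero slope for x < 0, so the unit does not die
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
```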
Data preprocessing: for images, subtract the mean image (AlexNet) or the per-channel mean (VGG-Net).
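A minimal sketch of both variants, assuming a batch of RGB images shaped (N, H, W, 3):

```python
import numpy as np

images = np.random.randint(0, 256, size=(8, 32, 32, 3)).astype(np.float32)  # toy data

# AlexNet-style: subtract the mean image (computed over the training set)
mean_image = images.mean(axis=0)                 # shape (32, 32, 3)
centered_a = images - mean_image

# VGG-style: subtract one mean value per channel
mean_per_channel = images.mean(axis=(0, 1, 2))   # shape (3,)
centered_b = images - mean_per_channel

print(centered_a.mean(), centered_b.mean())      # both ≈ 0
```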
Weight initialization
How we initialize the weights determines where optimization starts; we are not guaranteed to reach the optimum.
What if we initialize all weights to zero, w = 0? Each neuron computes f(Σ_i w_i x_i + b).
What happens to the gradients? There is no symmetry breaking: all neurons compute the same function, so their gradients are going to be the same and the neurons can never become different.
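A tiny sketch of the symmetry problem. It uses a constant initialization (all weights equal) rather than strictly zero so that the identical, non-zero gradients are visible; with w = 0 exactly, every gradient below would simply be zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # one input with 4 features
W1 = np.full((3, 4), 0.5)         # every weight identical (w = 0 is the extreme case)
W2 = np.full((1, 3), 0.5)

h = np.tanh(W1 @ x)               # all 3 hidden units compute the same function
out = W2 @ h
d_out = out - 1.0                 # gradient of a squared-error loss, target 1.0

# backprop by hand: gradient w.r.t. the first-layer weights
d_h = (W2.T @ d_out).ravel()
d_W1 = np.outer(d_h * (1.0 - h ** 2), x)

print(h)      # identical activations
print(d_W1)   # identical rows: no symmetry breaking, the units can never diverge
```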
Initialization with small random weights. Experiment:
– Network with 10 layers, 500 neurons each
– Tanh as activation function
– Input: unit Gaussian data

Forward pass: from the input to the last layer the activations shrink and eventually become zero. Each neuron computes f(Σ_i w_i x_i + b) with small w_i, so the outputs keep getting smaller.

Backward pass: the activation-function gradient is ok, but the gradients w.r.t. the weights depend on the (near-zero) activations, so the gradients vanish.

Initialization with large random weights, same experiment:
– Network with 10 layers, 500 neurons each
– Tanh as activation function
– Input: unit Gaussian data

Now everything is saturated: the tanh units sit at ±1 and, again, the gradients vanish.
Which standard deviation should we use for the weights? (Glorot 2010)

For one output s = Σ_{i=1}^n w_i x_i:
Var(s) = Var(Σ_{i=1}^n w_i x_i)
       = Σ_{i=1}^n Var(w_i x_i)   (w_i and x_i independent)
       = Σ_{i=1}^n [E(w_i)]² Var(x_i) + [E(x_i)]² Var(w_i) + Var(x_i) Var(w_i)
       = Σ_{i=1}^n Var(x_i) Var(w_i)   (zero mean)
       = (n Var(w)) Var(x)   (identically distributed)

The variance of the output gets multiplied by the number of inputs. How do we make sure the output has the same variance as the input? Set n Var(w) = 1, i.e.

Var(w) = 1/n

This mitigates the effect of activations going to zero.
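A minimal sketch of this rule (often called Xavier/Glorot initialization); fan_in is the number of inputs n to the layer:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    # Var(w) = 1 / n  ->  std = sqrt(1 / fan_in)
    std = np.sqrt(1.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = xavier_init(500, 500)
print(W.std())  # ≈ sqrt(1/500) ≈ 0.045
```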
For ReLU activations, which set half of the activations to zero, the variance is doubled instead:
Var(w) = 2/n   (He 2015)
It makes a huge difference!
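A sketch of why the factor matters: propagate unit-Gaussian data through a stack of ReLU layers and look at the scale of the activations (the layer sizes and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 500))  # unit-Gaussian input data

for name, scale in [("Xavier, Var(w)=1/n", 1.0), ("He, Var(w)=2/n", 2.0)]:
    h = x
    for _ in range(10):                                  # 10 layers, 500 neurons each
        n_in = h.shape[1]
        W = rng.normal(0.0, np.sqrt(scale / n_in), size=(n_in, 500))
        h = np.maximum(0.0, h @ W)                       # ReLU
    print(name, "-> std of last-layer activations:", round(float(h.std()), 4))
# with Var(w)=1/n the ReLU activations shrink towards zero layer by layer;
# with Var(w)=2/n their scale stays roughly constant
```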
Batch Normalization (Ioffe and Szegedy 2015)
x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])
E[x^(k)] is the mean of your mini-batch examples over feature k (D = #features, N = mini-batch size). The normalized activations are unit Gaussian (in our example), which lets us control the mean and variance of the inputs to your activation functions.

Batch normalization is usually inserted after Fully Connected (or Convolutional) layers and before non-linear activation functions.

Do we really want unit Gaussians before tanh? This normalization might not be the best for the network!
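A sketch of the normalization at training time for a mini-batch of N examples with D features:

```python
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    # x has shape (N, D): N = mini-batch size, D = #features
    mean = x.mean(axis=0)            # E[x^(k)] per feature k
    var = x.var(axis=0)              # Var[x^(k)] per feature k
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(32, 4))
x_hat = batchnorm_forward(x)
print(x_hat.mean(axis=0), x_hat.std(axis=0))  # ≈ 0 and ≈ 1 per feature
```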
Batch normalization adds a learnable scale and shift per feature (Ioffe and Szegedy 2015):
x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)]),   y^(k) = γ^(k) x̂^(k) + β^(k)
These parameters will be learned; the whole operation is a differentiable function, so we can backprop through it. γ^(k) and β^(k) let the network scale and shift the range of the normalized values. In particular, for
γ^(k) = √(Var[x^(k)]),   β^(k) = E[x^(k)]
the network can learn to undo the normalization.

Is it ok to treat dimensions separately? It has been shown empirically that even if features are not decorrelated, convergence is still faster with this method.

Bias terms in the preceding layer are not needed, because they will be cancelled out by BN anyway.
Batch normalization at test time (Ioffe and Szegedy 2015)
BN depends on the mini-batch: what happens at test time when we only get one image at a time?
– There is no chance to compute a meaningful mean and variance for
x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

Training vs. testing:
– Training: mean and variance are computed from mini-batch 1, mini-batch 2, mini-batch 3, …
– Testing: compute µ_test and σ²_test by running an exponentially weighted average across the training mini-batches.

BN leads to more stable gradients, and the network behaves similarly at training and test time when using BN.
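A sketch of how train and test mode typically differ, using an exponentially weighted running mean and variance (the momentum value 0.9 and the class interface are assumptions, not from the lecture):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, train=True):
        if train:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # exponentially weighted average across training mini-batches
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # test time: use the stored statistics, works even for a single image
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(4)
rng = np.random.default_rng(0)
for _ in range(100):                                                # "training"
    bn.forward(rng.normal(2.0, 3.0, size=(32, 4)), train=True)
print(bn.forward(rng.normal(2.0, 3.0, size=(1, 4)), train=False))  # single test image
```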
Credits: Deep Learning, Goodfellow et al.
– Underfitting: the training error is too big
– Overfitting: the generalization gap is too big
Weight decay (L2 regularization): the gradient step gets an extra term −λθ_k,
θ_{k+1} = θ_k − ε ∇_θ L(θ_k) − λθ_k
where ε is the learning rate and ∇_θ L(θ_k) is the gradient of the loss; the extra term pulls every weight towards zero.
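A sketch of one such update step (the loss gradient is just a placeholder here, set to zero to isolate the decay term):

```python
import numpy as np

def sgd_step_with_weight_decay(theta, grad, lr=0.1, lam=0.01):
    # θ_{k+1} = θ_k - ε ∇_θ L(θ_k) - λ θ_k
    return theta - lr * grad - lam * theta

theta = np.array([2.0, -1.0, 0.5])
grad = np.zeros(3)                 # even with zero loss gradient...
for _ in range(100):
    theta = sgd_step_with_weight_decay(theta, grad)
print(theta)                       # ...the weights decay towards zero
```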
Data augmentation: a classifier has to be invariant to a wide variety of transformations (pose, appearance, illumination). Since the training set cannot cover all of them, we generate additional training samples by applying plausible transformations to the existing images.
Random crops (Krizhevsky 2012)
Training:
– Pick a random L in [256, 480]
– Resize the training image so its short side is L
– Randomly sample crops of 224×224

Testing:
– Resize the image at N scales
– Take 10 fixed crops of 224×224: 4 corners + center, plus flips

When comparing two networks, make sure to use the same data augmentation! Consider data augmentation as part of your network design.
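A sketch of the training-time recipe in NumPy (the crude nearest-neighbour resize is only there to keep the example dependency-free; helper names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def resize_short_side(img, short):
    # nearest-neighbour resize so that min(H, W) == short
    h, w = img.shape[:2]
    scale = short / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    return img[rows][:, cols]

def random_crop_224(img):
    L = rng.integers(256, 481)            # pick a random L in [256, 480]
    img = resize_short_side(img, L)       # resize so the short side is L
    h, w = img.shape[:2]
    top, left = rng.integers(0, h - 223), rng.integers(0, w - 223)
    return img[top:top + 224, left:left + 224]

image = rng.integers(0, 256, size=(375, 500, 3), dtype=np.uint8)  # toy "photo"
print(random_crop_224(image).shape)  # (224, 224, 3)
```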
Early stopping: training time is also a hyperparameter.
– Stop once the validation error starts to grow, before the network overfits
– Early stopping restricts how far the parameters θ can travel from their initialization θ_0 towards the optimum θ*
– It acts as a regularizer without having to change the objective function

Ensembles: if the models' errors are uncorrelated, the expected (squared) error will decrease linearly with the ensemble size.
The models of an ensemble can, for example, be trained on different subsets of the data (Training Set 1, Training Set 2, Training Set 3).
Dropout (Srivastava 2014)
During the forward pass, randomly set neurons to zero.

Why does it work?
– Redundant representations: base your scores on more features (e.g., furry, has two eyes, has a tail, has paws, has two ears)
– It amounts to training a large ensemble of models (Model 1, Model 2, …), each on a different set of data (mini-batch) and with SHARED parameters
– It reduces co-adaptation between neurons

Dropout at test time: conditions at train and test time are not the same. For two inputs x, y with weights θ_1, θ_2 and dropout probability p = 0.5:
z = θ_1 x + θ_2 y
E[z] = ¼ (θ_1·0 + θ_2·0 + θ_1 x + θ_2·0 + θ_1·0 + θ_2 y + θ_1 x + θ_2 y) = ½ (θ_1 x + θ_2 y)
At test time we therefore scale by the keep probability: the weight scaling inference rule.
Drawback: dropout typically needs larger models and more training time.
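A sketch of the weight scaling inference rule with drop probability p = 0.5 (function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, p=0.5):
    # randomly set units to zero with probability p
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask

def dropout_test_weight_scaling(x, p=0.5):
    # weight scaling inference rule: multiply by the keep probability (1 - p)
    return x * (1.0 - p)

x = np.ones(100_000)
print(dropout_train(x).mean())                # ≈ 0.5: half the units were dropped
print(dropout_test_weight_scaling(x).mean())  # 0.5: matches the training-time expectation
```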
Neural Network: Depth and Width
Concept of a 'Neuron': inputs x_0, x_1, x_2 weighted by θ_0, θ_1, θ_2
Activation Functions (non-linearities)
– Sigmoid: σ(x) = 1 / (1 + e^(−x))
– tanh: tanh(x)
– ReLU: max(0, x)
– Leaky ReLU: max(0.1x, x)
Backpropagation
SGD Variations (Momentum, etc.)
Regularization: Dropout, Batch-Norm, Weight Regularization, Data Augmentation
– Batch-Norm: x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])
– Weight Regularization, e.g., L2-reg: R(θ) = Σ_i θ_i²
Weight Initialization (e.g., Xavier/2)
Why not just go deeper and get better?
– No structure!
– It's just brute force!
– Optimization becomes hard
– Performance plateaus / drops!