Lecture 5: Training Neural Networks, Part I
Fei-Fei Li & Andrej Karpathy & Justin Johnson
20 Jan 2016

Administrative
A1 is due today (midnight). I'm holding make-up office hours today: 5pm @ Gates 259.
A2 will be released ~tomorrow. It's meaty, but educational!
Transfer learning with CNNs:
1. Train on ImageNet
2. Adapt the network to your own data
If you have a small dataset: freeze all weights (treat the CNN as a fixed feature extractor) and retrain only the classifier, i.e. swap the Softmax layer at the end.
If you have a medium-sized dataset, “finetune” instead: use the old weights as initialization and train the full network or only some of the higher layers; retrain a bigger portion of the network, or even all of it.
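In code, the “fixed feature extractor” recipe amounts to training a new linear (softmax) classifier on top of frozen CNN features. A minimal numpy sketch, with all sizes, data, and the learning rate made up for illustration:

```python
import numpy as np

# Stand-ins: `feats` would be CNN codes (e.g. the last hidden layer's
# activations) extracted once from a net pretrained on ImageNet.
N, D, C = 1000, 4096, 10              # examples, feature dim, classes
feats = np.random.randn(N, D)
y = np.random.randint(C, size=N)

W = 0.001 * np.random.randn(D, C)     # only the new classifier is trained
for step in range(100):
    scores = feats.dot(W)
    scores -= scores.max(axis=1, keepdims=True)            # numeric stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()          # softmax loss
    dscores = probs
    dscores[np.arange(N), y] -= 1
    dW = feats.T.dot(dscores) / N                          # gradient wrt W
    W -= 1e-3 * dW                                         # SGD step; the CNN stays frozen
```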
Caffe has a “Model Zoo” of pretrained models: https://github.com/BVLC/caffe/wiki/Model-Zoo
[Animations of optimization dynamics; image credits to Alec Radford]
[Figure: backprop through a gate: the upstream gradient is multiplied by the gate's “local gradient” (chain rule).]
Graph (or Net) object. (Rough pseudocode)
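The slide's pseudocode itself did not survive extraction, so here is a runnable toy version of the idea: each gate implements forward() and backward(), and the net just runs the gates in topological order, then in reverse for backprop. The gate classes and the example f = (x + y) * z are illustrative, not the lecture's exact code.

```python
class AddGate:
    def forward(self, x, y):
        return x + y
    def backward(self, dout):
        return dout, dout                    # add distributes the upstream gradient

class MulGate:
    def forward(self, x, y):
        self.x, self.y = x, y                # cache inputs for the backward pass
        return x * y
    def backward(self, dout):
        return dout * self.y, dout * self.x  # local gradient * upstream gradient

# forward pass through the graph, gate by gate: q = x + y, f = q * z
add, mul = AddGate(), MulGate()
x, y, z = -2.0, 5.0, -4.0
q = add.forward(x, y)
f = mul.forward(q, z)
# backward pass in reverse order (chain rule applied at each gate)
dq, dz = mul.backward(1.0)
dx, dy = add.backward(dq)
print(f, dx, dy, dz)                         # -12.0 -4.0 -4.0 3.0
```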
“Fully-connected” layers: a “2-layer Neural Net” is also called a “1-hidden-layer Neural Net”; a “3-layer Neural Net” is a “2-hidden-layer Neural Net”.
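As a concrete picture of this naming, a forward pass for a “2-layer” net in numpy; the sizes below are illustrative:

```python
import numpy as np

x  = np.random.randn(3072)              # input (e.g. a CIFAR-10 image, flattened)
W1 = 0.01 * np.random.randn(100, 3072)  # input -> hidden (the one hidden layer)
W2 = 0.01 * np.random.randn(10, 100)    # hidden -> class scores
h = np.maximum(0, W1.dot(x))            # hidden activations (ReLU)
scores = W2.dot(h)                      # 10 class scores
```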
Frank Rosenblatt, ~1957: Perceptron
The Mark I Perceptron machine was the first implementation of the perceptron algorithm. The machine was connected to a camera that used 20×20 cadmium sulfide photocells to produce a 400-pixel image. It recognized letters of the alphabet.
update rule: w_i(t+1) = w_i(t) + α (d_j − y_j(t)) x_{j,i}
Widrow and Hoff, ~1960: Adaline/Madaline
Rumelhart et al., 1986: the first time back-propagation became popular (recognizable math).
[Hinton and Salakhutdinov 2006]: reinvigorated research in deep learning.
Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, George Dahl, Dong Yu, Li Deng, Alex Acero, 2010
ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
Activation functions. Sigmoid: σ(x) = 1 / (1 + e^(−x)). Squashes numbers to the range [0, 1]. Historically popular, since sigmoids have a nice interpretation as a saturating “firing rate” of a neuron.
3 problems: 1. Saturated neurons “kill” the gradients.
[Figure: a sigmoid gate in a computational graph. What happens to the gradient when x = −10? x = 0? x = 10?]
2. Sigmoid outputs are not zero-centered.
Consider what happens when the input x to a neuron is always positive: the gradients on w come out either all positive or all negative, so updates are restricted to two allowed gradient directions and zig-zag toward a hypothetical optimal w vector. (This is also why you want zero-mean data!)
3. exp() is a bit compute-expensive.
tanh(x) [LeCun et al., 1991]: squashes numbers to the range [−1, 1]; zero-centered (nice); still kills gradients when saturated.
ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]: computes f(x) = max(0, x). Does not saturate (in the + region), is very computationally efficient, and converges much faster than sigmoid/tanh in practice (e.g. 6×).
[Figure: a ReLU gate in a computational graph. What happens to the gradient when x = −10? x = 0? x = 10?]
Leaky ReLU [Maas et al., 2013] [He et al., 2015]: f(x) = max(0.01x, x).
Parametric Rectifier (PReLU) [Maas et al., 2013] [He et al., 2015]: f(x) = max(αx, x); backprop into α (a parameter).
Exponential Linear Units (ELU) [Clevert et al., 2015]: all the benefits of ReLU, does not die, and is closer to zero-mean outputs; but computation requires exp().
Maxout “neuron” [Goodfellow et al., 2013]: computes max(w1ᵀx + b1, w2ᵀx + b2). Generalizes ReLU and Leaky ReLU; linear regime, does not saturate, does not die; but it doubles the number of parameters per neuron.
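For reference, the activation functions above as plain numpy; a sketch with the usual default slopes:

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))      # squashes to [0, 1]
def tanh(x):               return np.tanh(x)                    # squashes to [-1, 1]
def relu(x):               return np.maximum(0, x)              # max(0, x)
def leaky_relu(x, a=0.01): return np.where(x >= 0, x, a * x)    # max(0.01x, x)
def elu(x, a=1.0):         return np.where(x >= 0, x, a * (np.exp(x) - 1))

def maxout(x, W1, b1, W2, b2):
    # max of two linear functions; doubles the parameters per neuron
    return np.maximum(W1.dot(x) + b1, W2.dot(x) + b2)
```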
Data preprocessing: zero-center, then normalize. (Assume X [N×D] is the data matrix, each example in a row.)
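In code the two steps are one line each (X below is a random stand-in for the real data matrix):

```python
import numpy as np

X = np.random.randn(100, 3072)   # N x D stand-in data
X -= np.mean(X, axis=0)          # zero-center: subtract the per-feature mean
X /= np.std(X, axis=0)           # normalize: divide by the per-feature std
```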
You may also see PCA and whitening of the data: decorrelated data (the data has a diagonal covariance matrix) and whitened data (the covariance matrix is the identity matrix).
TLDR: in practice for images, we do centering only; it is not common to normalize the variance, or to do PCA or whitening.
Weight initialization. First idea: small random numbers (gaussian with zero mean and 1e-2 standard deviation). E.g. a 10-layer net with 500 neurons on each layer, using tanh nonlinearities, initialized this way:
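A sketch in the spirit of the lecture demo; printing the statistics at each layer shows the activations collapsing toward zero under this initialization:

```python
import numpy as np

x = np.random.randn(1000, 500)                    # a batch of unit gaussian inputs
for i in range(10):                               # 10 layers of 500 tanh units
    W = 0.01 * np.random.randn(x.shape[1], 500)   # small random init
    x = np.tanh(x.dot(W))
    print('layer %d: mean %+.6f, std %.6f' % (i + 1, x.mean(), x.std()))
# the printed std shrinks toward 0 with depth: all activations become tiny
```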
Q: What do the gradients look like in this regime? Hint: think about the backward pass for a W*x gate.
Now try *1.0 instead of *0.01: almost all neurons become completely saturated at −1 or 1, and the gradients will be all zero.
“Xavier initialization” [Glorot et al., 2010]: scale the weights by the inverse square root of the fan-in. A reasonable initialization. (The mathematical derivation assumes linear activations.)
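In numpy, with illustrative layer sizes; the second variant is the ReLU-adjusted initialization from He et al., 2015 (cited below), which compensates for ReLU zeroing half its inputs:

```python
import numpy as np

fan_in, fan_out = 500, 500

# "Xavier" initialization [Glorot et al., 2010]
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# He et al. 2015: extra factor of 2 for ReLU units
W_relu = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
```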
Proper initialization is an active area of research:
Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015
All you need is a good init, Mishkin and Matas, 2015
…
Batch Normalization [Ioffe and Szegedy, 2015]: “you want unit gaussian activations? just make them so.”
Consider a batch of activations at some layer. To make each dimension unit gaussian, apply
x̂(k) = (x(k) − E[x(k)]) / √Var[x(k)]
This is a vanilla differentiable function. [Ioffe and Szegedy, 2015]
[Ioffe and Szegedy, 2015] BatchNorm is usually inserted after fully-connected (or convolutional) layers, and before the nonlinearity: FC → BN → tanh → FC → BN → tanh → …
Problem: do we necessarily want a unit gaussian input to a tanh layer?
[Ioffe and Szegedy, 2015] Normalize: x̂ = (x − E[x]) / √Var[x]. And then allow the network to squash the range if it wants to: y = γ x̂ + β. Note the network can learn γ = √Var[x], β = E[x] to recover the identity mapping.
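Putting the two steps together, a minimal training-time forward pass (the function and variable names here are mine, not the paper's):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) batch of activations; gamma, beta: (D,) learned scale and shift
    mu = x.mean(axis=0)                     # per-dimension mean E[x]
    var = x.var(axis=0)                     # per-dimension variance Var[x]
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to (approx) unit gaussian
    return gamma * x_hat + beta             # let the net squash/shift the range
```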
[Ioffe and Szegedy, 2015] Benefits: improves gradient flow through the network, allows higher learning rates, reduces the strong dependence on initialization, and acts as a form of regularization in a funny way, slightly reducing the need for dropout, maybe.
[Ioffe and Szegedy, 2015] Note: at test time the BatchNorm layer functions differently. The mean/std are not computed based on the batch; instead, a single fixed empirical mean of activations from training is used (e.g. it can be estimated during training with running averages).
Babysitting the learning process. Step 1: preprocess the data. (Assume X [N×D] is the data matrix, each example in a row.)
Step 2: choose the architecture. Say we start with one hidden layer of 50 neurons: the input layer takes CIFAR-10 images (3072 numbers), the hidden layer has 50 neurons, and there are 10 output neurons, one per class.
Double check that the loss is reasonable. The loss function returns the loss and the gradient for all parameters; disable regularization first. The loss comes out ~2.3, which is “correct” for 10 classes.
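The quick arithmetic behind that sanity check:

```python
import numpy as np
# with diffuse scores over 10 classes, the expected softmax loss is -ln(1/10)
print(-np.log(1.0 / 10))   # 2.302..., matching the ~2.3 above
```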
Crank up the regularization: the loss went up, good. (Another sanity check.)
Loss barely changing: the learning rate is probably too low. Notice that train/val accuracy still goes to 20%, though; what's up with that? (Remember this is a softmax: the loss can stay nearly flat while the correct classes' scores slowly become relatively larger, which is enough to improve accuracy.)
Okay, now let's try learning rate 1e6. What could possibly go wrong?
3e-3 is still too high: the cost explodes. ⇒ The rough range for the learning rate we should be cross-validating is somewhere in [1e-3 … 1e-5].
Hyperparameter optimization: cross-validate coarse to fine, in stages.
First stage: only a few epochs, to get a rough idea of which params work.
Second stage: longer running time, finer search. (Repeat as necessary.)
Tip for detecting explosions in the solver: if the cost is ever > 3 × the original cost, break out early.
Adjust the range and run a finer search: 53% is relatively good for a 2-layer neural net with 50 hidden neurons. But this best cross-validation result is worrying: the best values sit near the edge of the sampled range, so the optimum may lie outside it.
Random search vs. grid search: Random Search for Hyper-Parameter Optimization, Bergstra and Bengio, 2012.
Hyperparameters to play with: network architecture; learning rate, its decay schedule and update type; regularization (L2 / dropout strength). [Image: the neural networks practitioner as a DJ; the music = the loss function.]
My cross-validation “command center”
Monitor and visualize the loss curve. [Plot: loss vs. time.]
Loss function specimen
Monitor the gap between training and validation accuracy: no gap suggests underfitting ⇒ increase model capacity?
Track the ratio between the updates and the values: e.g. ~0.0002 / 0.02 = 0.01 (about okay); you want this to be somewhere around 0.001 or so.
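A sketch of computing that ratio inside a training loop (W, dW, and the learning rate below are stand-ins for your real parameters and gradients):

```python
import numpy as np

W = 0.01 * np.random.randn(500, 10)      # stand-in parameter matrix
dW = 0.001 * np.random.randn(500, 10)    # stand-in gradient from backprop
learning_rate = 1e-3

param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW             # vanilla SGD update
update_scale = np.linalg.norm(update.ravel())
W += update
print(update_scale / param_scale)        # want roughly ~1e-3
```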
Hyperparameter optimization: random sample hyperparams, in log space when appropriate.
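A final sketch of that log-space sampling (the ranges are hypothetical):

```python
import numpy as np

for _ in range(100):
    lr  = 10 ** np.random.uniform(-6, -3)   # sample learning rate in log space
    reg = 10 ** np.random.uniform(-5, 5)    # sample regularization in log space
    # ...train for a few epochs with (lr, reg) and record validation accuracy
```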