Fei-Fei Li & Andrej Karpathy, Lecture 6, 21 Jan 2015
Administrative
- A2 is out. It was late 2 days, so the due date will be shifted by ~2 days.
- We updated the project page with many pointers to datasets.
Widrow and Hoff, ~1960: Adaline
Rumelhart et al., 1986: the first time back-propagation became popular, in recognizable math
[Hinton and Salakhutdinov 2006]
(Assume X [N x D] is the data matrix, with each example in a row.)
Decorrelated data (the data has a diagonal covariance matrix); whitened data (the covariance matrix is the identity matrix).
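A minimal numpy sketch of this preprocessing pipeline (not from the slides; the data matrix is made up for illustration): zero-center, decorrelate via the eigenbasis of the covariance matrix, then whiten.

```python
import numpy as np

# Hypothetical data matrix: N examples in rows, D features in columns.
X = np.random.randn(100, 10) * 5.0 + 3.0

# Zero-center: subtract the per-feature mean.
X_centered = X - np.mean(X, axis=0)

# Decorrelate: project onto the eigenbasis of the covariance matrix,
# giving data with a diagonal covariance matrix.
cov = X_centered.T.dot(X_centered) / X_centered.shape[0]
U, S, _ = np.linalg.svd(cov)
X_decorr = X_centered.dot(U)

# Whiten: additionally divide by the per-dimension standard deviation,
# so the covariance matrix becomes (approximately) the identity.
X_white = X_decorr / np.sqrt(S + 1e-5)
```

The small `1e-5` guards against division by near-zero eigenvalues.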
Input layer (CIFAR-10 images, 3072 numbers) → hidden layer (50 hidden neurons) → 10 output neurons, one per class.
Warning: This is not optimal, but simplest! (More on this later)
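The simple setup described above can be sketched as follows (hypothetical code, not the slide's; it uses the naive small-random initialization, which, as warned, is the simplest but not optimal):

```python
import numpy as np

# Sizes from the slide: 3072 CIFAR-10 pixels, 50 hidden neurons, 10 classes.
D, H, C = 3072, 50, 10

# Naive small random initialization -- simplest, but not optimal
# (the fan-in-scaled initialization later in the lecture is better).
W1 = 0.0001 * np.random.randn(D, H)
b1 = np.zeros(H)
W2 = 0.0001 * np.random.randn(H, C)
b2 = np.zeros(C)

def forward(X):
    """Forward pass for a batch X of shape (N, D); returns class scores (N, C)."""
    h = np.maximum(0, X.dot(W1) + b1)  # ReLU hidden layer
    return h.dot(W2) + b2

scores = forward(np.random.randn(5, D))
```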
Returns the loss and the gradient for all parameters. Sanity check: disable regularization; the loss should be ~2.3, the "correct" initial value for 10 classes.
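Why ~2.3: with tiny random weights the softmax assigns roughly uniform probability to every class, so the expected initial loss is -log(1/num_classes). A quick check:

```python
import numpy as np

num_classes = 10
# With near-zero weights the softmax is ~uniform over the classes,
# so the expected initial loss is -log(1/num_classes).
expected_loss = -np.log(1.0 / num_classes)
print(round(expected_loss, 3))  # 2.303
```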
Crank up the regularization: the loss went up, good (sanity check).
The above code, details:
- the learning rate will stay constant
- full-batch gradient descent, not mini-batch SGD
- "epoch": the number of times we see the full training set
Loss barely changing: the learning rate must be too low (could also be that regularization is too high). Notice, though, that train/val accuracy goes to 20%; what's up with that? (Remember this is a softmax: accuracy depends only on the argmax of the scores, so it can improve even while the loss barely moves.)
Okay, now let's try learning rate 1e6. What could possibly go wrong?
3e-3 is still too high; the cost explodes. => The rough range of learning rates we should be cross-validating is somewhere in [1e-3 … 1e-5].
First stage: only a few epochs, to get a rough idea of which params work.
Second stage: longer running time, finer search.
… (repeat as necessary)
Tip for detecting explosions in the solver: if the cost is ever > 3 × the original cost, break out early.
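The explosion-detection tip can be sketched like this (hypothetical code; `step` stands in for one optimization iteration that returns the current cost):

```python
def run_solver(step, initial_cost, max_iters=1000):
    """Run an optimization loop, breaking out early if the cost explodes.

    `step` is a hypothetical callable returning the current cost; the
    explosion test follows the tip above: cost > 3 * the original cost.
    """
    cost = initial_cost
    for it in range(max_iters):
        cost = step()
        if cost > 3 * initial_cost:
            print('cost exploded, aborting')
            break
    return cost
```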
Note: it's best to optimize hyperparameters in log space.
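Sampling in log space means drawing the exponent uniformly, not the value itself; otherwise most samples would land near the top of the range. A sketch (the regularization range here is a hypothetical example; the learning-rate range is the rough one found above):

```python
import numpy as np

# Sample hyperparameters in log space: uniform over the exponent,
# not over the raw value.
lr = 10 ** np.random.uniform(-5, -3)   # learning rate in [1e-5, 1e-3]
reg = 10 ** np.random.uniform(-3, 3)   # hypothetical regularization range
```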
Adjust the range. 53% is relatively good for a 2-layer neural net with 50 hidden neurons. But this best cross-validation result is…
If the loss curve looks too linear: the learning rate is too low. If it doesn't decrease much: the learning rate might be too high.
The "width" of the curve is related to the batch size. This one looks too wide (noisy) => we might want to increase the batch size.
No gap between train and validation accuracy => increase model capacity.
Track the ratio between the magnitudes of the updates and of the parameter values: ~0.0002 / 0.02 = 0.01 (about okay). You want this to be somewhere around 0.01 - 0.001 or so.
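The update/value ratio can be monitored with a few lines like these (a sketch, not the slide's code; the norms could equally be means of absolute values):

```python
import numpy as np

def update_ratio(W, dW, learning_rate):
    """Ratio of update magnitude to parameter magnitude; aim for ~0.001-0.01."""
    update = -learning_rate * dW
    return np.linalg.norm(update) / np.linalg.norm(W)
```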
The neural networks practitioner: music = the loss function.
Consider what happens to the output distribution of neurons with different numbers of inputs (low or high).
(figure: output distributions for 10 inputs vs. 100 inputs)
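The effect in the figure can be reproduced with a quick experiment (hypothetical code): with unit-Gaussian inputs and weights, a linear neuron's pre-activation spread grows like the square root of the number of inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def output_std(fan_in, trials=20000):
    """Std of a linear neuron's pre-activation, unit-Gaussian inputs and weights."""
    x = rng.standard_normal((trials, fan_in))
    w = rng.standard_normal((trials, fan_in))
    return np.std(np.sum(x * w, axis=1))

# The spread grows like sqrt(fan_in): roughly sqrt(10) vs. sqrt(100).
print(output_std(10), output_std(100))
```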
Normalize by the square root of the number of incoming connections (fan-in) => ensures equal variance for each neuron in the network. (Tricky, subtle, but a very important topic; see the notes for details.)
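In code, the fan-in-scaled initialization is one line (the layer sizes below reuse the 2-layer example from earlier in the lecture):

```python
import numpy as np

fan_in, fan_out = 3072, 50  # sizes from the earlier 2-layer example

# Scale by 1/sqrt(fan_in) so every neuron's pre-activation starts with
# roughly unit variance, regardless of how many inputs it has.
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
```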
enforce maximum L2 norm
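A sketch of such a max-norm constraint (hypothetical code; the radius 3.0 is an assumed, though typical, choice): after each update, project any over-long incoming weight vector back onto the L2 ball.

```python
import numpy as np

def clip_max_norm(W, max_norm=3.0):
    """Project each neuron's incoming weight vector (a column of W)
    back onto the L2 ball of radius `max_norm`; shorter columns are
    left untouched. The radius 3.0 is a hypothetical choice."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / (norms + 1e-8))
    return W * scale
```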
[Srivastava et al.]
Example forward pass with a 3-layer network using dropout
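The slide's code itself did not survive the transcription; a sketch in its spirit (all names hypothetical) looks like this, with a binary mask drawn and applied after each hidden layer:

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(X, W1, b1, W2, b2, W3, b3):
    """Training-time forward pass of a 3-layer net with (plain) dropout."""
    H1 = np.maximum(0, X.dot(W1) + b1)   # first hidden layer (ReLU)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop
    H2 = np.maximum(0, H1.dot(W2) + b2)  # second hidden layer (ReLU)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop
    return H2.dot(W3) + b3               # output scores
```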
At test time all neurons are always active => we must scale the activations so that, for each neuron, the test-time output matches the expected training-time output: if the output of a neuron is x and the probability of keeping it is p, the expected output during training is px + (1-p)·0 = px. (This has the interpretation of an ensemble of all sub-networks.)
With inverted dropout (scaling by 1/p at training time), test time is unchanged.
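A minimal sketch of the inverted-dropout trick (hypothetical code): dividing the mask by p at train time keeps the expected activation at x, so the test-time pass needs no scaling at all.

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def dropout_train(H):
    """Inverted dropout: mask and scale by 1/p at train time..."""
    U = (np.random.rand(*H.shape) < p) / p
    return H * U

def dropout_test(H):
    """...so the test-time forward pass is unchanged."""
    return H
```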