Backpropagation
Why backpropagation
- Neural networks are sequences of parametrized functions h(x; θ)
  [Figure: a convolutional network x → conv (filters) → subsample → conv (filters) → subsample → linear (weights) → h(x; θ); the filters and weights together form the parameters θ]
- Parameters need to be set by minimizing some loss function over the training set:
  min_θ (1/N) Σ_{i=1}^N L(h(x_i; θ), y_i)
- Minimization through gradient descent requires computing the gradient (a toy gradient-descent sketch follows this slide):
  θ(t+1) = θ(t) − λ (1/N) Σ_{i=1}^N ∇L(h(x_i; θ), y_i)
  Writing z = h(x; θ), the chain rule gives ∇_θ L(z, y) = (∂L(z, y)/∂z) (∂z/∂θ)
- Backpropagation: a way to compute the gradient ∂z/∂θ
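As a concrete illustration of the gradient-descent update above, here is a minimal Python/NumPy sketch that fits a toy scalar model h(x; θ) = θx with a squared-error loss. The toy data and names are assumptions made for illustration only, not part of the slides.

import numpy as np

# Toy data: N scalar inputs x_i and targets y_i (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

theta = 0.0   # the parameter to learn
lam = 0.1     # learning rate (lambda in the slides)

for step in range(200):
    z = theta * x                      # forward: z = h(x; theta)
    grad = np.mean(2 * (z - y) * x)    # (1/N) sum_i dL/dtheta for L = (z - y)^2
    theta = theta - lam * grad         # gradient-descent update

print(theta)   # converges to roughly 3.0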
The gradient of convnets
[Figure: a chain of five functions, z_1 = f_1(x, w_1), z_2 = f_2(z_1, w_2), ..., z_5 = f_5(z_4, w_5) = z]
- By the chain rule:
  ∂z/∂w_3 = (∂z/∂z_3) (∂z_3/∂w_3)
  ∂z/∂z_3 = (∂z/∂z_4) (∂z_4/∂z_3)
  ∂z/∂z_2 = (∂z/∂z_3) (∂z_3/∂z_2),   ∂z/∂w_2 = (∂z/∂z_2) (∂z_2/∂w_2)
- Recurrence going backward!
Backpropagation for a sequence
- A sequence of n functions f_i: z_i = f_i(z_{i−1}, w_i), with z_0 = x and z = z_n
- Assume we can compute the partial derivatives of each function:
  ∂z_i/∂z_{i−1} = ∂f_i(z_{i−1}, w_i)/∂z_{i−1},   ∂z_i/∂w_i = ∂f_i(z_{i−1}, w_i)/∂w_i
- Use g(z_i) to store the gradient of z w.r.t. z_i, and g(w_i) for w_i
- Compute g(z_i) by iterating backwards (see the sketch after this slide):
  g(z_n) = ∂z/∂z_n = 1
  g(z_{i−1}) = (∂z/∂z_i) (∂z_i/∂z_{i−1}) = g(z_i) ∂z_i/∂z_{i−1}
- Use g(z_i) to compute the gradient of the parameters:
  g(w_i) = (∂z/∂z_i) (∂z_i/∂w_i) = g(z_i) ∂z_i/∂w_i
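A minimal NumPy sketch of this backward recurrence, assuming each layer is the scalar function z_i = f_i(z_{i−1}, w_i) = w_i * z_{i−1} so the local derivatives are easy to write down; the layer choice and the numbers are illustrative assumptions, not from the slides.

import numpy as np

# Toy sequence: z_i = f_i(z_{i-1}, w_i) = w_i * z_{i-1} (illustrative choice).
w = np.array([2.0, 0.5, 3.0, 1.5, 0.1])   # parameters w_1..w_n
x = 4.0                                    # input z_0

# Forward pass: store every intermediate z_i.
z = [x]
for wi in w:
    z.append(wi * z[-1])                   # z_i = w_i * z_{i-1}

# Backward pass: g(z_n) = 1, then iterate the recurrence backwards.
g_z = 1.0                                  # g(z_n) = dz/dz_n = 1
g_w = np.zeros_like(w)
for i in range(len(w), 0, -1):
    # Local derivatives of f_i(z_{i-1}, w_i) = w_i * z_{i-1}:
    dz_dzprev = w[i - 1]                   # dz_i / dz_{i-1}
    dz_dw = z[i - 1]                       # dz_i / dw_i
    g_w[i - 1] = g_z * dz_dw               # g(w_i) = g(z_i) * dz_i/dw_i
    g_z = g_z * dz_dzprev                  # g(z_{i-1}) = g(z_i) * dz_i/dz_{i-1}

print(g_w)   # gradient of z with respect to each w_i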
Loss as a function
[Figure: the convolutional network again, now followed by a loss computed from the network output and the label; the loss is just one more function at the end of the sequence]
Putting it all together: SGD training of ConvNets
- 1. Sample image and label
- 2. Pass the image through the network to get the loss (forward)
- 3. Backpropagate to get gradients (backward)
- 4. Take a step along the negative gradients to update the weights
- 5. Repeat! (a training-loop sketch follows)
[Figure: the convolutional network with an image as input and the loss computed against its label]
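To make the five steps concrete, here is a framework-free Python sketch of the loop structure. The functions forward and backward and the dataset object are placeholders standing in for the real network and data; they are assumptions for illustration, not code from the slides.

import numpy as np

def sgd_train(params, dataset, forward, backward, lr=0.01, steps=10000, rng=None):
    """Skeleton of SGD training; forward/backward are supplied by the network."""
    rng = rng or np.random.default_rng(0)
    for step in range(steps):
        # 1. Sample an image and its label.
        image, label = dataset[rng.integers(len(dataset))]
        # 2. Forward pass: compute the loss (and cache intermediates for backward).
        loss, cache = forward(params, image, label)
        # 3. Backward pass: backpropagate to get gradients w.r.t. the parameters.
        grads = backward(params, cache)
        # 4. Step along the negative gradient.
        for name in params:
            params[name] -= lr * grads[name]
        # 5. Repeat (the loop).
    return params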
Beyond sequences: computation graphs
- Arbitrary graphs of functions
- No distinction between intermediate outputs and
parameters
[Figure: an arbitrary graph of functions f, g, h, k, l connecting inputs x, y and parameters w, u to an output z]
Computation graph - Functions
- Each node implements two functions
- A “forward”
  - Computes the output given the input
- A “backward”
  - Computes the derivative of z w.r.t. the input, given the derivative of z w.r.t. the output
Computation graphs
[Figure: a single node f_i with inputs a, b, c and output d. In the forward pass it computes d from a, b, c; in the backward pass it receives ∂z/∂d and produces ∂z/∂a, ∂z/∂b, ∂z/∂c]
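A minimal sketch of such a node in Python, using an elementwise multiply as the example operation; the class and method names are invented here for illustration and are not taken from any particular framework.

import numpy as np

class MultiplyNode:
    """Example computation-graph node: d = a * b (elementwise)."""

    def forward(self, a, b):
        # Forward: compute the output given the inputs, caching them for backward.
        self.a, self.b = a, b
        return a * b

    def backward(self, dz_dd):
        # Backward: given dz/d(output), return dz/d(each input) via the chain rule.
        dz_da = dz_dd * self.b
        dz_db = dz_dd * self.a
        return dz_da, dz_db

# Usage: run forward once, then feed the upstream gradient into backward.
node = MultiplyNode()
d = node.forward(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
dz_da, dz_db = node.backward(np.ones_like(d))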
Neural network frameworks
Stochastic gradient descent
- Estimate the gradient from a minibatch of K sampled examples instead of the whole training set:
  θ(t+1) ← θ(t) − λ (1/K) Σ_{k=1}^K ∇L(h(x_{i_k}; θ(t)), y_{i_k})
- Noisy!
Momentum
- Average multiple gradient steps
- Use exponential averaging (sketched below):
  g(t) ← (1/K) Σ_{k=1}^K ∇L(h(x_{i_k}; θ(t)), y_{i_k})
  p(t) ← μ g(t) + (1 − μ) p(t−1)
  θ(t+1) ← θ(t) − λ p(t)
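A minimal sketch of this exponential averaging in Python; grad_fn is a placeholder for the minibatch gradient computation and is an assumption for illustration.

import numpy as np

def sgd_momentum(theta, grad_fn, lr=0.01, mu=0.9, steps=1000):
    """SGD with an exponentially averaged gradient, following the slide's convention."""
    p = np.zeros_like(theta)                # p(0): running average of gradients
    for t in range(steps):
        g = grad_fn(theta)                  # minibatch gradient g(t)
        p = mu * g + (1 - mu) * p           # p(t) = mu*g(t) + (1-mu)*p(t-1)
        theta = theta - lr * p              # theta(t+1) = theta(t) - lambda*p(t)
    return theta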
Weight decay
- Add a decay term proportional to −θ to each update, so the weights shrink slightly at every step
- Prevents θ from growing to infinity
- Equivalent to L2 regularization of the weights
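A sketch of one update with weight decay; the coefficient name wd is my own, and grad stands in for the minibatch gradient.

import numpy as np

def sgd_step_with_weight_decay(theta, grad, lr=0.01, wd=1e-4):
    """One SGD step with weight decay: shrink the weights a little every update."""
    # Equivalent to using the gradient of L + (wd/2)*||theta||^2 (L2 regularization).
    return theta - lr * (grad + wd * theta)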
Learning rate decay
- A large step size / learning rate gives faster convergence initially
- But causes bouncing around at the end because of noisy gradients
- So the learning rate must be decreased over time
- Usually done in steps, dropping the rate by a constant factor every so many iterations (sketch below)
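A small sketch of a step schedule; the specific drop factor and interval below are illustrative assumptions, not values from the slides.

def step_decay_lr(base_lr, step, drop_every=30000, factor=0.1):
    """Piecewise-constant learning rate: multiply by `factor` every `drop_every` steps."""
    return base_lr * (factor ** (step // drop_every))

# e.g. base_lr=0.01 -> 0.01 for steps 0..29999, then 0.001 for 30000..59999, ...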
Convolutional network training
- Initialize network
- Sample minibatch of images
- Forward pass to compute loss
- Backpropagate loss to compute gradient
- Combine the gradient with momentum and weight decay
- Take a step according to the current learning rate (the pieces are combined in the sketch below)
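Pulling the pieces together, here is a sketch of one parameter update that combines the minibatch gradient with weight decay, momentum, and the current learning rate. The names wd and mu and the decay schedule are my own illustrative assumptions; grad stands in for the backpropagated minibatch gradient.

import numpy as np

def update_step(theta, p, grad, step, base_lr=0.01, mu=0.9, wd=1e-4):
    """One training update: gradient + weight decay, exponentially averaged, scaled by the current learning rate."""
    lr = base_lr * (0.1 ** (step // 30000))   # step learning-rate decay (illustrative schedule)
    g = grad + wd * theta                     # add the weight-decay term to the gradient
    p = mu * g + (1 - mu) * p                 # momentum: exponential average of gradients
    theta = theta - lr * p                    # step along the negative (averaged) gradient
    return theta, p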
Setting hyperparameters
- How do we find a hyperparameter setting that works? Try it!
- Train on train, test on test?
- Picking hyperparameters that work for the test set = overfitting on the test set
- Instead, pick them on a separate validation set (a search-loop sketch follows)
[Figure: Train / Validation / Test split. Run training iterations, test on validation, pick new hyperparameters, and repeat; test on the test set at the very end (ideally only once)]
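A sketch of this loop in Python; train_model and evaluate are placeholder functions assumed for illustration, not part of the slides.

def pick_hyperparameters(candidates, train_set, val_set, test_set, train_model, evaluate):
    """Choose hyperparameters on the validation set; touch the test set only once."""
    best_hp, best_model, best_val = None, None, float("inf")
    for hp in candidates:                       # e.g. a list of candidate learning rates
        model = train_model(train_set, hp)      # train on train
        val_err = evaluate(model, val_set)      # test on validation
        if val_err < best_val:
            best_hp, best_model, best_val = hp, model, val_err
    test_err = evaluate(best_model, test_set)   # test on test, ideally only once
    return best_hp, test_err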
Vagaries of optimization
- Non-convex
- Local optima
- Sensitivity to initialization
- Vanishing / exploding gradients:
  ∂z/∂z_i = (∂z/∂z_{n−1}) (∂z_{n−1}/∂z_{n−2}) ⋯ (∂z_{i+1}/∂z_i)
- If each term is (much) greater than 1 → gradients explode
- If each term is (much) less than 1 → gradients vanish
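A tiny numeric illustration of why long products of per-layer derivative terms explode or vanish; the factor values 1.1 and 0.9 are arbitrary choices for the demo.

# Product of 100 per-layer derivative terms, as in the chain-rule expression above.
n = 100
print(1.1 ** n)   # each term slightly > 1  -> about 1.4e4 (exploding gradient)
print(0.9 ** n)   # each term slightly < 1  -> about 2.7e-5 (vanishing gradient)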
Image Classification
How to do machine learning
- Create training / validation sets
- Identify loss functions
- Choose hypothesis class
- Find best hypothesis by minimizing training loss
Multiclass classification!
- The network outputs a vector of scores: h(x) = s
- A softmax turns scores into probabilities: p̂(y = k | x) ∝ e^{s_k}, i.e. p̂(y = k | x) = e^{s_k} / Σ_j e^{s_j}
- The loss is the negative log-likelihood of the true label: L(h(x), y) = − log p̂(y | x)  (a sketch follows)
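A minimal NumPy sketch of the softmax and the negative log-likelihood loss for a single example; the score values are made up for illustration.

import numpy as np

def softmax_nll(scores, label):
    """Loss for one example: -log p(y=label | x), with a softmax over the class scores."""
    scores = scores - scores.max()                  # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # p(y=k|x) = e^{s_k} / sum_j e^{s_j}
    return -np.log(probs[label])

# Example: 10 class scores, true class 3.
loss = softmax_nll(np.array([0.1, -2.0, 0.3, 2.5, 0.0, 1.0, -1.0, 0.2, 0.4, -0.5]), label=3)
print(loss)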
Building a convolutional network
[Figure: conv + relu + subsample → conv + relu + subsample → conv + relu + subsample → average pool → linear → 10 classes]
MNIST Classification

Method                         Error rate (%)
Linear classifier over pixels  12
Kernel SVM over HOG            0.56
Convolutional Network          0.8
ImageNet
- 1000 categories
- ~1000 instances per category
Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.
ImageNet
- Top-5 error: the algorithm makes 5 predictions, and the true label must be among the top 5 (computed in the sketch below)
- Useful for incomplete labelings
[Figure: the ILSVRC challenge winner's accuracy, 2010-2012]
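A small sketch of how top-5 error can be computed from class scores; the array shapes noted in the comments are illustrative assumptions.

import numpy as np

def top5_error(scores, labels):
    """Fraction of examples whose true label is NOT among the 5 highest-scoring classes."""
    # scores: (num_examples, num_classes) array; labels: (num_examples,) integer array.
    top5 = np.argsort(-scores, axis=1)[:, :5]        # indices of the 5 largest scores per example
    hit = (top5 == labels[:, None]).any(axis=1)      # is the true label among the top 5?
    return 1.0 - hit.mean()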
Convolutional Networks