Optimization and (Under/over)fitting
EECS 442 - David Fouhey, Fall 2019, University of Michigan
http://web.eecs.umich.edu/~fouhey/teaching/EECS442_F19/
Regularized Least Squares
Add regularization to objective that prefers some solutions:
Before: arg min_w ||y - Xw||_2^2   (loss)
After:  arg min_w ||y - Xw||_2^2 + λ||w||_2^2   (loss + regularization)
Trade-off: we want the model "smaller", so we pay a penalty for w with a big norm. Intuitive objective: an accurate model (low loss) that is not too complex (low regularization). λ controls how much of each.
Nearest neighbor recap: given known images with labels and a test image, (1) compute the distance between feature vectors, (2) find the nearest, (3) use its label.
What distance? What value for k / λ? Split the data: Training (use these data points for lookup), Validation (evaluate on these points for different k, λ, and distances), and Test.
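As a rough sketch of that lookup-plus-validation loop (the synthetic data, feature dimension, and candidate k values below are made-up placeholders, not anything from the lecture):

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Predict a label: (1) distances to all training features,
    (2) find the k nearest, (3) take the majority label."""
    dists = np.linalg.norm(train_X - query, axis=1)        # Euclidean distance
    nearest = np.argsort(dists)[:k]                        # indices of k closest points
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical split: training points are used for lookup,
# validation points are used to pick k (never the test set).
rng = np.random.default_rng(0)
train_X, train_y = rng.normal(size=(100, 16)), rng.integers(0, 3, 100)
val_X,   val_y   = rng.normal(size=(30, 16)),  rng.integers(0, 3, 30)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7]:                                     # candidate hyperparameters
    preds = [knn_predict(train_X, train_y, x, k) for x in val_X]
    acc = np.mean(np.array(preds) == val_y)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)
```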
w_0^T x: big if cat
w_1^T x: big if dog
w_2^T x: big if hippo
[Diagram: the cat, dog, and hippo weight vectors (rows of a weight matrix) are multiplied against the image's pixel vector to produce a cat score, dog score, and hippo score.]
Diagram by: Karpathy, Fei-Fei
The weight matrix is a collection of scoring functions, one per class. The prediction is a vector whose jth component is the "score" for the jth class.
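A minimal sketch of scoring with a weight matrix; the pixel values, weights, and biases below are placeholders for illustration, not the numbers from the diagram:

```python
import numpy as np

# Rows of W are per-class scoring functions: cat, dog, hippo.
W = np.array([[ 0.2, -0.5,  0.1,  2.0],
              [ 1.5,  1.3,  2.1,  0.0],
              [ 0.0,  0.3,  0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])           # per-class bias
x = np.array([56, 231, 24, 2])           # flattened image pixels

scores = W @ x + b                        # jth entry is the "score" for class j
print(scores, np.argmax(scores))          # argmax gives the predicted class
```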
Training (x_i, y_i):
arg min_W λ||W||_2^2 + Σ_{i=1}^n Σ_{j≠y_i} max(0, (Wx_i)_j - (Wx_i)_{y_i})
(Regularization; summed over all data points; inner sum over every class j that's NOT the correct one, y_i.) Pay no penalty if the prediction for class y_i is bigger than the prediction for j; otherwise, pay proportional to the score of the wrong class.
Inference (x):
arg max_k (Wx)_k
(Take the class whose weight vector gives the highest score.)
Training (x_i, y_i):
arg min_W λ||W||_2^2 + Σ_{i=1}^n Σ_{j≠y_i} max(0, (Wx_i)_j - (Wx_i)_{y_i} + m)
(Regularization; summed over all data points; inner sum over every class j that's NOT the correct one, y_i.) Pay no penalty if the prediction for class y_i is bigger than the prediction for j by m (the "margin"); otherwise, pay proportional to the score of the wrong class.
Inference (x):
arg max_k (Wx)_k
(Take the class whose weight vector gives the highest score.)
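A short sketch of this training loss in numpy; the number of classes, the margin value, and the toy data are illustrative assumptions:

```python
import numpy as np

def multiclass_svm_loss(W, X, y, lam=1e-3, m=1.0):
    """lam*||W||^2 + sum_i sum_{j != y_i} max(0, (W x_i)_j - (W x_i)_{y_i} + m)."""
    scores = X @ W.T                                   # (n, num_classes)
    correct = scores[np.arange(len(y)), y][:, None]    # score of the true class, per row
    margins = np.maximum(0, scores - correct + m)      # hinge applied to every class
    margins[np.arange(len(y)), y] = 0                  # j == y_i pays no penalty
    return lam * np.sum(W * W) + np.sum(margins)

# Tiny made-up example: 5 points, 4 features, 3 classes.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 4)), np.array([0, 2, 1, 1, 0])
W = rng.normal(size=(3, 4))
print(multiclass_svm_loss(W, X, y))
```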
Useful book (Free too!): The Elements of Statistical Learning Hastie, Tibshirani, Friedman https://web.stanford.edu/~hastie/ElemStatLearn/
Converting Scores to a "Probability Distribution"
Scores:      cat -0.9, dog 0.4, hippo 0.6
exp(score):  0.41, 1.49, 1.82   (sum = 3.72)
Normalize:   0.11, 0.40, 0.49   = P(cat), P(dog), P(hippo)
Generally, P(class j) = exp((Wx)_j) / Σ_k exp((Wx)_k). This is called the softmax function.
Inspired by: Karpathy, Fei-Fei
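A sketch of this conversion using the scores from the worked example (cat -0.9, dog 0.4, hippo 0.6):

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into a 'probability distribution'."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / np.sum(e)

scores = np.array([-0.9, 0.4, 0.6])       # cat, dog, hippo
print(softmax(scores))                    # ~ [0.11, 0.40, 0.49]
```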
Inference (x):
arg max_k (Wx)_k
(Take the class whose weight vector gives the highest score.)
P(class j): why can we skip the exp / sum-of-exp step to make a decision? (exp is monotonic and the denominator is the same for every class, so the ordering of the classes doesn't change.)
Inference (x):
arg max_k (Wx)_k
(Take the class whose weight vector gives the highest score.)
Training (x_i, y_i):
arg min_W λ||W||_2^2 + Σ_{i=1}^n -log( exp(w_{y_i}^T x_i) / Σ_k exp(w_k^T x_i) )
(Regularization; summed over all data points; the fraction is P(correct class).) Pay a penalty for not making the correct class likely: the "negative log-likelihood".
P(correct) = 1: no penalty! P(correct) = 0.9: 0.11 penalty. P(correct) = 0.5: 0.69 penalty. P(correct) = 0.05: 3.0 penalty.
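These penalties are just -log P(correct class); a quick check (using the natural log):

```python
import numpy as np

# Penalty = -log(probability assigned to the correct class).
for p in [1.0, 0.9, 0.5, 0.05]:
    print(p, -np.log(p))   # 0.0, ~0.11, ~0.69, ~3.0
```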
arg min_w L(w) works for lots of different Ls:
Softmax:    L(W) = λ||W||_2^2 + Σ_{i=1}^n -log( exp(w_{y_i}^T x_i) / Σ_k exp(w_k^T x_i) )
Regression: L(w) = λ||w||_2^2 + Σ_{i=1}^n (y_i - w^T x_i)^2
SVM:        L(w) = λ||w||_2^2 + Σ_{i=1}^n max(0, 1 - y_i w^T x_i)
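As a sketch, these three objectives can be written as plain functions of the weights; the data shapes, label conventions, and λ below are illustrative assumptions, not the course's code:

```python
import numpy as np

lam = 1e-2  # regularization strength (illustrative)

def softmax_loss(W, X, y):
    """lam*||W||^2 + sum_i -log P(correct class)."""
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return lam * np.sum(W * W) - log_probs[np.arange(len(y)), y].sum()

def regression_loss(w, X, y):
    """lam*||w||^2 + sum_i (y_i - w^T x_i)^2."""
    return lam * w @ w + np.sum((y - X @ w) ** 2)

def hinge_loss(w, X, y):
    """lam*||w||^2 + sum_i max(0, 1 - y_i * w^T x_i), labels y_i in {-1, +1}."""
    return lam * w @ w + np.sum(np.maximum(0, 1 - y * (X @ w)))

# Tiny usage with made-up data.
rng = np.random.default_rng(0)
X, W, w = rng.normal(size=(20, 5)), rng.normal(size=(3, 5)), rng.normal(size=5)
print(softmax_loss(W, X, rng.integers(0, 3, 20)))
print(regression_loss(w, X, rng.normal(size=20)), hinge_loss(w, X, rng.choice([-1, 1], size=20)))
```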
[Landscape plot: the global minimum is marked.]
We will illustrate with a 2D function (called the Booth Function) and other more-learning-focused functions. The principle carries over, but 2D space has qualitative differences from 1000D space.
See intro of: Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization NIPS 2014 https://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf
In these toy examples we can easily see the answer by plotting the function; in real problems, a single function evaluation may take hours.
Landscape diagram: Karpathy and Fei-Fei
# Grid search: systematically try things
best, bestScore = None, Inf
for dim1Value in dim1Values:
    ...
    for dimNValue in dimNValues:
        w = [dim1Value, ..., dimNValue]
        if L(w) < bestScore:
            best, bestScore = w, L(w)
return best
Pros: only requires being able to evaluate the model.
Cons: the number of evaluations grows exponentially with the number of dimensions; hopeless in high-dimensional spaces.
# Do random stuff, RANSAC style
best, bestScore = None, Inf
for iter in range(numIters):
    w = random(N, 1)   # sample
    score = L(w)       # evaluate
    if score < bestScore:
        best, bestScore = w, score
return best
Pros: only requires being able to sample a model and evaluate it.
Cons: it's like throwing darts at a high-dimensional dart board. [Diagram: the good parameters are a tiny region inside the space of all parameters.]
Use these when you can only evaluate the objective (e.g., "what is the validation accuracy if we use this distance function?"). Random search is arguably more effective; grid search makes it easy to systematically test something (people love certainty).
What's the geometric interpretation of the gradient ∇_w L(w)? Want: which is bigger (for small α), L(w) or L(w + α∇_w L(w))?
Given a starting point (blue): w_{i+1} = w_i - 9.8×10^-2 · gradient
∇_w L(w) = [ ∂L(w)/∂w_1, ..., ∂L(w)/∂w_N ]^T, where df(x)/dx = lim_{ε→0} ( f(x+ε) - f(x) ) / ε.
How do you compute this?
In practice, use the central difference
∂f(x)/∂x ≈ ( f(x+ε) - f(x-ε) ) / (2ε)
for each entry of ∇_w L(w) = [ ∂L(w)/∂w_1, ..., ∂L(w)/∂w_N ]^T.
How many function evaluations per dimension? (Two: f at x+ε and at x-ε, so 2N evaluations for the full gradient.)
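A minimal sketch of that numerical gradient with central differences; the loss L and epsilon here are placeholders:

```python
import numpy as np

def numerical_gradient(L, w, eps=1e-5):
    """Approximate each partial derivative with (L(w+eps*e_i) - L(w-eps*e_i)) / (2*eps).
    Costs two function evaluations per dimension."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (L(w + step) - L(w - step)) / (2 * eps)
    return grad

# Example: L(w) = ||w||^2 has gradient 2w.
w = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda v: v @ v, w))   # ~ [2, -4, 1]
```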
L(w) = λ||w||_2^2 + Σ_{i=1}^n (y_i - w^T x_i)^2
Note: if you look at other derivations, things are written either (y - w^T x) or (w^T x - y); the gradients will differ by a minus.
Recall the update: w = w - α∇_w L(w)
∇_w L(w)  =  2λw + Σ_{i=1}^n -2(y_i - w^T x_i) x_i
-∇_w L(w) = -2λw + Σ_{i=1}^n  2(y_i - w^T x_i) x_i
The -2λw term pushes w towards 0.
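Plugging that gradient into the update rule gives a minimal sketch; the step size, λ, iteration count, and toy data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam, alpha = 1e-2, 1e-3
w = np.zeros(3)
for _ in range(500):
    grad = 2 * lam * w - 2 * X.T @ (y - X @ w)   # gradient of lam*||w||^2 + sum (y_i - w^T x_i)^2
    w = w - alpha * grad                          # step against the gradient
print(w)                                          # close to w_true, shrunk slightly toward 0
```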
If y > w^T x (prediction too low): then w = w + αx for some α > 0. Before: w^T x. After: (w + αx)^T x = w^T x + α x^T x, which is larger since x^T x ≥ 0.
Derivatives/gradients: defined everywhere but 0.
d|y|/dy = sign(y) for y ≠ 0; undefined at y = 0.
Oh no! A discontinuity!
Subderivatives/subgradients: defined everywhere. In practice, at the discontinuity, pick a value on either side.
d|y|/dy = sign(y) for y ≠ 0; d|y|/dy ∈ [-1, 1] at y = 0.
In practice: derive the gradient analytically, then verify it against the numerical gradient (called a gradient check).
Slide: Karpathy and Fei-Fei
All data:
-∇_w L(w) = -2λw + Σ_{i=1}^n 2(y_i - w^T x_i) x_i
Subset B (a minibatch):
-∇_w L_B(w) = -2λw + Σ_{i∈B} 2(y_i - w^T x_i) x_i
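A sketch of using a random subset B per step instead of all the data; the batch size, step size, and synthetic data are assumptions:

```python
import numpy as np

def sgd_step(w, X, y, lam, alpha, batch_size, rng):
    """One minibatch step: estimate the gradient on a random subset B, then update."""
    B = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[B], y[B]
    grad = 2 * lam * w - 2 * Xb.T @ (yb - Xb @ w)   # same form as the full gradient, over B only
    return w - alpha * grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 3)), rng.normal(size=1000)
w = np.zeros(3)
for _ in range(200):
    w = sgd_step(w, X, y, lam=1e-2, alpha=1e-3, batch_size=32, rng=rng)
print(w)
```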
Learning rate matters: 10×10^-2 converges, 12×10^-2 diverges, 1×10^-2 falls short, 11×10^-2 oscillates (raw gradients).
Solution: average gradients with exponentially decaying weights, called "momentum".
With momentum (e.g., 0.25), the 11×10^-2 run converges instead of oscillating.
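A minimal sketch of that exponentially decaying average, reusing the slide's 11×10^-2 learning rate and 0.25 momentum; the quadratic bowl being optimized is a made-up stand-in for the slide's landscape:

```python
import numpy as np

def L(w):            # a simple ill-conditioned bowl (illustrative)
    return 0.5 * (17 * w[0] ** 2 + w[1] ** 2)

def grad(w):         # its gradient
    return np.array([17 * w[0], w[1]])

alpha, beta = 11e-2, 0.25                     # learning rate and momentum weight
w, velocity = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    velocity = beta * velocity + grad(w)      # exponentially decaying average of gradients
    w = w - alpha * velocity                  # step along the smoothed direction
print(w, L(w))                                # oscillation is damped compared to raw gradients
```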
Multiple Minima: Gradient Descent Finds a Local Minimum
Guess the minimum!
Dynamics are fairly complex. Many important functions are convex: any local minimum is a global minimum. Many important functions are not.
In practice: stochastic gradient descent (SGD) + momentum (the package implements it for you) + some sensibly changing learning rate.
Fancier alternatives: not clear (sometimes better, sometimes worse).
What we optimize: maximize training accuracy. What we actually want: maximize validation accuracy.
L(W) = λ||W||_2^2 + Σ_{i=1}^n -log( exp(w_{y_i}^T x_i) / Σ_k exp(w_k^T x_i) )
L(w) = λ||w||_2^2 + Σ_{i=1}^n (y_i - w^T x_i)^2
Let's fit a polynomial: given x, predict y. Note: we can do non-linear regression with copies of x raised to different powers (x, x^2, x^3, ...).
arg min_w ||y - Xw||_2^2
Weights w: one per polynomial degree. X: matrix of all polynomial degrees (row i is [1, x_i, x_i^2, ...]).
Data generated from the model: 1.5x^2 + 2.3x + 2 + N(0, 0.5)
Model has high variance: remove one point, and model changes dramatically
arg min_W λ||W||_2^2 + Σ_{i=1}^n -log( exp(w_{y_i}^T x_i) / Σ_k exp(w_k^T x_i) )
Regularization: a penalty for a complex model. The sum pays a penalty for the negative log-likelihood of the correct class.
Intuitively: big weights = a more complex model.
Model 1: 0.01*x1 + 1.3*x2 + -0.02*x3 + -2.1*x4 + 10
Model 2: 37.2*x1 + 13.4*x2 + 5.6*x3 + -6.1*x4 + 30
Again fitting a polynomial, but with regularization:
arg min_w ||y - Xw||_2^2 + λ||w||_2^2
No regularization: fits all data points. Regularization: can't fit all data points.
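A sketch of this experiment; the generating model and noise follow the slide, while the polynomial degree, λ, and number of points are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 10)
y = 1.5 * x**2 + 2.3 * x + 2 + rng.normal(0, 0.5, size=x.shape)   # model from the slide

degree = 9
X = np.vander(x, degree + 1)                       # matrix of all polynomial degrees

# No regularization: can fit essentially every data point -> high variance.
w_plain = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge regularization: arg min ||y - Xw||^2 + lam*||w||^2 -> can't fit every point.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))   # regularized weights are typically much smaller
```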
Error on new data comes from a combination of:
Bias: the model family can't capture the underlying data.
Variance: you estimate your model from limited data, so the fit is noisy.
Bias and variance trade off: fixing one hurts the other.
[Diagram: a spectrum from low bias / high variance to high bias / low variance.]
Diagram adapted from: D. Hoiem
[Diagram: test error and training error vs. model complexity, from high bias / low variance to low bias / high variance; the high-bias end is underfitting.]
Diagram adapted from: D. Hoiem
Small data: overfits even with a small model. Big data: overfits with a bigger model.
Underfitting: do poorly on both training and validation data due to bias. Solution: a more complex model (more features, less regularization).
Overfitting: do well on training data, but poorly on validation data due to variance. Solution: more data, a simpler model, or more regularization.
Cris Dima rule: first make sure you can overfit, then stop overfitting.
The following derivation is here to look at on your own time if you're curious.
arg min_W λ||W||_2^2 + Σ_{i=1}^n Σ_{j≠y_i} max(0, (Wx_i)_j - (Wx_i)_{y_i} + m)
Notation: W has rows w_i (i.e., one per-class scorer); (Wx_i)_j = w_j^T x_i.
arg min_W Σ_{k=1}^K λ w_k^T w_k + Σ_{i=1}^n Σ_{j≠y_i} max(0, w_j^T x_i - w_{y_i}^T x_i + m)
Derivation setup: Karpathy and Fei-Fei
Gradient of one hinge term with respect to an incorrect class's weight vector w_j (j ≠ y_i):
arg min_W Σ_{k=1}^K λ w_k^T w_k + Σ_{i=1}^n Σ_{j≠y_i} max(0, w_j^T x_i - w_{y_i}^T x_i + m)
If w_j^T x_i - w_{y_i}^T x_i + m ≤ 0: the max is 0, so the term contributes nothing to the gradient.
If w_j^T x_i - w_{y_i}^T x_i + m > 0: the term is linear in w_j, so its gradient is x_i.
Combined: 1(w_j^T x_i - w_{y_i}^T x_i + m > 0) x_i
Derivation setup: Karpathy and Fei-Fei
arg min_W Σ_{k=1}^K λ w_k^T w_k + Σ_{i=1}^n Σ_{j≠y_i} max(0, w_j^T x_i - w_{y_i}^T x_i + m)
Σ_{j≠y_i} 1(w_j^T x_i - w_{y_i}^T x_i + m > 0): if we do not predict the correct class by at least a score difference of m, we want the incorrect class's scoring vector to score that point lower. Recall: before: w^T x; after a step: (w - αx)^T x = w^T x - α x^T x.
Gradient with respect to the correct class's weight vector: Σ_{j≠y_i} 1(w_j^T x_i - w_{y_i}^T x_i + m > 0)(-x_i)
Derivation setup: Karpathy and Fei-Fei
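A sketch of these per-example gradients in code, checked against a numerical gradient; the data shapes and the margin are illustrative assumptions:

```python
import numpy as np

def svm_example_grad(W, x, y, m=1.0):
    """Gradient of sum_{j != y} max(0, w_j^T x - w_y^T x + m) w.r.t. each row of W."""
    scores = W @ x
    violated = (scores - scores[y] + m > 0)     # indicator per class
    violated[y] = False                         # skip j == y
    dW = np.zeros_like(W)
    dW[violated] = x                            # violated incorrect classes: gradient +x (a descent step lowers their score)
    dW[y] = -violated.sum() * x                 # correct class: -x per violation (a descent step raises its score)
    return dW

rng = np.random.default_rng(0)
W, x, y = rng.normal(size=(3, 4)), rng.normal(size=4), 1

# Quick numerical check of the analytic gradient (a gradient check).
def loss(W):
    s = W @ x
    margins = np.maximum(0, s - s[y] + 1.0)
    margins[y] = 0
    return margins.sum()

eps, num = 1e-5, np.zeros_like(W)
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps; Wm[idx] -= eps
    num[idx] = (loss(Wp) - loss(Wm)) / (2 * eps)
print(np.allclose(num, svm_example_grad(W, x, y)))   # True (away from the hinge kinks)
```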