
Optimization and (Under/over)fitting - EECS 442, David Fouhey, Fall 2019, University of Michigan


  1. Optimization and (Under/over)fitting. EECS 442 - David Fouhey. Fall 2019, University of Michigan. http://web.eecs.umich.edu/~fouhey/teaching/EECS442_F19/

  2. Regularized Least Squares. Add regularization to the objective that prefers some solutions. Before: argmin_w ||y - Xw||_2^2 (loss). After: argmin_w ||y - Xw||_2^2 + λ||w||_2^2 (loss + trade-off * regularization). We want the model "smaller": pay a penalty for a w with a big norm. Intuitive objective: an accurate model (low loss) but not too complex (low regularization); λ controls how much of each.
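As a concrete reference, the regularized objective can be written out in a few lines. This is a minimal sketch in plain Python; the function and argument names are mine, not the slide's.

```python
# Regularized least squares: ||y - Xw||^2 + lam * ||w||^2,
# where lam plays the role of the slide's lambda and X is a list of rows.

def ridge_objective(w, X, y, lam):
    loss = 0.0
    for xi, yi in zip(X, y):
        pred = sum(wj * xj for wj, xj in zip(w, xi))  # w^T x_i
        loss += (yi - pred) ** 2                      # squared residual
    reg = lam * sum(wj ** 2 for wj in w)              # ||w||^2 penalty
    return loss + reg
```

A larger lam makes the same w cost more, pushing the minimizer toward smaller norms.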

  3. Nearest Neighbors. Given a test image x_T and known images x_1, ..., x_N with labels (Cat, ..., Dog): (1) compute the distance E(x_i, x_T) between feature vectors; (2) find the nearest x_i; (3) use its label (Cat!).
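The three steps above can be sketched as a 1-nearest-neighbor rule. Euclidean distance is one common choice for E; the slide deliberately leaves the distance as a design decision (see the next slide), so treat it as an assumption here.

```python
# 1-NN: compute a distance E to every known image, take the closest label.
import math

def nearest_neighbor(test_x, known_xs, labels):
    def dist(a, b):  # E(a, b): Euclidean distance (one possible choice)
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    best = min(range(len(known_xs)), key=lambda i: dist(known_xs[i], test_x))
    return labels[best]
```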

  4. Picking Parameters. What distance? What value for k / λ? Split the data into Training, Validation, and Test: use the training points for lookup, and evaluate on the validation points for different k, λ, and distances.

  5. Linear Models. Example setup: 3 classes. Model: one weight vector per class, w_0, w_1, w_2. Want: w_0^T x big if cat, w_1^T x big if dog, w_2^T x big if hippo. Stack the weight vectors together into a matrix W, so the scores are Wx, where x is in R^F.

  6. Linear Models. The weight matrix W is a collection of scoring functions, one per class; the prediction is the vector Wx, where the jth component (Wx)_j is the score for the jth class. Worked example for input x = [56, 231, 24, 2]: cat weight vector [0.2, -0.5, 0.1, 2.0] with bias 1.1 gives cat score -96.8; dog weight vector [1.5, 1.3, 2.1, 0.0] with bias 3.2 gives dog score 437.9; hippo weight vector [0.0, 0.3, 0.2, -0.3] with bias -1.2 gives hippo score 61.95. Diagram by: Karpathy, Fei-Fei.
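The cat and dog rows of the score table can be reproduced as Wx + b. The input x = [56, 231, 24, 2] and the per-class bias are my reading of the garbled diagram, not stated outright in the text.

```python
# Scores as a matrix-vector product plus a per-class bias: (Wx + b)_j.

def linear_scores(W, b, x):
    return [sum(wj * xj for wj, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

W = [[0.2, -0.5, 0.1, 2.0],   # cat weight vector
     [1.5,  1.3, 2.1, 0.0]]   # dog weight vector
b = [1.1, 3.2]                # per-class biases (assumed reading)
x = [56, 231, 24, 2]          # input features (assumed reading)
s = linear_scores(W, b, x)    # [-96.8, 437.9], matching the slide
```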

  7. Objective 1: Multiclass SVM*. Inference (x): argmax_k (Wx)_k (take the class whose weight vector gives the highest score). Training on (x_i, y_i): argmin_W λ||W||_2^2 + Σ_{i=1}^n Σ_{k≠y_i} max(0, (Wx_i)_k - (Wx_i)_{y_i}). That is: regularization, plus, over all data points and every class k that is NOT the correct one (y_i), pay no penalty if the prediction for class y_i is bigger than for class k; otherwise, pay proportional to the score of the wrong class.

  8. Objective 1: Multiclass SVM. Inference (x): argmax_k (Wx)_k (take the class whose weight vector gives the highest score). Training on (x_i, y_i): argmin_W λ||W||_2^2 + Σ_{i=1}^n Σ_{k≠y_i} max(0, (Wx_i)_k - (Wx_i)_{y_i} + m). That is: regularization, plus, over all data points and every class k that is NOT the correct one (y_i), pay no penalty if the prediction for class y_i is bigger than for class k by m (the "margin"); otherwise, pay proportional to the score of the wrong class.
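The inner term of the training objective, max(0, s_k - s_y + m) summed over the wrong classes, can be sketched for a single example. The function name and the margin default are mine.

```python
# Multiclass SVM (hinge) loss for one example: for each wrong class k,
# pay max(0, s_k - s_y + m); pay nothing if the correct class wins by m.

def multiclass_svm_loss(scores, y, m=1.0):
    return sum(max(0.0, s_k - scores[y] + m)
               for k, s_k in enumerate(scores) if k != y)
```

With scores [3.0, 1.0, 2.5] and correct class 0, only the third class is within the margin, contributing max(0, 2.5 - 3.0 + 1.0) = 0.5.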

  9. Objective 1 is called a Support Vector Machine. There is lots of great theory as to why this is a sensible thing to do. See this useful book (free, too!): The Elements of Statistical Learning, Hastie, Tibshirani, Friedman. https://web.stanford.edu/~hastie/ElemStatLearn/

  10. Objective 2: Making Probabilities. Converting scores to a "probability distribution": apply exp(x) to each score, then normalize. Cat score -0.9 → e^{-0.9} = 0.41 → P(cat) = 0.11; dog score 0.4 → e^{0.4} = 1.49 → P(dog) = 0.40; hippo score 0.6 → e^{0.6} = 1.82 → P(hippo) = 0.49; Σ = 3.72. Generally, P(class j) = exp((Wx)_j) / Σ_k exp((Wx)_k). This is called the softmax function. Inspired by: Karpathy, Fei-Fei.
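The slide's worked example follows directly from the definition: exponentiate each score, then divide by the sum.

```python
# Softmax: exp each score, normalize so the outputs sum to 1.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-0.9, 0.4, 0.6])  # cat, dog, hippo scores from the slide
```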

  11. Objective 2: Softmax. Inference (x): argmax_k (Wx)_k (take the class whose weight vector gives the highest score). P(class j) = exp((Wx)_j) / Σ_k exp((Wx)_k). Why can we skip the exp/sum-exp step to make a decision?

  12. Objective 2: Softmax. Inference (x): argmax_k (Wx)_k (take the class whose weight vector gives the highest score). Training on (x_i, y_i): argmin_W λ||W||_2^2 + Σ_{i=1}^n -log(exp((Wx_i)_{y_i}) / Σ_k exp((Wx_i)_k)). That is: regularization, plus, over all data points, the negative log of P(correct class): pay a penalty for not making the correct class likely. This is the "negative log-likelihood".

  13. Objective 2: Softmax. P(correct) = 0.05: 3.0 penalty. P(correct) = 0.5: 0.69 penalty. P(correct) = 0.9: 0.11 penalty. P(correct) = 1: no penalty!
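The penalties above are just the negative (natural) log of P(correct class):

```python
# Negative log-likelihood penalty as a function of P(correct class).
import math

def nll(p_correct):
    return -math.log(p_correct)

# nll(0.05) ~ 3.0, nll(0.9) ~ 0.11, nll(1.0) = 0: a confident wrong
# answer is punished hard, a confident right answer costs nothing.
```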

  14. How Do We Optimize Things? Goal: find the w minimizing some loss function L: argmin_{w∈R^N} L(w). This works for lots of different Ls:
  L(W) = λ||W||_2^2 + Σ_{i=1}^n -log(exp((Wx_i)_{y_i}) / Σ_k exp((Wx_i)_k))
  L(w) = λ||w||_2^2 + Σ_{i=1}^n (y_i - w^T x_i)^2
  L(w) = C||w||_2^2 + Σ_{i=1}^n max(0, 1 - y_i w^T x_i)

  15. Sample Function to Optimize. f(x, y) = (x + 2y - 7)^2 + (2x + y - 5)^2. Global minimum at (1, 3).

  16. Sample Function to Optimize.
  • I'll switch back and forth between this 2D function (called the Booth Function) and other, more learning-focused functions.
  • The beauty of optimization is that it's all the same in principle.
  • But don't draw too many conclusions: 2D space has qualitative differences from 1000D space. See the intro of: Dauphin et al., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014. https://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf

  17. A Caveat.
  • Each point in the picture is a function evaluation.
  • Here it takes microseconds, so we can easily see the answer.
  • Functions we want to optimize may take hours to evaluate.

  18. A Caveat. Model in your head: moving around a landscape (axes running from -20 to 20) with a teleportation device. Landscape diagram: Karpathy and Fei-Fei.

  19. Option #1A - Grid Search

  # Systematically try things
  best, bestScore = None, float('inf')
  for dim1Value in dim1Values:
      ...
      for dimNValue in dimNValues:
          w = [dim1Value, ..., dimNValue]
          if L(w) < bestScore:
              best, bestScore = w, L(w)
  return best

  20. Option #1A - Grid Search

  21. Option #1A - Grid Search. Pros: 1. Super simple. 2. Only requires being able to evaluate the model. Cons: 1. Scales horribly to high-dimensional spaces. Complexity: samplesPerDim^numberOfDims.
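A runnable version of the grid-search pseudocode might look like the sketch below, using itertools.product to enumerate the grid; the cost is exactly the product of the per-dimension sample counts, as the complexity note warns.

```python
# Grid search: try every combination of per-dimension candidate values.
import itertools

def grid_search(L, dim_values):
    best, best_score = None, float('inf')
    for w in itertools.product(*dim_values):  # samplesPerDim^numberOfDims combos
        score = L(list(w))
        if score < best_score:
            best, best_score = list(w), score
    return best
```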

  22. Option #1B - Random Search

  # Do random stuff, RANSAC-style
  best, bestScore = None, float('inf')
  for iter in range(numIters):
      w = random(N, 1)    # sample
      score = L(w)        # evaluate
      if score < bestScore:
          best, bestScore = w, score
  return best

  23. Option #1B - Random Search

  24. Option #1B - Random Search. Pros: 1. Super simple. 2. Only requires being able to sample the model and evaluate it. Cons: 1. Slow: throwing darts at a high-dimensional dartboard. 2. Might miss something: if the good parameters occupy a fraction ε of each dimension's range [0, 1], then P(all N parameters good) = ε^N.

  25. When Do You Use Options 1A/1B? Use these when:
  • the number of dimensions is small and the space is bounded;
  • the objective is impossible to analyze (e.g., test accuracy if we use this distance function).
  Random search is arguably more effective; grid search makes it easy to systematically test something (people love certainty).

  26. Option 2 - Use The Gradient. Arrows: gradient.

  27. Option 2 - Use The Gradient. Arrows: gradient direction (scaled to unit length).

  28. Option 2 - Use The Gradient. Want: argmin_w L(w). The gradient is the vector of partial derivatives: ∇_w L(w) = [∂L/∂w_1, ..., ∂L/∂w_N]^T. What's the geometric interpretation of the gradient? Which is bigger (for small α): L(w) or L(w + α∇_w L(w))?

  29. Option 2 - Use The Gradient. Arrows: gradient direction (scaled to unit length). Points shown: x and x + αg.

  30. Option 2 - Use The Gradient. Method: at each step, move in the direction of the negative gradient.

  w = initialize()                 # initialize
  for iter in range(numIters):
      g = grad_L(w)                # eval gradient ∇_w L(w)
      w = w + -stepsize(iter) * g  # update w
  return w

  31. Gradient Descent. Given a starting point (blue), repeat: w_{i+1} = w_i + -9.8×10^{-2} × gradient.
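Running this update on the Booth function from slide 15, with the step size 9.8×10^{-2} shown here, converges to the global minimum (1, 3). The starting point below is my own choice, not the slide's.

```python
# Gradient descent on f(x, y) = (x + 2y - 7)**2 + (2x + y - 5)**2.

def booth_grad(x, y):
    a = x + 2 * y - 7
    b = 2 * x + y - 5
    return (2 * a + 4 * b, 4 * a + 2 * b)  # chain rule on both terms

x, y, step = -15.0, -15.0, 9.8e-2  # starting point is an assumption
for _ in range(500):
    gx, gy = booth_grad(x, y)
    x, y = x - step * gx, y - step * gy
# (x, y) is now extremely close to the minimum (1, 3)
```

For a quadratic like this, any step size below 2 / (largest Hessian eigenvalue) converges; 9.8×10^{-2} is safely inside that range.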

  32. Computing the Gradient. How do you compute the gradient ∇_w L(w) = [∂L(w)/∂w_1, ..., ∂L(w)/∂w_N]^T? Numerical method: how do you compute each partial derivative? By definition, ∂g(x)/∂x = lim_{ε→0} (g(x + ε) - g(x)) / ε. In practice, use the central difference: (g(x + ε) - g(x - ε)) / (2ε).

  33. Computing the Gradient. Numerical method: use (g(x + ε) - g(x - ε)) / (2ε) for each entry of ∇_w L(w). How many function evaluations per dimension?
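The central-difference recipe can be sketched directly; as the question hints, it costs two function evaluations per dimension. The names and the ε default are mine.

```python
# Numerical gradient via central differences:
# (L(w + eps*e_i) - L(w - eps*e_i)) / (2*eps) for each dimension i,
# i.e. 2 evaluations of L per dimension.

def numerical_gradient(L, w, eps=1e-5):
    grad = []
    for i in range(len(w)):
        w_plus = list(w);  w_plus[i] += eps
        w_minus = list(w); w_minus[i] -= eps
        grad.append((L(w_plus) - L(w_minus)) / (2 * eps))
    return grad
```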

  34. Computing the Gradient. Analytical method: use calculus! ∇_w L(w) = [∂L(w)/∂w_1, ..., ∂L(w)/∂w_N]^T.

  35. Computing the Gradient. For L(w) = λ||w||_2^2 + Σ_{i=1}^n (y_i - w^T x_i)^2, taking ∂/∂w gives ∇_w L(w) = 2λw + Σ_{i=1}^n -2(y_i - w^T x_i) x_i. Note: if you look at other derivations, things are written either as (y - w^T x) or (w^T x - y); the gradients will differ by a minus sign.
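The analytic gradient 2λw + Σ_i -2(y_i - w^T x_i) x_i can be sketched in a few lines; the names are mine, not the slide's.

```python
# Analytic gradient of the regularized least-squares loss:
# grad = 2*lam*w + sum_i -2*(y_i - w.x_i)*x_i, with X a list of rows.

def ridge_grad(w, X, y, lam):
    grad = [2 * lam * wj for wj in w]            # regularization term
    for xi, yi in zip(X, y):
        resid = yi - sum(wj * xj for wj, xj in zip(w, xi))  # y_i - w^T x_i
        for j in range(len(w)):
            grad[j] += -2 * resid * xi[j]        # data term
    return grad
```

Checking this against a numerical gradient (central differences) is the standard sanity test before trusting a hand-derived gradient.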

  36. Interpreting Gradients (1 Sample). Recall the update: w = w + -α ∇_w L(w). Here ∇_w L(w) = 2λw + -2(y - w^T x) x, so -∇_w L(w) = -2λw + 2(y - w^T x) x. The -2λw term pushes w towards 0. If y > w^T x (prediction too low), the update is w = w + αx for some α > 0. Before: w^T x. After: (w + αx)^T x = w^T x + α x^T x ≥ w^T x, so the prediction moves up toward y.

  37. Quick annoying detail: subgradients. What is the derivative of |x|? Derivatives/gradients are defined everywhere but 0: d|x|/dx = sign(x) for x ≠ 0, and undefined at x = 0. Oh no! A discontinuity!
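In practice the kink is handled by picking a subgradient: at x = 0 any value in [-1, 1] is a valid subgradient of |x|, and returning 0 (i.e., sign(0)) is a common choice. A minimal sketch:

```python
# Subgradient of |x|: sign(x) away from 0; at the kink, any value in
# [-1, 1] is valid, and 0 is the conventional pick.

def abs_subgrad(x):
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0  # any value in [-1, 1] works at x = 0
```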

