Optimization and (Under/over)fitting
EECS 442 - David Fouhey, Fall 2019, University of Michigan
http://web.eecs.umich.edu/~fouhey/teaching/EECS442_F19/
Regularized Least Squares
Add regularization to objective that prefers some solutions:
Before: arg min_w ||y - Xw||_2^2   (loss)
After:  arg min_w ||y - Xw||_2^2 + λ||w||_2^2   (loss + regularization)
Trade-off: we want the model "smaller", so we pay a penalty for w with a big norm. Intuitive objective: an accurate model (low loss) that is not too complex (low regularization). λ controls how much of each.
Nearest neighbor recap: given known images with labels and a test image, (1) compute the distance between feature vectors, (2) find the nearest, (3) use its label.
What distance? What value for k / λ? Split the data: Training (use these data points for lookup), Validation (evaluate on these points for different k, λ, and distances), and Test.
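As a rough sketch of that lookup-plus-validation loop (the synthetic data, feature dimension, and candidate k values below are made-up placeholders, not anything from the lecture):

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Predict a label: (1) distances to all training features,
    (2) find the k nearest, (3) take the majority label."""
    dists = np.linalg.norm(train_X - query, axis=1)        # Euclidean distance
    nearest = np.argsort(dists)[:k]                        # indices of k closest points
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical split: training points are used for lookup,
# validation points are used to pick k (never the test set).
rng = np.random.default_rng(0)
train_X, train_y = rng.normal(size=(100, 16)), rng.integers(0, 3, 100)
val_X,   val_y   = rng.normal(size=(30, 16)),  rng.integers(0, 3, 30)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7]:                                     # candidate hyperparameters
    preds = [knn_predict(train_X, train_y, x, k) for x in val_X]
    acc = np.mean(np.array(preds) == val_y)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)
```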
w_0^T x: big if cat
w_1^T x: big if dog
w_2^T x: big if hippo
[Diagram: the cat, dog, and hippo weight vectors (rows of a weight matrix) are multiplied against the image's pixel vector to produce a cat score, dog score, and hippo score.]
Diagram by: Karpathy, Fei-Fei
The weight matrix is a collection of scoring functions, one per class. The prediction is a vector whose jth component is the "score" for the jth class.
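A minimal sketch of scoring with a weight matrix; the pixel values, weights, and biases below are placeholders for illustration, not the numbers from the diagram:

```python
import numpy as np

# Rows of W are per-class scoring functions: cat, dog, hippo.
W = np.array([[ 0.2, -0.5,  0.1,  2.0],
              [ 1.5,  1.3,  2.1,  0.0],
              [ 0.0,  0.3,  0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])           # per-class bias
x = np.array([56, 231, 24, 2])           # flattened image pixels

scores = W @ x + b                        # jth entry is the "score" for class j
print(scores, np.argmax(scores))          # argmax gives the predicted class
```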
Training (x_i, y_i):
arg min_W λ||W||_2^2 + Σ_{i=1}^n Σ_{j≠y_i} max(0, (Wx_i)_j - (Wx_i)_{y_i})
(Regularization; summed over all data points; inner sum over every class j that's NOT the correct one, y_i.) Pay no penalty if the prediction for class y_i is bigger than the prediction for j; otherwise, pay proportional to the score of the wrong class.
Inference (x):
arg max_k (Wx)_k
(Take the class whose weight vector gives the highest score.)
Training (x_i, y_i):
arg min_W λ||W||_2^2 + Σ_{i=1}^n Σ_{j≠y_i} max(0, (Wx_i)_j - (Wx_i)_{y_i} + m)
(Regularization; summed over all data points; inner sum over every class j that's NOT the correct one, y_i.) Pay no penalty if the prediction for class y_i is bigger than the prediction for j by m (the "margin"); otherwise, pay proportional to the score of the wrong class.
Inference (x):
arg max_k (Wx)_k
(Take the class whose weight vector gives the highest score.)
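A short sketch of this training loss in numpy; the number of classes, the margin value, and the toy data are illustrative assumptions:

```python
import numpy as np

def multiclass_svm_loss(W, X, y, lam=1e-3, m=1.0):
    """lam*||W||^2 + sum_i sum_{j != y_i} max(0, (W x_i)_j - (W x_i)_{y_i} + m)."""
    scores = X @ W.T                                   # (n, num_classes)
    correct = scores[np.arange(len(y)), y][:, None]    # score of the true class, per row
    margins = np.maximum(0, scores - correct + m)      # hinge applied to every class
    margins[np.arange(len(y)), y] = 0                  # j == y_i pays no penalty
    return lam * np.sum(W * W) + np.sum(margins)

# Tiny made-up example: 5 points, 4 features, 3 classes.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 4)), np.array([0, 2, 1, 1, 0])
W = rng.normal(size=(3, 4))
print(multiclass_svm_loss(W, X, y))
```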
Useful book (Free too!): The Elements of Statistical Learning Hastie, Tibshirani, Friedman https://web.stanford.edu/~hastie/ElemStatLearn/
Converting Scores to a "Probability Distribution"
Scores:      cat -0.9, dog 0.4, hippo 0.6
exp(score):  0.41, 1.49, 1.82   (sum = 3.72)
Normalize:   0.11, 0.40, 0.49   = P(cat), P(dog), P(hippo)
Generally, P(class j) = exp((Wx)_j) / Σ_k exp((Wx)_k). This is called the softmax function.
Inspired by: Karpathy, Fei-Fei
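A sketch of this conversion using the scores from the worked example (cat -0.9, dog 0.4, hippo 0.6):

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into a 'probability distribution'."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / np.sum(e)

scores = np.array([-0.9, 0.4, 0.6])       # cat, dog, hippo
print(softmax(scores))                    # ~ [0.11, 0.40, 0.49]
```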
Inference (x):
arg max_k (Wx)_k
(Take the class whose weight vector gives the highest score.)
P(class j): why can we skip the exp / sum-of-exp step to make a decision? (exp is monotonic and the denominator is the same for every class, so the ordering of the classes doesn't change.)
Inference (x):
arg max_k (Wx)_k
(Take the class whose weight vector gives the highest score.)
Training (x_i, y_i):
arg min_W λ||W||_2^2 + Σ_{i=1}^n -log( exp(w_{y_i}^T x_i) / Σ_k exp(w_k^T x_i) )
(Regularization; summed over all data points; the fraction is P(correct class).) Pay a penalty for not making the correct class likely: the "negative log-likelihood".
P(correct) = 1: no penalty! P(correct) = 0.9: 0.11 penalty. P(correct) = 0.5: 0.69 penalty. P(correct) = 0.05: 3.0 penalty.
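These penalties are just -log P(correct class); a quick check (using the natural log):

```python
import numpy as np

# Penalty = -log(probability assigned to the correct class).
for p in [1.0, 0.9, 0.5, 0.05]:
    print(p, -np.log(p))   # 0.0, ~0.11, ~0.69, ~3.0
```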
arg min_w L(w) works for lots of different Ls:
Softmax:    L(W) = λ||W||_2^2 + Σ_{i=1}^n -log( exp(w_{y_i}^T x_i) / Σ_k exp(w_k^T x_i) )
Regression: L(w) = λ||w||_2^2 + Σ_{i=1}^n (y_i - w^T x_i)^2
SVM:        L(w) = λ||w||_2^2 + Σ_{i=1}^n max(0, 1 - y_i w^T x_i)
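As a sketch, these three objectives can be written as plain functions of the weights; the data shapes, label conventions, and λ below are illustrative assumptions, not the course's code:

```python
import numpy as np

lam = 1e-2  # regularization strength (illustrative)

def softmax_loss(W, X, y):
    """lam*||W||^2 + sum_i -log P(correct class)."""
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return lam * np.sum(W * W) - log_probs[np.arange(len(y)), y].sum()

def regression_loss(w, X, y):
    """lam*||w||^2 + sum_i (y_i - w^T x_i)^2."""
    return lam * w @ w + np.sum((y - X @ w) ** 2)

def hinge_loss(w, X, y):
    """lam*||w||^2 + sum_i max(0, 1 - y_i * w^T x_i), labels y_i in {-1, +1}."""
    return lam * w @ w + np.sum(np.maximum(0, 1 - y * (X @ w)))

# Tiny usage with made-up data.
rng = np.random.default_rng(0)
X, W, w = rng.normal(size=(20, 5)), rng.normal(size=(3, 5)), rng.normal(size=5)
print(softmax_loss(W, X, rng.integers(0, 3, 20)))
print(regression_loss(w, X, rng.normal(size=20)), hinge_loss(w, X, rng.choice([-1, 1], size=20)))
```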
[Landscape plot: the global minimum is marked.]
We will illustrate with a 2D function (called the Booth Function) and other more-learning-focused functions. The principle carries over, but 2D space has qualitative differences from 1000D space.
See intro of: Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization NIPS 2014 https://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf
In these toy examples we can easily see the answer by plotting the function; in real problems, a single function evaluation may take hours.
Landscape diagram: Karpathy and Fei-Fei
# Grid search: systematically try things
best, bestScore = None, Inf
for dim1Value in dim1Values:
    ...
    for dimNValue in dimNValues:
        w = [dim1Value, ..., dimNValue]
        if L(w) < bestScore:
            best, bestScore = w, L(w)
return best
Pros: only requires being able to evaluate the model.
Cons: the number of evaluations grows exponentially with the number of dimensions; hopeless in high-dimensional spaces.
# Do random stuff, RANSAC style
best, bestScore = None, Inf
for iter in range(numIters):
    w = random(N, 1)   # sample
    score = L(w)       # evaluate
    if score < bestScore:
        best, bestScore = w, score
return best
Pros: only requires being able to sample a model and evaluate it.
Cons: it's like throwing darts at a high-dimensional dart board. [Diagram: the good parameters are a tiny region inside the space of all parameters.]
Use these when you can only evaluate the objective (e.g., "what is the validation accuracy if we use this distance function?"). Random search is arguably more effective; grid search makes it easy to systematically test something (people love certainty).
What's the geometric interpretation of the gradient ∇_w L(w)? Want: which is bigger (for small α), L(w) or L(w + α∇_w L(w))?
Given a starting point (blue): w_{i+1} = w_i - 9.8×10^-2 · gradient
∇_w L(w) = [ ∂L(w)/∂w_1, ..., ∂L(w)/∂w_N ]^T, where df(x)/dx = lim_{ε→0} ( f(x+ε) - f(x) ) / ε.
How do you compute this?
In practice, use the central difference
∂f(x)/∂x ≈ ( f(x+ε) - f(x-ε) ) / (2ε)
for each entry of ∇_w L(w) = [ ∂L(w)/∂w_1, ..., ∂L(w)/∂w_N ]^T.
How many function evaluations per dimension? (Two: f at x+ε and at x-ε, so 2N evaluations for the full gradient.)
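A minimal sketch of that numerical gradient with central differences; the loss L and epsilon here are placeholders:

```python
import numpy as np

def numerical_gradient(L, w, eps=1e-5):
    """Approximate each partial derivative with (L(w+eps*e_i) - L(w-eps*e_i)) / (2*eps).
    Costs two function evaluations per dimension."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (L(w + step) - L(w - step)) / (2 * eps)
    return grad

# Example: L(w) = ||w||^2 has gradient 2w.
w = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda v: v @ v, w))   # ~ [2, -4, 1]
```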
L(w) = λ||w||_2^2 + Σ_{i=1}^n (y_i - w^T x_i)^2
Note: if you look at other derivations, things are written either (y - w^T x) or (w^T x - y); the gradients will differ by a minus.
Recall the update: w = w - α∇_w L(w)
∇_w L(w)  =  2λw + Σ_{i=1}^n -2(y_i - w^T x_i) x_i
-∇_w L(w) = -2λw + Σ_{i=1}^n  2(y_i - w^T x_i) x_i
The -2λw term pushes w towards 0.
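Plugging that gradient into the update rule gives a minimal sketch; the step size, λ, iteration count, and toy data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam, alpha = 1e-2, 1e-3
w = np.zeros(3)
for _ in range(500):
    grad = 2 * lam * w - 2 * X.T @ (y - X @ w)   # gradient of lam*||w||^2 + sum (y_i - w^T x_i)^2
    w = w - alpha * grad                          # step against the gradient
print(w)                                          # close to w_true, shrunk slightly toward 0
```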
If y > w^T x (prediction too low): then w = w + αx for some α > 0. Before: w^T x. After: (w + αx)^T x = w^T x + α x^T x, which is larger since x^T x ≥ 0.
Derivatives/gradients: defined everywhere but 0.
d|y|/dy = sign(y) for y ≠ 0; undefined at y = 0.
Oh no! A discontinuity!
Subderivatives/subgradients: defined everywhere. In practice, at the discontinuity, pick a value on either side.
d|y|/dy = sign(y) for y ≠ 0; d|y|/dy ∈ [-1, 1] at y = 0.
In practice: derive the gradient analytically, then verify it against the numerical gradient (called a gradient check).
Slide: Karpathy and Fei-Fei
All data:
-∇_w L(w) = -2λw + Σ_{i=1}^n 2(y_i - w^T x_i) x_i
Subset B (a minibatch):
-∇_w L_B(w) = -2λw + Σ_{i∈B} 2(y_i - w^T x_i) x_i
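A sketch of using a random subset B per step instead of all the data; the batch size, step size, and synthetic data are assumptions:

```python
import numpy as np

def sgd_step(w, X, y, lam, alpha, batch_size, rng):
    """One minibatch step: estimate the gradient on a random subset B, then update."""
    B = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[B], y[B]
    grad = 2 * lam * w - 2 * Xb.T @ (yb - Xb @ w)   # same form as the full gradient, over B only
    return w - alpha * grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 3)), rng.normal(size=1000)
w = np.zeros(3)
for _ in range(200):
    w = sgd_step(w, X, y, lam=1e-2, alpha=1e-3, batch_size=32, rng=rng)
print(w)
```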
Learning rate matters: 10×10^-2 converges, 12×10^-2 diverges, 1×10^-2 falls short, 11×10^-2 oscillates (raw gradients).
Solution: average gradients with exponentially decaying weights, called "momentum".
With momentum (e.g., 0.25), the 11×10^-2 run converges instead of oscillating.
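A minimal sketch of that exponentially decaying average, reusing the slide's 11×10^-2 learning rate and 0.25 momentum; the quadratic bowl being optimized is a made-up stand-in for the slide's landscape:

```python
import numpy as np

def L(w):            # a simple ill-conditioned bowl (illustrative)
    return 0.5 * (17 * w[0] ** 2 + w[1] ** 2)

def grad(w):         # its gradient
    return np.array([17 * w[0], w[1]])

alpha, beta = 11e-2, 0.25                     # learning rate and momentum weight
w, velocity = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    velocity = beta * velocity + grad(w)      # exponentially decaying average of gradients
    w = w - alpha * velocity                  # step along the smoothed direction
print(w, L(w))                                # oscillation is damped compared to raw gradients
```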
Multiple Minima: Gradient Descent Finds a Local Minimum
Guess the minimum!
Dynamics are fairly complex. Many important functions are convex: any local minimum is a global minimum. Many important functions are not.
In practice: stochastic gradient descent (SGD) + momentum (the package implements it for you) + some sensibly changing learning rate.
Fancier alternatives: not clear (sometimes better, sometimes worse).
What we optimize: maximize training accuracy. What we actually want: maximize validation accuracy.
L(W) = λ||W||_2^2 + Σ_{i=1}^n -log( exp(w_{y_i}^T x_i) / Σ_k exp(w_k^T x_i) )
L(w) = λ||w||_2^2 + Σ_{i=1}^n (y_i - w^T x_i)^2
Let's fit a polynomial: given x, predict y. Note: we can do non-linear regression with copies of x raised to different powers (x, x^2, x^3, ...).
arg min_w ||y - Xw||_2^2
Weights w: one per polynomial degree. X: matrix of all polynomial degrees (row i is [1, x_i, x_i^2, ...]).
Data generated from the model: 1.5x^2 + 2.3x + 2 + N(0, 0.5)
Model has high variance: remove one point, and model changes dramatically
arg min_W λ||W||_2^2 + Σ_{i=1}^n -log( exp(w_{y_i}^T x_i) / Σ_k exp(w_k^T x_i) )
Regularization: a penalty for a complex model. The sum pays a penalty for the negative log-likelihood of the correct class.
Intuitively: big weights = a more complex model.
Model 1: 0.01*x1 + 1.3*x2 + -0.02*x3 + -2.1*x4 + 10
Model 2: 37.2*x1 + 13.4*x2 + 5.6*x3 + -6.1*x4 + 30
Again fitting a polynomial, but with regularization:
arg min_w ||y - Xw||_2^2 + λ||w||_2^2
No regularization: fits all data points. Regularization: can't fit all data points.
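A sketch of this experiment; the generating model and noise follow the slide, while the polynomial degree, λ, and number of points are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 10)
y = 1.5 * x**2 + 2.3 * x + 2 + rng.normal(0, 0.5, size=x.shape)   # model from the slide

degree = 9
X = np.vander(x, degree + 1)                       # matrix of all polynomial degrees

# No regularization: can fit essentially every data point -> high variance.
w_plain = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge regularization: arg min ||y - Xw||^2 + lam*||w||^2 -> can't fit every point.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))   # regularized weights are typically much smaller
```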
Error on new data comes from a combination of:
Bias: the model family can't capture the underlying data.
Variance: you estimate your model from limited data, so the fit is noisy.
Bias and variance trade off: fixing one hurts the other.
[Diagram: a spectrum from low bias / high variance to high bias / low variance.]
Diagram adapted from: D. Hoiem
[Diagram: test error and training error vs. model complexity, from high bias / low variance to low bias / high variance; the high-bias end is underfitting.]
Diagram adapted from: D. Hoiem
Small data: overfits even with a small model. Big data: overfits with a bigger model.
Underfitting: do poorly on both training and validation data due to bias. Solution: a more complex model (more features, less regularization).
Overfitting: do well on training data, but poorly on validation data due to variance. Solution: more data, a simpler model, or more regularization.
Cris Dima rule: first make sure you can overfit, then stop overfitting.
The following derivation is here to look at on your own time if you're curious.
arg min_W λ||W||_2^2 + Σ_{i=1}^n Σ_{j≠y_i} max(0, (Wx_i)_j - (Wx_i)_{y_i} + m)
Notation: W has rows w_i (i.e., one per-class scorer); (Wx_i)_j = w_j^T x_i.
arg min_W Σ_{k=1}^K λ w_k^T w_k + Σ_{i=1}^n Σ_{j≠y_i} max(0, w_j^T x_i - w_{y_i}^T x_i + m)
Derivation setup: Karpathy and Fei-Fei
Gradient of one hinge term with respect to an incorrect class's weight vector w_j (j ≠ y_i):
arg min_W Σ_{k=1}^K λ w_k^T w_k + Σ_{i=1}^n Σ_{j≠y_i} max(0, w_j^T x_i - w_{y_i}^T x_i + m)
If w_j^T x_i - w_{y_i}^T x_i + m ≤ 0: the max is 0, so the term contributes nothing to the gradient.
If w_j^T x_i - w_{y_i}^T x_i + m > 0: the term is linear in w_j, so its gradient is x_i.
Combined: 1(w_j^T x_i - w_{y_i}^T x_i + m > 0) x_i
Derivation setup: Karpathy and Fei-Fei
arg min_W Σ_{k=1}^K λ w_k^T w_k + Σ_{i=1}^n Σ_{j≠y_i} max(0, w_j^T x_i - w_{y_i}^T x_i + m)
Σ_{j≠y_i} 1(w_j^T x_i - w_{y_i}^T x_i + m > 0): if we do not predict the correct class by at least a score difference of m, we want the incorrect class's scoring vector to score that point lower. Recall: before: w^T x; after a step: (w - αx)^T x = w^T x - α x^T x.
Gradient with respect to the correct class's weight vector: Σ_{j≠y_i} 1(w_j^T x_i - w_{y_i}^T x_i + m > 0)(-x_i)
Derivation setup: Karpathy and Fei-Fei
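A sketch of these per-example gradients in code, checked against a numerical gradient; the data shapes and the margin are illustrative assumptions:

```python
import numpy as np

def svm_example_grad(W, x, y, m=1.0):
    """Gradient of sum_{j != y} max(0, w_j^T x - w_y^T x + m) w.r.t. each row of W."""
    scores = W @ x
    violated = (scores - scores[y] + m > 0)     # indicator per class
    violated[y] = False                         # skip j == y
    dW = np.zeros_like(W)
    dW[violated] = x                            # violated incorrect classes: gradient +x (a descent step lowers their score)
    dW[y] = -violated.sum() * x                 # correct class: -x per violation (a descent step raises its score)
    return dW

rng = np.random.default_rng(0)
W, x, y = rng.normal(size=(3, 4)), rng.normal(size=4), 1

# Quick numerical check of the analytic gradient (a gradient check).
def loss(W):
    s = W @ x
    margins = np.maximum(0, s - s[y] + 1.0)
    margins[y] = 0
    return margins.sum()

eps, num = 1e-5, np.zeros_like(W)
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps; Wm[idx] -= eps
    num[idx] = (loss(Wp) - loss(Wm)) / (2 * eps)
print(np.allclose(num, svm_example_grad(W, x, y)))   # True (away from the hinge kinks)
```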