SLIDE 1

Optimization and (Under/over)fitting

EECS 442 – David Fouhey, Fall 2019, University of Michigan

http://web.eecs.umich.edu/~fouhey/teaching/EECS442_F19/

SLIDE 2

Regularized Least Squares

Add a regularization term to the objective that prefers some solutions over others.

Before (loss only):

\arg\min_w \|y - Xw\|_2^2

After (loss + regularization):

\arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2

Trade-off: we want the model "smaller", so we pay a penalty for a w with a big norm. Intuitive objective: an accurate model (low loss) that is not too complex (low regularization). λ controls how much of each.
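To make this concrete, here is a minimal NumPy sketch (not from the slides) that solves the regularized least-squares objective in closed form; the data and the value of lam are made up for illustration.

import numpy as np

# Toy data: 50 points, 5 features (made-up numbers for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=50)

lam = 0.1  # regularization strength (lambda)

# Closed form for argmin_w ||y - Xw||^2 + lam*||w||^2:
#   w = (X^T X + lam*I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w)   # close to w_true for this small lam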

SLIDE 3

Nearest Neighbors

[Figure: known images with labels (Cat, Dog, ...), a test image, and the distance D(., .) from each known image to the test image. The nearest neighbor of the test image is a cat, so the prediction is "Cat!"]

(1) Compute the distance between feature vectors, (2) find the nearest known image, (3) use its label.
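As a minimal sketch of this procedure (with made-up features and labels), nearest-neighbor classification is just a distance computation and an argmin:

import numpy as np

def nearest_neighbor_predict(train_feats, train_labels, test_feat):
    # (1) distance between feature vectors, (2) find nearest, (3) use its label
    dists = np.linalg.norm(train_feats - test_feat, axis=1)   # Euclidean distance
    return train_labels[np.argmin(dists)]

# Made-up 2D features and labels for illustration.
train_feats = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.2]])
train_labels = np.array(["cat", "dog", "cat"])
print(nearest_neighbor_predict(train_feats, train_labels, np.array([0.05, 0.1])))  # -> "cat"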

SLIDE 4

Picking Parameters

What distance function? What value for k or λ? Split the data into training, validation, and test sets: use the training points for lookup, and evaluate on the validation points for different choices of k, λ, and distance.

SLIDE 5

Linear Models

Example setup: 3 classes. Model: one weight vector per class, w_0, w_1, w_2.

Want: w_0^T x big if cat, w_1^T x big if dog, w_2^T x big if hippo.

Stack the weight vectors into a matrix W so the scores are Wx, where x ∈ R^F.

SLIDE 6

Linear Models

[Worked example, diagram by Karpathy, Fei-Fei: a 3×4 weight matrix W (a cat, a dog, and a hippo weight vector as rows) plus a per-class bias is applied to the flattened image x = [56, 231, 24, 2], giving scores of -96.8 (cat), 437.9 (dog), 61.95 (hippo).]

The weight matrix is a collection of scoring functions, one per class. The prediction is a vector whose jth component is the "score" for the jth class.
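A small sketch of the same idea in NumPy; the numbers follow the spirit of the diagram above and are otherwise arbitrary:

import numpy as np

# Made-up numbers: 3 classes (cat, dog, hippo), 4-dimensional flattened image.
W = np.array([[ 0.2, -0.5, 0.1,  2.0],   # cat scoring row
              [ 1.5,  1.3, 2.1,  0.0],   # dog scoring row
              [ 0.0,  0.3, 0.2, -0.3]])  # hippo scoring row
b = np.array([1.1, 3.2, -1.2])           # per-class bias
x = np.array([56., 231., 24., 2.])       # flattened image pixels

scores = W @ x + b                        # jth entry = score for class j
print(scores, scores.argmax())            # highest score -> predicted class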

SLIDE 7

Objective 1: Multiclass SVM*

Training (x_i, y_i):

\arg\min_W \lambda \|W\|_2^2 + \sum_{i=1}^n \sum_{k \neq y_i} \max(0, (Wx_i)_k - (Wx_i)_{y_i})

Regularization; then, over all data points and for every class k that is NOT the correct one (y_i): pay no penalty if the prediction for class y_i is bigger than for class k. Otherwise, pay in proportion to how much the wrong class's score exceeds it.

Inference (x):

\arg\max_k (Wx)_k

(Take the class whose weight vector gives the highest score.)

SLIDE 8

Objective 1: Multiclass SVM

Training (x_i, y_i):

\arg\min_W \lambda \|W\|_2^2 + \sum_{i=1}^n \sum_{k \neq y_i} \max(0, (Wx_i)_k - (Wx_i)_{y_i} + m)

Regularization; then, over all data points and for every class k that is NOT the correct one (y_i): pay no penalty if the prediction for class y_i is bigger than for class k by m (the "margin"). Otherwise, pay in proportion to how much the wrong class's score, plus the margin, exceeds it.

Inference (x):

\arg\max_k (Wx)_k

(Take the class whose weight vector gives the highest score.)
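A minimal sketch of evaluating this objective (regularizer plus hinge terms) with NumPy; the data, weights, and λ are made up:

import numpy as np

def multiclass_svm_loss(W, X, y, lam, m=1.0):
    """Regularizer + sum over points of sum over wrong classes of
    max(0, score_wrong - score_correct + m)."""
    scores = X @ W.T                                    # (num_points, num_classes)
    correct = scores[np.arange(len(y)), y][:, None]     # score of the true class
    margins = np.maximum(0.0, scores - correct + m)     # hinge term for every class
    margins[np.arange(len(y)), y] = 0.0                 # don't count the correct class
    return lam * np.sum(W ** 2) + margins.sum()

# Made-up data: 5 points, 4 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = np.array([0, 2, 1, 0, 2])
W = rng.normal(size=(3, 4))
print(multiclass_svm_loss(W, X, y, lam=0.1))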

SLIDE 9

Objective 1:

This objective is called the Support Vector Machine. There is a lot of great theory on why it is a sensible thing to do; see the reference below.

Useful book (Free too!): The Elements of Statistical Learning Hastie, Tibshirani, Friedman https://web.stanford.edu/~hastie/ElemStatLearn/

SLIDE 10

Objective 2: Making Probabilities

Converting scores to a "probability distribution" (worked example, inspired by Karpathy, Fei-Fei):

Scores: -0.9 (cat), 0.4 (dog), 0.6 (hippo)
exp: e^{-0.9} = 0.41, e^{0.4} = 1.49, e^{0.6} = 1.82; sum = 3.72
Normalize: 0.11, 0.40, 0.49 = P(cat), P(dog), P(hippo)

Generally, P(class k) = \exp((Wx)_k) / \sum_j \exp((Wx)_j).

This is called the softmax function.
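A minimal softmax sketch that reproduces the numbers above (subtracting the max is a standard numerical-stability trick, not something the slide requires):

import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; it does not change the result.
    e = np.exp(scores - scores.max())
    return e / e.sum()

print(softmax(np.array([-0.9, 0.4, 0.6])))   # roughly [0.11, 0.40, 0.49]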

SLIDE 11

Objective 2: Softmax

Inference (x):

\arg\max_k (Wx)_k

(Take the class whose weight vector gives the highest score.)

P(class k) = \exp((Wx)_k) / \sum_j \exp((Wx)_j)

Why can we skip the exp / sum-of-exp step to make a decision? (exp is monotonic and the denominator is shared, so the most probable class is also the highest-scoring class.)

SLIDE 12

Objective 2: Softmax

Inference (x):

\arg\max_k (Wx)_k

(Take the class whose weight vector gives the highest score.)

Training (x_i, y_i):

\arg\min_W \lambda \|W\|_2^2 + \sum_{i=1}^n -\log\left( \frac{\exp((Wx_i)_{y_i})}{\sum_j \exp((Wx_i)_j)} \right)

Regularization; then, over all data points, pay a penalty for not making the correct class likely (the term inside the log is P(correct class)). This is the "negative log-likelihood".

SLIDE 13

Objective 2: Softmax

P(correct) = 1: no penalty! P(correct) = 0.9: 0.11 penalty. P(correct) = 0.5: 0.69 penalty. P(correct) = 0.05: 3.0 penalty.
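These penalties are just -log P(correct); a small sketch:

import numpy as np

def nll(scores, correct_class):
    """-log P(correct class), with P given by the softmax of the scores."""
    e = np.exp(scores - scores.max())
    return -np.log(e[correct_class] / e.sum())

# Penalty grows as the model puts less probability on the right answer:
for p in [1.0, 0.9, 0.5, 0.05]:
    print(p, -np.log(p))                      # 0.0, ~0.11, ~0.69, ~3.0

print(nll(np.array([-0.9, 0.4, 0.6]), 2))     # hippo correct: -log(0.49), about 0.71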

SLIDE 14

How Do We Optimize Things?

Goal: find the w minimizing some loss function L:

\arg\min_{w \in R^N} L(w)

This works for lots of different Ls:

L(W) = \lambda \|W\|_2^2 + \sum_{i=1}^n -\log\left( \frac{\exp((Wx_i)_{y_i})}{\sum_j \exp((Wx_i)_j)} \right)

L(w) = \lambda \|w\|_2^2 + \sum_{i=1}^n (y_i - w^T x_i)^2

L(w) = \lambda \|w\|_2^2 + \sum_{i=1}^n \max(0, 1 - y_i w^T x_i)

SLIDE 15

Sample Function to Optimize

f(x, y) = (x + 2y - 7)^2 + (2x + y - 5)^2

Global minimum at (1, 3), where f = 0.

SLIDE 16

Sample Function to Optimize

  • I'll switch back and forth between this 2D function (called the Booth function) and other, more learning-focused functions.
  • The beauty of optimization is that it's all the same in principle.
  • But don't draw too many conclusions: 2D space has qualitative differences from 1000D space.

See the intro of: Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", NIPS 2014. https://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf

SLIDE 17

A Caveat

  • Each point in the picture is a function evaluation.
  • Here it takes microseconds, so we can easily see the answer.
  • Functions we want to optimize may take hours to evaluate.

SLIDE 18

A Caveat

Model in your head: moving around a landscape with a teleportation device.

Landscape diagram: Karpathy and Fei-Fei

SLIDE 19

Option #1A – Grid Search

# systematically try things
best, bestScore = None, Inf
for dim1Value in dim1Values:
    ....
    for dimNValue in dimNValues:
        w = [dim1Value, ..., dimNValue]
        if L(w) < bestScore:
            best, bestScore = w, L(w)
return best

SLIDE 20

Option #1A – Grid Search

SLIDE 21

Option #1A – Grid Search

Pros:
  • 1. Super simple
  • 2. Only requires being able to evaluate the model

Cons:
  • 1. Scales horribly to high-dimensional spaces

Complexity: samplesPerDim^numberOfDims

SLIDE 22

Option #1B – Random Search

# Do random stuff, RANSAC style
best, bestScore = None, Inf
for iter in range(numIters):
    w = random(N, 1)        # sample
    score = L(w)            # evaluate
    if score < bestScore:
        best, bestScore = w, score
return best

SLIDE 23

Option #1B – Random Search

SLIDE 24

Option #1B – Random Search

Pros:
  • 1. Super simple
  • 2. Only requires being able to sample the model and evaluate it

Cons:
  • 1. Slow: throwing darts at a high-dimensional dart board
  • 2. Might miss something

If the good parameters occupy a fraction ε of each of the N dimensions (out of the whole range, 1), then P(all correct) = ε^N.

SLIDE 25

When Do You Use Options 1A/1B?

Use these when:
  • The number of dimensions is small and the space is bounded.
  • The objective is impossible to analyze (e.g., test accuracy if we use this distance function).

Random search is arguably more effective; grid search makes it easy to systematically test something (people love certainty).

SLIDE 26

Option 2 – Use The Gradient

Arrows: gradient

SLIDE 27

Option 2 – Use The Gradient

Arrows: gradient direction (scaled to unit length)

SLIDE 28

Option 2 – Use The Gradient

Want: \arg\min_w L(w)

\nabla_w L(w) = [\partial L / \partial w_1, \ldots, \partial L / \partial w_N]^T

What's the geometric interpretation of the gradient? Which is bigger (for small α): L(w) or L(w + \alpha \nabla_w L(w))?

SLIDE 29

Option 2 – Use The Gradient

Arrows: gradient direction (scaled to unit length). The figure marks a point w and the point w + \alpha \nabla_w L(w).

SLIDE 30

Option 2 – Use The Gradient

w = initialize()                     # initialize
for iter in range(numIters):
    g = ∇_w L(w)                     # evaluate gradient
    w = w + -stepsize(iter) * g      # update w
return w

Method: at each step, move in the direction of the negative gradient.
SLIDE 31

Gradient Descent

Given a starting point (blue): w_{i+1} = w_i - 9.8×10^{-2} × gradient
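A minimal sketch of this exact setup: gradient descent on the Booth function with the slide's step size (the starting point is made up):

import numpy as np

def booth(w):
    x, y = w
    return (x + 2*y - 7)**2 + (2*x + y - 5)**2

def booth_grad(w):
    x, y = w
    # Partial derivatives of the Booth function.
    return np.array([2*(x + 2*y - 7) + 4*(2*x + y - 5),
                     4*(x + 2*y - 7) + 2*(2*x + y - 5)])

w = np.array([-15.0, 15.0])          # made-up starting point
for _ in range(100):
    w = w - 9.8e-2 * booth_grad(w)   # step size from the slide
print(w, booth(w))                    # approaches the minimum at (1, 3)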

SLIDE 32

Computing the Gradient

How do you compute the gradient? Numerical method:

\nabla_w L(w) = [\partial L(w) / \partial w_1, \ldots, \partial L(w) / \partial w_N]^T

\frac{\partial g(x)}{\partial x} = \lim_{\epsilon \to 0} \frac{g(x + \epsilon) - g(x)}{\epsilon}

How do you compute this? In practice, use central differences:

\frac{g(x + \epsilon) - g(x - \epsilon)}{2\epsilon}
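A minimal sketch of the central-difference estimate (note that it costs two function evaluations per dimension, which answers the next slide's question):

import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (f(w_plus) - f(w_minus)) / (2 * eps)   # two evaluations per dimension
    return g

f = lambda w: np.sum(w ** 2)                           # true gradient is 2w
print(numerical_gradient(f, np.array([1.0, -2.0, 3.0])))   # roughly [2, -4, 6]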

SLIDE 33

Computing the Gradient

How do you compute the gradient? Numerical method:

\nabla_w L(w) = [\partial L(w) / \partial w_1, \ldots, \partial L(w) / \partial w_N]^T

Use: \frac{g(x + \epsilon) - g(x - \epsilon)}{2\epsilon}

How many function evaluations per dimension?

SLIDE 34

Computing the Gradient

How do you compute the gradient? Analytical method:

\nabla_w L(w) = [\partial L(w) / \partial w_1, \ldots, \partial L(w) / \partial w_N]^T

Use calculus!

SLIDE 35

Computing the Gradient

L(w) = \lambda \|w\|_2^2 + \sum_{i=1}^n (y_i - w^T x_i)^2

\nabla_w L(w) = 2\lambda w + \sum_{i=1}^n -2 (y_i - w^T x_i) x_i

Note: if you look at other derivations, things are written as either (y - w^T x) or (w^T x - y); the gradients will differ by a minus sign.

SLIDE 36

Interpreting Gradients (1 Sample)

Recall the update: w = w + -\nabla_w L(w)

\nabla_w L(w) = 2\lambda w - 2 (y - w^T x) x, so -\nabla_w L(w) = -2\lambda w + 2 (y - w^T x) x

The -2\lambda w term pushes w towards 0.

If y > w^T x (the prediction is too low), then the update is w = w + \alpha x for some \alpha > 0.
Before: w^T x. After: (w + \alpha x)^T x = w^T x + \alpha x^T x, which is larger, so the prediction moves up.

SLIDE 37

Quick annoying detail: subgradients

What is the derivative of |x|?

\frac{\partial}{\partial x} |x| = \mathrm{sign}(x) for x ≠ 0; undefined at x = 0.

Derivatives/gradients are defined everywhere but 0. Oh no! A discontinuity!

SLIDE 38

Quick annoying detail: subgradients

Subgradient: any underestimate of the function (a slope that stays below it).

Subderivatives/subgradients are defined everywhere:

\frac{\partial}{\partial x} |x| = \mathrm{sign}(x) for x ≠ 0; at x = 0, the subderivative is anything in [-1, 1].

In practice: at the discontinuity, pick the value on either side.

SLIDE 39

Computing The Gradient

  • Numerical: foolproof but slow
  • Analytical: you can mess things up ☺
  • In practice: do analytical, but check with numerical (called a gradient check)

Slide: Karpathy and Fei-Fei
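A minimal sketch of such a gradient check, here on the regularized least-squares loss from a few slides back (data and λ are made up); the maximum relative error should be tiny if the analytic gradient is right:

import numpy as np

def gradient_check(f, analytic_grad, w, eps=1e-5):
    """Compare an analytic gradient against central differences at point w."""
    num = np.zeros_like(w)
    for i in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        num[i] = (f(wp) - f(wm)) / (2 * eps)
    ana = analytic_grad(w)
    rel_err = np.abs(ana - num) / np.maximum(1e-8, np.abs(ana) + np.abs(num))
    return rel_err.max()

# Regularized least squares on made-up data.
rng = np.random.default_rng(0)
X, y, lam = rng.normal(size=(20, 3)), rng.normal(size=20), 0.1
L      = lambda w: lam * w @ w + np.sum((y - X @ w) ** 2)
grad_L = lambda w: 2 * lam * w - 2 * X.T @ (y - X @ w)
print(gradient_check(L, grad_L, rng.normal(size=3)))  # should be ~1e-7 or smaller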

SLIDE 40

Implementing Gradient Descent

The loss is a function that we can evaluate over data.

All data:

-\nabla_w L(w) = -2\lambda w + \sum_{i=1}^n 2 (y_i - w^T x_i) x_i

Subset B:

-\nabla_w L_B(w) = -2\lambda w + \sum_{i \in B} 2 (y_i - w^T x_i) x_i

SLIDE 41

Implementing Gradient Descent

Option 1: Vanilla Gradient Descent. Compute the gradient of L over all data points.

for iter in range(numIters):
    g = gradient(data, L)
    w = w + -stepsize(iter) * g    # update w

SLIDE 42

Implementing Gradient Descent

Option 2: Stochastic Gradient Descent. Compute the gradient of L over 1 random sample.

for iter in range(numIters):
    index = randint(0, #data)
    g = gradient(data[index], L)
    w = w + -stepsize(iter) * g    # update w

SLIDE 43

Implementing Gradient Descent

Option 3: Minibatch Gradient Descent. Compute the gradient of L over a subset of B samples. Typical batch sizes: ~100.

for iter in range(numIters):
    subset = choose_samples(#data, B)
    g = gradient(data[subset], L)
    w = w + -stepsize(iter) * g    # update w
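Putting the pieces together, here is a minimal runnable sketch of minibatch SGD on the regularized least-squares loss (data, step size, and batch size are made up; choose_samples is replaced by a random choice):

import numpy as np

def minibatch_sgd(X, y, lam, stepsize, numIters, B=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for it in range(numIters):
        subset = rng.choice(len(y), size=min(B, len(y)), replace=False)
        Xb, yb = X[subset], y[subset]
        # Gradient of lam*||w||^2 + sum over the batch of (y_i - w^T x_i)^2
        g = 2 * lam * w - 2 * Xb.T @ (yb - Xb @ w)
        w = w - stepsize * g
    return w

# Made-up data for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1., -1., 2., 0., 0.5]) + 0.1 * rng.normal(size=1000)
print(minibatch_sgd(X, y, lam=0.1, stepsize=1e-3, numIters=500))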

SLIDE 44

Gradient Descent Details

The step size (also called the learning rate / lr) is a critical parameter.

10×10^{-2}: converges. 12×10^{-2}: diverges. 1×10^{-2}: falls short.

SLIDE 45

Gradient Descent Details

11×10^{-2}: oscillates (raw gradients)

SLIDE 46

Gradient Descent Details

One solution: start with an initial rate lr and multiply it by f every N iterations, e.g. init_lr = 10^{-1}, f = 0.1, N = 10K.
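A minimal sketch of that step-decay schedule (the specific numbers mirror the slide's example):

def stepsize(iteration, init_lr=1e-1, f=0.1, N=10000):
    """Step decay: start at init_lr, multiply by f every N iterations."""
    return init_lr * (f ** (iteration // N))

print(stepsize(0), stepsize(10000), stepsize(25000))   # roughly 0.1, 0.01, 0.001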

SLIDE 47

Gradient Descent Details

11×10^{-2}: oscillates (raw gradients)

Solution: average the gradients. With exponentially decaying weights, this is called "momentum".
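A minimal sketch of one common form of the momentum update, where the step is taken along an exponentially decaying running sum of gradients (the toy objective and hyperparameters are made up):

import numpy as np

def sgd_momentum(grad, w0, stepsize, numIters, beta=0.25):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(numIters):
        v = beta * v + grad(w)   # exponentially decaying running sum of gradients
        w = w - stepsize * v
    return w

# Minimize ||w||^2, whose gradient is 2w; converges to [0, 0].
print(sgd_momentum(lambda w: 2 * w, [5.0, -3.0], stepsize=0.1, numIters=100))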

SLIDE 48

Gradient Descent Details

11×10^{-2} with 0.25 momentum vs. 11×10^{-2} raw gradients (oscillates)

SLIDE 49

Gradient Descent Details

Multiple minima → gradient descent finds a local minimum.

SLIDE 50

Gradient Descent Details

Guess the minimum! [Figure: a starting point and four candidate minima, labeled 1-4.]

SLIDE 51

Gradient Descent Details

SLIDE 52

Gradient Descent Details

Guess the minimum! [Figure: the same starting point and candidate minima, labeled 1-4.]

SLIDE 53

Gradient Descent Details

The dynamics are fairly complex. Many important functions are convex: any local minimum is a global minimum. Many important functions are not.

SLIDE 54

In practice

  • Conventional wisdom: minibatch stochastic gradient descent (SGD) + momentum (the package implements it for you) + some sensibly changing learning rate.
  • The above is typically what is meant by "SGD".
  • Other update rules exist; the benefits are in general not clear (sometimes better, sometimes worse).

SLIDE 55

Optimizing Everything

  • Optimize w on the training set with SGD to maximize training accuracy.
  • Optimize λ with random/grid search to maximize validation accuracy.
  • Note: optimizing λ on the training set drives it to 0.

L(W) = \lambda \|W\|_2^2 + \sum_{i=1}^n -\log\left( \frac{\exp((Wx_i)_{y_i})}{\sum_j \exp((Wx_i)_j)} \right)

L(w) = \lambda \|w\|_2^2 + \sum_{i=1}^n (y_i - w^T x_i)^2

SLIDE 56

(Over/Under)fitting and Complexity

Let's fit a polynomial: given x, predict y. Note: we can do non-linear regression with copies of x raised to different powers.

\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} =
\begin{bmatrix} x_1^d & \cdots & x_1^2 & x_1 & 1 \\ \vdots & & \vdots & \vdots & \vdots \\ x_N^d & \cdots & x_N^2 & x_N & 1 \end{bmatrix}
\begin{bmatrix} w_d \\ \vdots \\ w_1 \\ w_0 \end{bmatrix}

Weights: one per polynomial degree. Matrix: all polynomial degrees of each x.
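A minimal sketch of building this design matrix and solving the (unregularized) least-squares fit with NumPy; the data is generated from the next slide's model just for illustration:

import numpy as np

# Made-up 1-D data in the style of the slide's model: y = 1.5x^2 + 2.3x + 2 + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=20)
y = 1.5 * x**2 + 2.3 * x + 2 + rng.normal(scale=0.5, size=20)

degree = 2
A = np.vander(x, degree + 1)          # columns [x^2, x, 1]: one column per power
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)                               # roughly [1.5, 2.3, 2.0]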

SLIDE 57

(Over/Under)fitting and Complexity

Model: 1.5x^2 + 2.3x + 2 + N(0, 0.5)

SLIDE 58

Underfitting

Model: 1.5x^2 + 2.3x + 2 + N(0, 0.5)

SLIDE 59

Underfitting

The model doesn't have the parameters to fit the data. Bias (statistics): error intrinsic to the model.

SLIDE 60

Overfitting

Model: 1.5x^2 + 2.3x + 2 + N(0, 0.5)

SLIDE 61

Overfitting

Model has high variance: remove one point, and model changes dramatically

SLIDE 62

(Continuous) Model Complexity

\arg\min_W \lambda \|W\|_2^2 + \sum_{i=1}^n -\log\left( \frac{\exp((Wx_i)_{y_i})}{\sum_j \exp((Wx_i)_j)} \right)

Regularization: a penalty for a complex model. Second term: pay a penalty for the negative log-likelihood of the correct class.

Intuitively: big weights = a more complex model.
Model 1: 0.01*x1 + 1.3*x2 + -0.02*x3 + -2.1*x4 + 10
Model 2: 37.2*x1 + 13.4*x2 + 5.6*x3 + -6.1*x4 + 30

SLIDE 63

Fitting Model

Again fitting a polynomial, but with regularization:

\arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2, \quad
X = \begin{bmatrix} x_1^d & \cdots & x_1^2 & x_1 & 1 \\ \vdots & & \vdots & \vdots & \vdots \\ x_N^d & \cdots & x_N^2 & x_N & 1 \end{bmatrix}

SLIDE 64

Adding Regularization

No regularization: fits all data points. Regularization: can't fit all data points.
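A minimal sketch of the comparison: a high-degree polynomial fit with and without the L2 penalty (the data and λ values are made up; with λ = 0 the fit can chase every point, with a larger λ it cannot):

import numpy as np

def fit_poly_ridge(x, y, degree, lam):
    """Polynomial least squares with an L2 penalty, solved via an augmented system."""
    A = np.vander(x, degree + 1)                         # columns [x^d, ..., x, 1]
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(degree + 1)])
    y_aug = np.concatenate([y, np.zeros(degree + 1)])
    return np.linalg.lstsq(A_aug, y_aug, rcond=None)[0]

# Made-up noisy quadratic data; degree 9 with 10 points can interpolate exactly.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=10)
y = 1.5 * x**2 + 2.3 * x + 2 + rng.normal(scale=0.5, size=10)
print(fit_poly_ridge(x, y, degree=9, lam=0.0))    # large, wiggly coefficients
print(fit_poly_ridge(x, y, degree=9, lam=10.0))   # smaller, smoother coefficients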

SLIDE 65

In General

Error on new data comes from a combination of:

  • 1. Bias: the model is oversimplified and can't fit the underlying data.
  • 2. Variance: you don't have the ability to estimate your model from limited data.
  • 3. Inherent: the data is intrinsically difficult.

Bias and variance trade off. Fixing one hurts the other.
SLIDE 66

Underfitting and Overfitting

[Figure, adapted from D. Hoiem: error vs. model complexity. Training error keeps dropping as complexity grows; test error is high at low complexity (high bias / low variance: underfitting!) and high again at high complexity (low bias / high variance).]

SLIDE 67

Underfitting and Overfitting

[Figure, adapted from D. Hoiem: test error vs. model complexity for different dataset sizes. With small data the model overfits with a small model; with big data it only overfits with a bigger model. Low complexity = high bias / low variance; high complexity = low bias / high variance.]

SLIDE 68

Underfitting

Do poorly on both training and validation data due to bias. Solution:

  • 1. More features
  • 2. More powerful model
  • 3. Reduce regularization
SLIDE 69

Overfitting

Do well on training data, but poorly on validation data, due to variance. Solution:

  • 1. More data
  • 2. Less powerful model
  • 3. Regularize your model more

Cris Dima rule: first make sure you can overfit, then stop overfitting.

SLIDE 70

Next Class

  • Non-linear models (neural nets)
SLIDE 71

SLIDE 72

Let's Compute Another Gradient

  • Below is another derivation that's worth looking at on your own time if you're curious.

SLIDE 73

Computing The Gradient

Multiclass Support Vector Machine (derivation setup: Karpathy and Fei-Fei):

\arg\min_W \lambda \|W\|_2^2 + \sum_{i=1}^n \sum_{k \neq y_i} \max(0, (Wx_i)_k - (Wx_i)_{y_i} + m)

Notation: write W in terms of its rows w_k (i.e., one per-class scorer), so (Wx_i)_j = w_j^T x_i. With C classes, the objective becomes:

\arg\min_W \lambda \sum_{k=1}^{C} \|w_k\|_2^2 + \sum_{i=1}^n \sum_{k \neq y_i} \max(0, w_k^T x_i - w_{y_i}^T x_i + m)

SLIDE 74

Computing The Gradient

\arg\min_W \lambda \sum_{k=1}^{C} \|w_k\|_2^2 + \sum_{i=1}^n \sum_{k \neq y_i} \max(0, w_k^T x_i - w_{y_i}^T x_i + m)

Gradient of one data point's hinge term with respect to w_k (for k ≠ y_i):

  • If w_k^T x_i - w_{y_i}^T x_i + m ≤ 0: the max is 0, so the contribution is 0.
  • If w_k^T x_i - w_{y_i}^T x_i + m > 0: the contribution is x_i.

Compactly: \partial / \partial w_k = \mathbb{1}(w_k^T x_i - w_{y_i}^T x_i + m > 0) \, x_i

Derivation setup: Karpathy and Fei-Fei

SLIDE 75

Computing The Gradient

\arg\min_W \lambda \sum_{k=1}^{C} \|w_k\|_2^2 + \sum_{i=1}^n \sum_{k \neq y_i} \max(0, w_k^T x_i - w_{y_i}^T x_i + m)

Gradient of one data point's hinge terms with respect to the correct class's row w_{y_i}:

\partial / \partial w_{y_i} = \sum_{k \neq y_i} \mathbb{1}(w_k^T x_i - w_{y_i}^T x_i + m > 0) \, (-x_i)

Derivation setup: Karpathy and Fei-Fei

SLIDE 76

Interpreting The Gradient

If we do not predict the correct class by at least a score difference of m, we want the incorrect class's scoring vector to score that point lower.

Negative gradient for an incorrect class k: -\partial / \partial w_k = \mathbb{1}(w_k^T x_i - w_{y_i}^T x_i + m > 0) \, (-x_i)

Recall: before the update the score is w^T x; after, it is (w - \alpha x)^T x = w^T x - \alpha x^T x, i.e., lower.

Derivation setup: Karpathy and Fei-Fei