
Linear models

Subhransu Maji
CMPSCI 670: Computer Vision
November 3, 2016

A neuron (or how our brains work)

Neuroscience 101

Perceptron

Inputs are feature values. Each feature has a weight. The sum is the activation:

$$\text{activation}(w, x) = \sum_i w_i x_i = w^\top x$$

If the activation is:

  • > b, output class 1
  • otherwise, output class 2

[Figure: inputs $x_1, x_2, x_3$, weighted by $w_1, w_2, w_3$, feed a summation unit $\Sigma$ whose output is compared against $b$.]

The bias can be folded into the weights: $x \to (x, 1)$ and $w^\top x + b \to (w, b)^\top (x, 1)$.

Example: Spam

Imagine 3 features (spam is the "positive" class):

  • free (number of occurrences of "free")
  • money (number of occurrences of "money")
  • BIAS (intercept, always has value 1)

For an email with feature vector $x$ and weights $w$, compute $w^\top x$: if $w^\top x > 0$, predict SPAM!!
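The decision is a single dot product. A minimal MATLAB sketch, with made-up counts and weights:

  % Spam scoring with the three features above (hypothetical values).
  x = [2; 1; 1];       % [free; money; BIAS] for one email
  w = [4; 2; -3];      % hypothetical learned weights
  score = w' * x;      % 4*2 + 2*1 + (-3)*1 = 7
  if score > 0
      disp('SPAM!!');
  else
      disp('HAM');
  end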


Geometry of the perceptron

In the space of feature vectors:

  • examples are points (in D dimensions)
  • a weight vector is a hyperplane (a D−1 dimensional object)
  • one side corresponds to y = +1
  • the other side corresponds to y = −1

Perceptrons are also called linear classifiers.

[Figure: the weight vector $w$ is normal to the decision boundary $w^\top x = 0$.]

Learning a perceptron

Perceptron training algorithm [Rosenblatt 57]

Input: training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Initialize $w \leftarrow [0, \ldots, 0]$

for iter = 1, …, T

  • for i = 1, …, n
  • predict according to the current model: $\hat{y}_i = \begin{cases} +1 & \text{if } w^\top x_i > 0 \\ -1 & \text{if } w^\top x_i \le 0 \end{cases}$
  • if $y_i = \hat{y}_i$, no change
  • else, $w \leftarrow w + y_i x_i$

The algorithm is error driven and online: each update increases the activation $y_i w^\top x_i$ for the misclassified example. In practice, randomize the order of the examples.
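A direct MATLAB sketch of this training loop (the function name and the ±1 label encoding are my own choices):

  % Perceptron training [Rosenblatt 57], as in the pseudocode above.
  % X is n-by-D (one example per row, bias feature included),
  % y is n-by-1 with entries +1/-1, T is the number of passes.
  function w = trainPerceptron(X, y, T)
      [n, D] = size(X);
      w = zeros(D, 1);                     % w <- [0, ..., 0]
      for iter = 1:T
          for i = randperm(n)              % randomize the order
              if X(i,:) * w > 0, yhat = 1; else yhat = -1; end
              if y(i) ~= yhat
                  w = w + y(i) * X(i,:)';  % error-driven update
              end
          end
      end
  end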

Properties of perceptrons

Separability: some parameters will classify the training data perfectly

Convergence: if the training data is separable then the perceptron training will eventually converge [Block 62, Novikoff 62]

Mistake bound: the maximum number of mistakes is related to the margin δ:

$$\#\text{mistakes} < \frac{1}{\delta^2}$$

assuming $\|x_i\| \le 1$, where $\delta = \max_w \min_{(x_i, y_i)} \left[ y_i w^\top x_i \right]$ such that $\|w\| = 1$.

Limitations of perceptrons

Convergence: if the data isn't separable, the training algorithm may not terminate

  • noise can cause this
  • some simple functions are not separable (e.g., XOR)

Mediocre generalization: the algorithm finds a solution that "barely" separates the data

Overtraining: test/validation accuracy rises and then falls

  • overtraining is a kind of overfitting


Overview

Linear models

  • Perceptron: model and learning algorithm combined as one
  • Is there a better way to learn linear models?

We will separate models and learning algorithms:

  • Learning as optimization
  • Surrogate loss functions (model design)
  • Regularization (model design)
  • Gradient descent (optimization)
  • Batch and online gradients (optimization)
  • Subgradient descent (optimization)
  • Support vector machines

Learning as optimization

$$\min_w \underbrace{\sum_n 1[y_n w^\top x_n < 0]}_{\text{fewest mistakes}} + \lambda R(w)$$

The perceptron algorithm will find an optimal w if the data is separable

  • efficiency depends on the margin and the norm of the data

However, if the data is not separable, optimizing this is NP-hard

  • i.e., there is no efficient way to minimize this unless P = NP

Learning as optimization

In addition to minimizing training error, we want a simpler model

  • Remember, our goal is to minimize generalization error
  • Recall the bias and variance tradeoff for learners

We can add a regularization term R(w) that prefers simpler models

  • For example, we may prefer decision trees of shallow depth

Here λ is a hyperparameter of the optimization problem:

$$\min_w \underbrace{\sum_n 1[y_n w^\top x_n < 0]}_{\text{fewest mistakes}} + \underbrace{\lambda}_{\text{hyperparameter}} \underbrace{R(w)}_{\text{simpler model}}$$

Learning as optimization

The questions that remain are:

  • What are good ways to adjust the optimization problem so that there are efficient algorithms for solving it?
  • What are good regularizers R(w) for hyperplanes?
  • Assuming that the optimization problem can be adjusted appropriately, what algorithms exist for solving the regularized optimization problem?

$$\min_w \underbrace{\sum_n 1[y_n w^\top x_n < 0]}_{\text{fewest mistakes}} + \underbrace{\lambda}_{\text{hyperparameter}} \underbrace{R(w)}_{\text{simpler model}}$$


Convex surrogate loss functions

Zero/one loss is hard to optimize

  • Small changes in w can cause large changes in the loss

Surrogate loss: replace the zero/one loss by a smooth function

  • Easier to optimize if the surrogate loss is convex

Examples:

[Plot: loss as a function of the prediction $\hat{y} \leftarrow w^\top x$ for $y = +1$; curves for the zero/one, hinge, logistic, exponential, and squared losses.]

Weight regularization

What are good regularization functions R(w) for hyperplanes? We would like the weights:

  • To be small
    ➡ Small changes in the features cause small changes to the score
    ➡ Robustness to noise
  • To be sparse
    ➡ Use as few features as possible
    ➡ Similar to controlling the depth of a decision tree

This is a form of inductive bias.

Weight regularization

Just like the surrogate loss function, we would like R(w) to be convex.

Small weights regularization:

$$R^{\text{(norm)}}(w) = \sqrt{\sum_d w_d^2} \qquad R^{\text{(sqrd)}}(w) = \sum_d w_d^2$$

Sparsity regularization (not convex):

$$R^{\text{(count)}}(w) = \sum_d 1[|w_d| > 0]$$

Family of "p-norm" regularizers:

$$R^{\text{(p-norm)}}(w) = \left( \sum_d |w_d|^p \right)^{1/p}$$
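For a concrete weight vector these regularizers are one-liners; a small MATLAB sketch (the example w and p are arbitrary):

  % The regularizers above, written out directly.
  w = [0.5; 0; -1.2; 0.3];
  R_norm  = sqrt(sum(w.^2));        % L2 norm
  R_sqrd  = sum(w.^2);              % squared L2 norm
  R_count = sum(abs(w) > 0);        % number of non-zeros (not convex)
  p = 1.5;
  R_pnorm = sum(abs(w).^p)^(1/p);   % general p-norm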

Contours of p-norms

[Figure: contours of the p-norm in 2D; convex for p ≥ 1.]

http://en.wikipedia.org/wiki/Lp_space


Contours of p-norms

[Figure: contours of the p-norm for p = 2/3 and p = 0; not convex for 0 ≤ p < 1.]

Counting non-zeros: $R^{\text{(count)}}(w) = \sum_d 1[|w_d| > 0]$

http://en.wikipedia.org/wiki/Lp_space

General optimization framework

Select a suitable:

  • convex surrogate loss
  • convex regularizer

Select the hyperparameter λ. Minimize the regularized objective with respect to w:

$$\min_w \underbrace{\sum_n \ell\left(y_n, w^\top x_n\right)}_{\text{surrogate loss}} + \underbrace{\lambda}_{\text{hyperparameter}} \underbrace{R(w)}_{\text{regularization}}$$

This framework for optimization is called Tikhonov regularization or, more generally, Structural Risk Minimization (SRM).

http://en.wikipedia.org/wiki/Tikhonov_regularization

Optimization by gradient descent

[Figure: gradient descent iterates $p_1, p_2, \ldots$ with step sizes $\eta_1, \eta_2, \eta_3$ on a convex function (local optima = global optima) and on a non-convex function (local optima ≠ global optima).]

Compute the gradient at the current location:

$$g^{(k)} \leftarrow \nabla_p F(p)\big|_{p_k}$$

Take a step down the gradient, with step size $\eta_k$:

$$p_{k+1} \leftarrow p_k - \eta_k\, g^{(k)}$$

Choice of step size

[Figure: iterates $p_1, \ldots, p_6$ with a good step size (steady progress toward the minimum) and a bad step size (overshooting and oscillation).]

The step size is important:

  • too small: slow convergence
  • too large: no convergence

A strategy is to use large step sizes initially and small step sizes later (see the sketch below):

$$\eta_t \leftarrow \eta_0 / (t_0 + t)$$

There are methods that converge faster by adapting the step size to the curvature of the function

  • the field of convex optimization: http://stanford.edu/~boyd/cvxbook/
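A minimal MATLAB sketch of gradient descent with this decaying schedule, on a made-up one-dimensional convex function (all constants are for illustration only):

  % Gradient descent with eta_t = eta0/(t0 + t).
  F     = @(p) (p - 3).^2;        % convex function, minimum at p = 3
  gradF = @(p) 2 * (p - 3);       % its gradient
  p = 0; eta0 = 1; t0 = 10;
  for t = 1:100
      eta = eta0 / (t0 + t);      % large steps early, small steps later
      g = gradF(p);               % compute gradient at current location
      p = p - eta * g;            % take a step down the gradient
  end
  fprintf('p = %.4f, F(p) = %.6f\n', p, F(p));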


Example: Exponential loss

Objective:

$$L(w) = \sum_n \exp(-y_n w^\top x_n) + \frac{\lambda}{2}\|w\|^2$$

Gradient:

$$\frac{dL}{dw} = \sum_n -y_n x_n \exp(-y_n w^\top x_n) + \lambda w$$

Gradient update:

$$w \leftarrow w - \eta\left(\sum_n -y_n x_n \exp(-y_n w^\top x_n) + \lambda w\right)$$

Loss term: per example, $w \leftarrow w + c\, y_n x_n$ with $c = \eta \exp(-y_n w^\top x_n)$, which is high for misclassified points; similar to the perceptron update rule!

Regularization term: $w \leftarrow (1 - \eta\lambda)\, w$, which shrinks the weights towards zero.
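Turning the gradient update into code: a batch gradient descent sketch on this objective, with toy data, step size, and iteration count invented for illustration:

  % Batch gradient descent on the exponential loss objective above.
  % X is n-by-D (last column is the bias), y is n-by-1 with +1/-1.
  X = [ 2  1  1;  0  3  1; -1 -2  1; -3  0  1];
  y = [ 1;  1; -1; -1];
  lambda = 0.1; eta = 0.05;
  w = zeros(3, 1);
  for t = 1:200
      m = exp(-y .* (X * w));             % exp(-y_n w'x_n) per example
      gLoss = -X' * (y .* m);             % sum_n -y_n x_n exp(-y_n w'x_n)
      w = w - eta * (gLoss + lambda * w); % gradient step
  end
  disp(w');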

Batch and online gradients

Objective:

$$L(w) = \sum_n L_n(w)$$

Gradient descent:

$$w \leftarrow w - \eta\, \frac{dL}{dw}$$

Batch gradient (the sum of n gradients): update the weights after you see all points

$$w \leftarrow w - \eta \left(\sum_n \frac{dL_n}{dw}\right)$$

Online gradient (the gradient at the nth point): update the weights after you see each point

$$w \leftarrow w - \eta \left(\frac{dL_n}{dw}\right)$$

Online gradients are the default method for multi-layer perceptrons.
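The difference between the two schemes is only where the update sits relative to the loop over points. A sketch reusing the exponential loss as the per-point objective $L_n$ (same toy data as above; regularization omitted to keep the contrast clear):

  % Batch vs. online updates for L(w) = sum_n L_n(w).
  X = [2 1 1; 0 3 1; -1 -2 1; -3 0 1];  y = [1; 1; -1; -1];
  eta = 0.05;
  gradLn = @(w, xn, yn) -yn * xn * exp(-yn * (xn' * w));

  % Batch gradient: update after you see all points.
  w = zeros(3, 1);
  for t = 1:100
      g = zeros(3, 1);
      for n = 1:4
          g = g + gradLn(w, X(n,:)', y(n));    % sum of n gradients
      end
      w = w - eta * g;
  end

  % Online gradient: update after you see each point.
  w = zeros(3, 1);
  for t = 1:100
      for n = 1:4
          w = w - eta * gradLn(w, X(n,:)', y(n));  % gradient at nth point
      end
  end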

Subgradient

$$\ell^{\text{(hinge)}}(y, w^\top x) = \max(0, 1 - y w^\top x)$$

The hinge loss is not differentiable at $z = 1$, where $z = y w^\top x$. A subgradient is any direction that is below the function. For the hinge loss, a possible subgradient is:

$$\frac{d\ell^{\text{(hinge)}}}{dw} = \begin{cases} 0 & \text{if } y w^\top x > 1 \\ -yx & \text{otherwise} \end{cases}$$

[Figure: the hinge loss as a function of $z = y w^\top x$, with a subgradient drawn at the kink at $z = 1$.]

Example: Hinge loss

Objective:

$$L(w) = \sum_n \max(0, 1 - y_n w^\top x_n) + \frac{\lambda}{2}\|w\|^2$$

Subgradient:

$$\frac{dL}{dw} = \sum_n -1[y_n w^\top x_n \le 1]\, y_n x_n + \lambda w$$

Update:

$$w \leftarrow w - \eta\left(\sum_n -1[y_n w^\top x_n \le 1]\, y_n x_n + \lambda w\right)$$

Loss term: $w \leftarrow w + \eta\, y_n x_n$, applied only for points with $y_n w^\top x_n \le 1$ (the perceptron update applies only when $y_n w^\top x_n \le 0$).

Regularization term: $w \leftarrow (1 - \eta\lambda)\, w$, which shrinks the weights towards zero.
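The same toy setup works for the hinge subgradient; note that only points at or inside the margin contribute to the update:

  % Subgradient descent on the regularized hinge loss above.
  X = [ 2  1  1;  0  3  1; -1 -2  1; -3  0  1];
  y = [ 1;  1; -1; -1];
  lambda = 0.1; eta = 0.05;
  w = zeros(3, 1);
  for t = 1:200
      active = (y .* (X * w)) <= 1;          % margin violations
      gLoss = -X' * (y .* active);           % sum over active points only
      w = w - eta * (gLoss + lambda * w);    % subgradient step
  end
  disp(w');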


Example: Squared loss

Objective:

$$L(w) = \sum_n \left( y_n - w^\top x_n \right)^2 + \frac{\lambda}{2}\|w\|^2$$

In matrix notation the equivalent loss is $L(w) = \|y - Xw\|^2 + \frac{\lambda}{2}\|w\|^2$, where the rows of X are the $x_n^\top$.

Example: Squared loss

Objective: the regularized squared loss above.

Gradient:

$$\frac{dL}{dw} = \sum_n -2\left(y_n - w^\top x_n\right) x_n + \lambda w$$

At the optimum the gradient = 0, which gives an exact closed-form solution.
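The closed form follows by setting the gradient of the matrix-form objective to zero; a sketch of the derivation (using this deck's $\lambda/2$ convention on the regularizer, hence the factor of 2):

  % In LaTeX:
  \begin{align*}
  \nabla_w L &= -2X^\top (y - Xw) + \lambda w = 0 \\
  \left( 2X^\top X + \lambda I \right) w &= 2X^\top y \\
  w &= \left( X^\top X + \tfrac{\lambda}{2} I \right)^{-1} X^\top y
  \end{align*}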

Matrix inversion vs. gradient descent

Assume we have D features and N points.

Overall time via matrix inversion:

  • The closed-form solution involves computing $(X^\top X + \tfrac{\lambda}{2} I)^{-1} X^\top y$
  • Total time is O(D²N + D³ + DN), assuming O(D³) matrix inversion
  • If N > D, then the total time is O(D²N)

Overall time via gradient descent:

  • Gradient: $\frac{dL}{dw} = \sum_n -2(y_n - w^\top x_n)\, x_n + \lambda w$
  • Each iteration: O(ND); T iterations: O(TND)

Which one is faster?

  • Small problems (D < 100): probably faster to run matrix inversion
  • Large problems (D > 10,000): probably faster to run gradient descent
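A quick sketch checking that both routes reach the same w on random data (sizes, λ, step size, and iteration count are arbitrary choices):

  % Closed-form vs. gradient-descent solutions for the squared loss.
  X = randn(100, 3);  wTrue = [1; -2; 0.5];
  y = X * wTrue + 0.1 * randn(100, 1);
  lambda = 0.1;

  % Closed form: w = (X'X + (lambda/2) I) \ X'y
  wExact = (X' * X + (lambda/2) * eye(3)) \ (X' * y);

  % Gradient descent on the same objective.
  w = zeros(3, 1); eta = 1e-3;
  for t = 1:5000
      g = -2 * X' * (y - X * w) + lambda * w;
      w = w - eta * g;
  end
  fprintf('max difference: %.2e\n', max(abs(w - wExact)));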

Optimization for linear models

Under suitable conditions*, provided you pick the step sizes appropriately, the convergence rate of gradient descent is O(1/N)

  • i.e., if you want a solution within 0.001 of the optimal, you have to run gradient descent for N = 1000 iterations

For linear models (hinge/logistic/exponential loss) and squared-norm regularization there are off-the-shelf solvers that are fast in practice: SVMperf, LIBLINEAR, PEGASOS

  • SVMperf and LIBLINEAR use a different optimization method

* assuming the function is strongly convex


Feature normalization

Even if a feature is useful, some normalization may be good.

Per-feature normalization:

  • Centering: $x_{n,d} \leftarrow x_{n,d} - \mu_d$, where $\mu_d = \frac{1}{N}\sum_n x_{n,d}$
  • Variance scaling: $x_{n,d} \leftarrow x_{n,d}/\sigma_d$, where $\sigma_d = \sqrt{\frac{1}{N}\sum_n (x_{n,d} - \mu_d)^2}$
  • Absolute scaling: $x_{n,d} \leftarrow x_{n,d}/r_d$, where $r_d = \max_n |x_{n,d}|$
  • Non-linear transformation
    ➡ square-root: $x_{n,d} \leftarrow \sqrt{x_{n,d}}$ (corrects for burstiness)

Per-example normalization:

  • fixed norm for each example: $\|x\| = 1$

Caltech-101 image classification: 41.6% (linear) vs. 63.8% (square-root).
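The rules above written out in MATLAB (random non-negative data so that the square-root transform applies; repmat keeps it compatible with pre-R2016b releases):

  % Per-feature and per-example normalization, as defined above.
  % X is N-by-D with one example per row.
  X = abs(randn(50, 4));

  mu    = mean(X, 1);                 % mu_d
  sigma = std(X, 1, 1);               % sigma_d (1/N normalization)
  r     = max(abs(X), [], 1);         % r_d

  Xcentered = X - repmat(mu, 50, 1);      % centering
  Xvar      = X ./ repmat(sigma, 50, 1);  % variance scaling
  Xabs      = X ./ repmat(r, 50, 1);      % absolute scaling
  Xsqrt     = sqrt(X);                    % square-root (needs X >= 0)

  % Per-example: scale each row to unit norm, ||x|| = 1.
  norms = sqrt(sum(X.^2, 2));
  Xunit = X ./ repmat(norms, 1, 4);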

Slides credit

Figures of the various "p-norms" are from Wikipedia

  • http://en.wikipedia.org/wiki/Lp_space

Some of the slides are based on the CIML book by Hal Daumé III.

Appendix: code for surrogateLoss

  % Code to plot various loss functions
  y1 = 1;
  y2 = linspace(-2, 3, 500);
  zeroOneLoss  = y1*y2 <= 0;
  hingeLoss    = max(0, 1 - y1*y2);
  logisticLoss = log(1 + exp(-y1*y2)) / log(2);
  expLoss      = exp(-y1*y2);
  squaredLoss  = (y1 - y2).^2;

  % Plot them
  figure(1); clf; hold on;
  plot(y2, zeroOneLoss,  'k-', 'LineWidth', 1);
  plot(y2, hingeLoss,    'b-', 'LineWidth', 1);
  plot(y2, logisticLoss, 'r-', 'LineWidth', 1);
  plot(y2, expLoss,      'g-', 'LineWidth', 1);
  plot(y2, squaredLoss,  'm-', 'LineWidth', 1);
  xlabel('Prediction', 'FontSize', 16);  % x-axis holds the predictions y2
  ylabel('Loss', 'FontSize', 16);        % y-axis holds the loss values
  legend({'Zero/one', 'Hinge', 'Logistic', 'Exponential', 'Squared'}, ...
         'Location', 'NorthEast', 'FontSize', 16);
  box on;

[Output: plot of the five loss functions against the prediction, as on the surrogate-loss slide.]