CS446 Introduction to Machine Learning (Spring 2015)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446

Lecture 4: Linear Classifiers
Prof. Julia Hockenmaier
juliahmr@illinois.edu
Announcements

Homework 1 will be out after class:
http://courses.engr.illinois.edu/cs446/Homework/HW1.pdf
You have two weeks to complete the assignment.
Late policy:
– You have up to two days of late credit for the whole semester, but we don't give partial late credit: handing in one assignment up to 24 hours late uses one of your two late days.
– We don't accept assignments that are more than 48 hours late.
Is everybody on Compass?
https://compass2g.illinois.edu/
Let us know if you can't see our class.
Recap: decision trees for (binary) classification
– Non-linear classifiers
Learning decision trees (ID3 algorithm):
– Greedy heuristic (based on information gain)
– Originally developed for discrete features
Overfitting:
– What is it? How do we deal with it?
Learning linear classifiers

Batch algorithms:
– Gradient descent for least-mean-squares (LMS) error
Online algorithms:
– Stochastic gradient descent
Linear classifiers are defined over vector spaces.
Every hypothesis f(x) is a hyperplane:
    f(x) = w0 + w·x
The set of points where f(x) = 0 is called the decision boundary:
– Assign ŷ = +1 to all x where f(x) > 0
– Assign ŷ = -1 to all x where f(x) < 0
That is, ŷ = sgn(f(x)).
[Figure: the decision boundary f(x) = 0 in the (x1, x2) plane, with the regions f(x) > 0 and f(x) < 0 on either side]
[Figure: 2×2 truth tables for all 16 Boolean functions of two variables x1, x2; all of them are linearly separable except XOR and XNOR]
With w = (w1, …, wN)ᵀ and x = (x1, …, xN)ᵀ:
    f(x) = w0 + w·x = w0 + Σi=1…N wi·xi
w0 is called the bias term.
The canonical representation folds the bias into the weight vector, redefining
    w = (w0, w1, …, wN)ᵀ and x = (1, x1, …, xN)ᵀ
so that f(x) = w·x.
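To make the canonical representation concrete, here is a minimal NumPy sketch (my illustration, not the course's code; the function names and the AND example are assumptions):

import numpy as np

def add_bias(X):
    # Canonical representation: prepend a constant-1 feature to every input,
    # so the bias w0 becomes an ordinary weight.
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(w, X):
    # Assign y_hat = sgn(f(x)) with f(x) = w.x in canonical form.
    return np.sign(add_bias(X) @ w)

# Hypothetical example: w = (w0, w1, w2) = (-1.5, 1, 1) computes x1 AND x2
# on inputs from {0, 1}^2, since x1 + x2 - 1.5 > 0 only for (1, 1).
w = np.array([-1.5, 1.0, 1.0])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(predict(w, X))   # [-1. -1. -1.  1.]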
[Figure: labeled training data plotted in the (x1, x2) plane, with a decision boundary f(x) = 0 separating the two classes]

Input: labeled training data D = {(x1, y1), …, (xD, yD)} in the sample space X = R², where each yi ∈ {+1, -1}.
Output: a decision boundary f(x) = 0 that separates the training data: yi·f(xi) > 0 for all i.
We need a metric (aka an objective function).
We would like to minimize the probability of misclassifying unseen examples, but we can't measure that probability.
Instead: minimize the number of misclassified training examples.
We need a more specific metric: There may be many models that are consistent with the training data. Loss functions provide such metrics.
An example (x, y) is correctly classified by f(x) if and only if y·f(x) > 0:
Case 1 (y = +1 = ŷ): f(x) > 0 ⇒ y·f(x) > 0
Case 2 (y = -1 = ŷ): f(x) < 0 ⇒ y·f(x) > 0
Case 3 (y = +1 ≠ ŷ = -1): f(x) < 0 ⇒ y·f(x) < 0
Case 4 (y = -1 ≠ ŷ = +1): f(x) > 0 ⇒ y·f(x) < 0
Loss: what penalty do we incur if we misclassify x?
L(y, f(x)) is the loss (aka cost) of classifier f when it assigns the label ŷ = sgn(f(x)) to x.
Plots of L(y, f(x)): the x-axis is typically y·f(x).
Today: 0-1 loss and square loss (more loss functions later).
[Figure: the 0-1 loss plotted as a function of y·f(x)]
0-1 loss:
L(y, f(x)) = 0 iff y = ŷ
L(y, f(x)) = 1 iff y ≠ ŷ
Equivalently, as a function of y·f(x):
L(y·f(x)) = 0 iff y·f(x) > 0 (correctly classified)
L(y·f(x)) = 1 iff y·f(x) < 0 (misclassified)
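In code, the 0-1 loss is a one-liner (a sketch in Python, not from the slides):

def zero_one_loss(y, fx):
    # 0 if (x, y) is correctly classified, i.e. y * f(x) > 0; otherwise 1.
    return 0 if y * fx > 0 else 1

print(zero_one_loss(+1, 0.7))   # 0: correctly classified
print(zero_one_loss(-1, 0.7))   # 1: misclassified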
Square loss:
L(y, f(x)) = (y − f(x))²
Note: L(-1, f(x)) = (-1 − f(x))² = (1 + f(x))² = L(1, -f(x))
(the loss curve for y = -1 is the mirror image of the loss curve for y = +1)
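A quick Python check of the square loss and its mirror property (my sketch, with made-up test values):

def square_loss(y, fx):
    # Square loss: (y - f(x))^2
    return (y - fx) ** 2

# The mirror property from the note above: L(-1, f(x)) == L(1, -f(x))
for fx in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert square_loss(-1, fx) == square_loss(1, -fx)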
[Figure: the square loss plotted as a function of f(x), for y = +1 and y = -1, and as a function of y·f(x)]
[Figure: the 0-1 loss and the square loss compared as functions of y·f(x)]
Linear classification: the hypothesis space is parameterized by w.
Plain English: each w yields a different classifier.
Error, loss, and risk are all functions of w.
[Figure: the (empirical) error plotted over the hypothesis space, with the global minimum marked]
Learning = finding the global minimum of the loss surface
[Figure: an error surface over the hypothesis space with a global minimum, a plateau, and a local minimum]
Finding the global minimum in general is hard
[Figure: a convex error surface over the hypothesis space; its only minimum is the global minimum]
Convex functions have no local minima (every local minimum is also a global minimum)
The risk (aka generalization error) of a classifier f(x) = w·x is its expected loss under the data distribution (i.e. the loss averaged over all possible data sets):
    R(f) = E(x,y)~P(X,Y)[ L(y, f(x)) ]
Ideal learning objective: find an f that minimizes the risk.
We always assume that training and test items are independently and identically distributed (i.i.d.):
– There is a distribution P(X, Y) from which the data D = {(x, y)} is generated. Sometimes it's useful to rewrite P(X, Y) as P(X)·P(Y|X). Usually P(X, Y) is unknown to us (we just know it exists).
– Training and test data are samples drawn from the same P(X, Y): they are identically distributed.
– Each (x, y) is drawn independently from P(X, Y).
The empirical risk of a classifier f(x) = w·x is its average loss on the items in D:
    RD(f) = (1/D) Σi=1…D L(yi, f(xi))
Realistic learning objective: find an f that minimizes the empirical risk.
(Note that the learner can ignore the constant 1/D.)
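As a sketch (mine, with illustrative data), the empirical risk is just the mean loss over D:

import numpy as np

def empirical_risk(w, X, y, loss):
    # Average loss of the linear classifier f(x) = w.x over the training set D.
    # X is assumed to be in canonical form (first column all 1s).
    fx = X @ w
    return np.mean([loss(yi, fi) for yi, fi in zip(y, fx)])

# Example with the 0-1 loss on made-up data:
zero_one = lambda yi, fi: 0.0 if yi * fi > 0 else 1.0
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([+1, -1, -1])
print(empirical_risk(np.array([0.0, 1.0]), X, y, zero_one))   # 0.333... (1 of 3 misclassified)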
Learning: given training data D = {(x1, y1), …, (xD, yD)}, return the classifier f(x) that minimizes the empirical risk RD(f).
Gradient descent is an iterative batch learning algorithm:
– The learner updates the hypothesis based on the entire training data.
– The learner has to go over the training data multiple times.
Goal: minimize the training error/loss.
– At each step: move w in the direction of steepest descent along the error/loss surface.
[Figure: the error surface Error(w) plotted over weight space, with successive iterates w1, w2, w3, w4 descending toward the minimum]
Error(w): error of w on the training data. wi: weight vector at iteration i.
LMS error: the sum of the square loss over all training items (multiplied by 1/2 for convenience):
    Err(w) = (1/2) Σd∈D (yd − f(xd))²
D is fixed, so there is no need to divide by its size.
Goal of learning: find w* = argminw Err(w).
Initialization: initialize w0 (the initial weight vector).
For each iteration i = 0…T:
    Determine by how much to change w, based on the entire data set D:
        Δw = computeDelta(D, wi)
    Update w:
        wi+1 = update(wi, Δw)
Gradient descent update:
    wi+1 = wi − α·∇Err(wi)
where α > 0 is the learning rate, and ∇Err(wi) is the gradient of the training error at wi:
    ∇Err(w) = ( ∂Err(w)/∂w0, ∂Err(w)/∂w1, …, ∂Err(w)/∂wN )ᵀ
Computing the gradient requires going over the entire training data.
The gradient is a vector of partial derivatives. It indicates the direction of steepest increase in Err(w).
Hence the minus sign in the update rule: wi+1 = wi − α·∇Err(wi).
Deriving the gradient of the LMS error with respect to a single weight wi:

    ∂Err(w)/∂wi = ∂/∂wi [ (1/2) Σd∈D (yd − f(xd))² ]
                = (1/2) Σd∈D 2·(yd − f(xd)) · ∂/∂wi (yd − w·xd)
                = − Σd∈D (yd − f(xd)) · xdi
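The last line of the derivation can be verified numerically; here is a NumPy sketch on random made-up data (the finite-difference check is my addition, not from the slides):

import numpy as np

def lms_error(w, X, y):
    # Err(w) = 1/2 * sum over d of (y_d - w.x_d)^2
    return 0.5 * np.sum((y - X @ w) ** 2)

def lms_gradient(w, X, y):
    # From the derivation: dErr/dw_i = -sum over d of (y_d - f(x_d)) * x_di
    return -(y - X @ w) @ X

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)
eps = 1e-6
numeric = np.array([(lms_error(w + eps * e, X, y) - lms_error(w - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, lms_gradient(w, X, y), atol=1e-4))   # True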
Gradient descent for LMS:
Initialize w0 randomly
for i = 0…T:
    Δw = (0, …, 0)
    for every training item d = 1…D:
        f(xd) = wi·xd
        for every component j = 0…N:
            Δwj += α·(yd − f(xd))·xdj
    wi+1 = wi + Δw
return wi+1 when it has converged
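A minimal runnable NumPy version of this loop (a sketch: it assumes X is in canonical form, and the zero initialization, convergence test, and default hyperparameters are my choices):

import numpy as np

def lms_gradient_descent(X, y, alpha=0.01, T=1000, tol=1e-8):
    # Batch gradient descent for the LMS error.
    # X: D x (N+1) data matrix in canonical form (first column all 1s); y: D labels.
    w = np.zeros(X.shape[1])              # initialize w (zeros instead of random: fine for convex LMS)
    for _ in range(T):
        fx = X @ w                        # f(x_d) for every training item
        delta_w = alpha * (y - fx) @ X    # accumulates alpha * (y_d - f(x_d)) * x_dj over all d
        w = w + delta_w                   # w_{i+1} = w_i + delta_w
        if np.linalg.norm(delta_w) < tol: # stop once the update has (approximately) converged
            break
    return w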
Implementing gradient descent: as you go through the training data, you can simply accumulate the change in each component wj of w.
The learning rate α is also called the step size.
More sophisticated algorithms (e.g. conjugate gradient) choose the step size automatically and converge faster.
– When the learning rate is too small, convergence is very slow.
– When the learning rate is too large, we may overshoot the minimum or even diverge.
– You have to experiment to find the right learning rate for your task.
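A toy illustration of these three regimes, on the one-dimensional error Err(w) = w² with gradient 2w (my example, not from the slides):

def final_w(alpha, w0=1.0, steps=20):
    # Run gradient descent on Err(w) = w**2, whose gradient is 2w.
    w = w0
    for _ in range(steps):
        w -= alpha * 2 * w
    return w

print(final_w(0.01))   # ~0.67:  too small, barely moved after 20 steps
print(final_w(0.4))    # ~1e-14: converges quickly
print(final_w(1.1))    # ~38:    too large, overshoots and diverges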
Stochastic gradient descent is an online learning algorithm:
– The learner updates the hypothesis with each training example.
– There is no assumption that we will see the same training examples again.
– Like batch gradient descent, except that we update the weights after seeing each example.
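A minimal NumPy sketch of stochastic gradient descent for the same LMS objective (assumptions as before: canonical X, made-up hyperparameters; in a true streaming setting there would be a single pass over incoming examples rather than epochs):

import numpy as np

def lms_sgd(X, y, alpha=0.01, epochs=10, seed=0):
    # Stochastic gradient descent for LMS: update w after each example,
    # instead of accumulating the gradient over the whole data set.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for d in rng.permutation(len(y)):        # visit training items in random order
            fx = X[d] @ w
            w = w + alpha * (y[d] - fx) * X[d]   # per-example update
    return w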
Too much training data:
– We can't afford to iterate over everything.
Streaming scenario:
– New data will keep coming in.
– You can't assume you have seen everything.
– Also useful for adaptation (e.g. user-specific spam detectors).
Summary of today's key concepts:
– Linear classifiers
– Loss functions
– Gradient descent
– Stochastic gradient descent