CS446 Introduction to Machine Learning (Spring 2015)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446

Lecture 4: Linear Classifiers
Prof. Julia Hockenmaier
juliahmr@illinois.edu
Announcements

Homework 1 will be out after class:
http://courses.engr.illinois.edu/cs446/Homework/HW1.pdf
You have two weeks to complete the assignment.
Late policy:
– You have up to two days of late credit for the whole semester, but we don't give partial late credit: handing in one assignment up to 24 hours late uses one of your two late days.
– We don't accept assignments that are more than 48 hours late.
Is everybody on Compass?
https://compass2g.illinois.edu/
Let us know if you can't see our class.
Recap: decision trees for (binary) classification
– Non-linear classifiers
Learning decision trees (ID3 algorithm):
– Greedy heuristic (based on information gain)
– Originally developed for discrete features
Overfitting:
– What is it? How do we deal with it?
Learning linear classifiers

Batch algorithms:
– Gradient descent for least-mean-squares (LMS) error
Online algorithms:
– Stochastic gradient descent
Linear classifiers are defined over vector spaces.
Every hypothesis f(x) is a hyperplane:
    f(x) = w0 + w·x
The set of points where f(x) = 0 is called the decision boundary:
– Assign ŷ = +1 to all x where f(x) > 0
– Assign ŷ = -1 to all x where f(x) < 0
That is, ŷ = sgn(f(x)).
[Figure: the decision boundary f(x) = 0 in the (x1, x2) plane, with the regions f(x) > 0 and f(x) < 0 on either side]
[Figure: 2×2 truth tables for all 16 Boolean functions of two variables x1, x2; all of them are linearly separable except XOR and XNOR]
With w = (w1, …, wN)ᵀ and x = (x1, …, xN)ᵀ:
    f(x) = w0 + w·x = w0 + Σi=1…N wi·xi
w0 is called the bias term.
The canonical representation folds the bias into the weight vector, redefining
    w = (w0, w1, …, wN)ᵀ and x = (1, x1, …, xN)ᵀ
so that f(x) = w·x.
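To make the canonical representation concrete, here is a minimal NumPy sketch (my illustration, not the course's code; the function names and the AND example are assumptions):

import numpy as np

def add_bias(X):
    # Canonical representation: prepend a constant-1 feature to every input,
    # so the bias w0 becomes an ordinary weight.
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(w, X):
    # Assign y_hat = sgn(f(x)) with f(x) = w.x in canonical form.
    return np.sign(add_bias(X) @ w)

# Hypothetical example: w = (w0, w1, w2) = (-1.5, 1, 1) computes x1 AND x2
# on inputs from {0, 1}^2, since x1 + x2 - 1.5 > 0 only for (1, 1).
w = np.array([-1.5, 1.0, 1.0])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(predict(w, X))   # [-1. -1. -1.  1.]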
[Figure: labeled training data plotted in the (x1, x2) plane, with a decision boundary f(x) = 0 separating the two classes]

Input: labeled training data D = {(x1, y1), …, (xD, yD)} in the sample space X = R², where each yi ∈ {+1, -1}.
Output: a decision boundary f(x) = 0 that separates the training data: yi·f(xi) > 0 for all i.
We need a metric (aka an objective function).
We would like to minimize the probability of misclassifying unseen examples, but we can't measure that probability.
Instead: minimize the number of misclassified training examples.
We need a more specific metric: There may be many models that are consistent with the training data. Loss functions provide such metrics.
An example (x, y) is correctly classified by f(x) if and only if y·f(x) > 0:
Case 1 (y = +1 = ŷ): f(x) > 0 ⇒ y·f(x) > 0
Case 2 (y = -1 = ŷ): f(x) < 0 ⇒ y·f(x) > 0
Case 3 (y = +1 ≠ ŷ = -1): f(x) < 0 ⇒ y·f(x) < 0
Case 4 (y = -1 ≠ ŷ = +1): f(x) > 0 ⇒ y·f(x) < 0
Loss: what penalty do we incur if we misclassify x?
L(y, f(x)) is the loss (aka cost) of classifier f when it assigns the label ŷ = sgn(f(x)) to x.
Plots of L(y, f(x)): the x-axis is typically y·f(x).
Today: 0-1 loss and square loss (more loss functions later).
[Figure: the 0-1 loss plotted as a function of y·f(x)]
0-1 loss:
L(y, f(x)) = 0 iff y = ŷ
L(y, f(x)) = 1 iff y ≠ ŷ
Equivalently, as a function of y·f(x):
L(y·f(x)) = 0 iff y·f(x) > 0 (correctly classified)
L(y·f(x)) = 1 iff y·f(x) < 0 (misclassified)
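In code, the 0-1 loss is a one-liner (a sketch in Python, not from the slides):

def zero_one_loss(y, fx):
    # 0 if (x, y) is correctly classified, i.e. y * f(x) > 0; otherwise 1.
    return 0 if y * fx > 0 else 1

print(zero_one_loss(+1, 0.7))   # 0: correctly classified
print(zero_one_loss(-1, 0.7))   # 1: misclassified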
Square loss:
L(y, f(x)) = (y − f(x))²
Note: L(-1, f(x)) = (-1 − f(x))² = (1 + f(x))² = L(1, -f(x))
(the loss curve for y = -1 is the mirror image of the loss curve for y = +1)
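A quick Python check of the square loss and its mirror property (my sketch, with made-up test values):

def square_loss(y, fx):
    # Square loss: (y - f(x))^2
    return (y - fx) ** 2

# The mirror property from the note above: L(-1, f(x)) == L(1, -f(x))
for fx in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert square_loss(-1, fx) == square_loss(1, -fx)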
[Figure: the square loss plotted as a function of f(x), for y = +1 and y = -1, and as a function of y·f(x)]
[Figure: the 0-1 loss and the square loss compared as functions of y·f(x)]
Linear classification: the hypothesis space is parameterized by w.
Plain English: each w yields a different classifier.
Error, loss, and risk are all functions of w.
[Figure: the (empirical) error plotted over the hypothesis space, with the global minimum marked]
Learning = finding the global minimum of the loss surface
[Figure: an error surface over the hypothesis space with a global minimum, a plateau, and a local minimum]
Finding the global minimum in general is hard
[Figure: a convex error surface over the hypothesis space; its only minimum is the global minimum]
Convex functions have no local minima (every local minimum is also a global minimum)
The risk (aka generalization error) of a classifier f(x) = w·x is its expected loss under the data distribution (i.e. the loss averaged over all possible data sets):
    R(f) = E(x,y)~P(X,Y)[ L(y, f(x)) ]
Ideal learning objective: find an f that minimizes the risk.
We always assume that training and test items are independently and identically distributed (i.i.d.):
– There is a distribution P(X, Y) from which the data D = {(x, y)} is generated. Sometimes it's useful to rewrite P(X, Y) as P(X)·P(Y|X). Usually P(X, Y) is unknown to us (we just know it exists).
– Training and test data are samples drawn from the same P(X, Y): they are identically distributed.
– Each (x, y) is drawn independently from P(X, Y).
The empirical risk of a classifier f(x) = w·x is its average loss on the items in D:
    RD(f) = (1/D) Σi=1…D L(yi, f(xi))
Realistic learning objective: find an f that minimizes the empirical risk.
(Note that the learner can ignore the constant 1/D.)
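As a sketch (mine, with illustrative data), the empirical risk is just the mean loss over D:

import numpy as np

def empirical_risk(w, X, y, loss):
    # Average loss of the linear classifier f(x) = w.x over the training set D.
    # X is assumed to be in canonical form (first column all 1s).
    fx = X @ w
    return np.mean([loss(yi, fi) for yi, fi in zip(y, fx)])

# Example with the 0-1 loss on made-up data:
zero_one = lambda yi, fi: 0.0 if yi * fi > 0 else 1.0
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([+1, -1, -1])
print(empirical_risk(np.array([0.0, 1.0]), X, y, zero_one))   # 0.333... (1 of 3 misclassified)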
Learning: given training data D = {(x1, y1), …, (xD, yD)}, return the classifier f(x) that minimizes the empirical risk RD(f).
Gradient descent is an iterative batch learning algorithm:
– The learner updates the hypothesis based on the entire training data.
– The learner has to go over the training data multiple times.
Goal: minimize the training error/loss.
– At each step: move w in the direction of steepest descent along the error/loss surface.
[Figure: the error surface Error(w) plotted over weight space, with successive iterates w1, w2, w3, w4 descending toward the minimum]
Error(w): error of w on the training data. wi: weight vector at iteration i.
LMS error: the sum of the square loss over all training items (multiplied by 1/2 for convenience):
    Err(w) = (1/2) Σd∈D (yd − f(xd))²
D is fixed, so there is no need to divide by its size.
Goal of learning: find w* = argminw Err(w).
Initialization: initialize w0 (the initial weight vector).
For each iteration i = 0…T:
    Determine by how much to change w, based on the entire data set D:
        Δw = computeDelta(D, wi)
    Update w:
        wi+1 = update(wi, Δw)
Gradient descent update:
    wi+1 = wi − α·∇Err(wi)
where α > 0 is the learning rate, and ∇Err(wi) is the gradient of the training error at wi:
    ∇Err(w) = ( ∂Err(w)/∂w0, ∂Err(w)/∂w1, …, ∂Err(w)/∂wN )ᵀ
Computing the gradient requires going over the entire training data.
The gradient is a vector of partial derivatives. It indicates the direction of steepest increase in Err(w).
Hence the minus sign in the update rule: wi+1 = wi − α·∇Err(wi).
Deriving the gradient of the LMS error with respect to a single weight wi:

    ∂Err(w)/∂wi = ∂/∂wi [ (1/2) Σd∈D (yd − f(xd))² ]
                = (1/2) Σd∈D 2·(yd − f(xd)) · ∂/∂wi (yd − w·xd)
                = − Σd∈D (yd − f(xd)) · xdi
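The last line of the derivation can be verified numerically; here is a NumPy sketch on random made-up data (the finite-difference check is my addition, not from the slides):

import numpy as np

def lms_error(w, X, y):
    # Err(w) = 1/2 * sum over d of (y_d - w.x_d)^2
    return 0.5 * np.sum((y - X @ w) ** 2)

def lms_gradient(w, X, y):
    # From the derivation: dErr/dw_i = -sum over d of (y_d - f(x_d)) * x_di
    return -(y - X @ w) @ X

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)
eps = 1e-6
numeric = np.array([(lms_error(w + eps * e, X, y) - lms_error(w - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, lms_gradient(w, X, y), atol=1e-4))   # True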
Gradient descent for LMS:
Initialize w0 randomly
for i = 0…T:
    Δw = (0, …, 0)
    for every training item d = 1…D:
        f(xd) = wi·xd
        for every component j = 0…N:
            Δwj += α·(yd − f(xd))·xdj
    wi+1 = wi + Δw
return wi+1 when it has converged
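A minimal runnable NumPy version of this loop (a sketch: it assumes X is in canonical form, and the zero initialization, convergence test, and default hyperparameters are my choices):

import numpy as np

def lms_gradient_descent(X, y, alpha=0.01, T=1000, tol=1e-8):
    # Batch gradient descent for the LMS error.
    # X: D x (N+1) data matrix in canonical form (first column all 1s); y: D labels.
    w = np.zeros(X.shape[1])              # initialize w (zeros instead of random: fine for convex LMS)
    for _ in range(T):
        fx = X @ w                        # f(x_d) for every training item
        delta_w = alpha * (y - fx) @ X    # accumulates alpha * (y_d - f(x_d)) * x_dj over all d
        w = w + delta_w                   # w_{i+1} = w_i + delta_w
        if np.linalg.norm(delta_w) < tol: # stop once the update has (approximately) converged
            break
    return w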
Implementing gradient descent: as you go through the training data, you can simply accumulate the change in each component wj of w.
The learning rate α is also called the step size.
More sophisticated algorithms (e.g. conjugate gradient) choose the step size automatically and converge faster.
– When the learning rate is too small, convergence is very slow.
– When the learning rate is too large, we may overshoot the minimum or even diverge.
– You have to experiment to find the right learning rate for your task.
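A toy illustration of these three regimes, on the one-dimensional error Err(w) = w² with gradient 2w (my example, not from the slides):

def final_w(alpha, w0=1.0, steps=20):
    # Run gradient descent on Err(w) = w**2, whose gradient is 2w.
    w = w0
    for _ in range(steps):
        w -= alpha * 2 * w
    return w

print(final_w(0.01))   # ~0.67:  too small, barely moved after 20 steps
print(final_w(0.4))    # ~1e-14: converges quickly
print(final_w(1.1))    # ~38:    too large, overshoots and diverges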
Stochastic gradient descent is an online learning algorithm:
– The learner updates the hypothesis with each training example.
– There is no assumption that we will see the same training examples again.
– Like batch gradient descent, except that we update the weights after seeing each example.
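A minimal NumPy sketch of stochastic gradient descent for the same LMS objective (assumptions as before: canonical X, made-up hyperparameters; in a true streaming setting there would be a single pass over incoming examples rather than epochs):

import numpy as np

def lms_sgd(X, y, alpha=0.01, epochs=10, seed=0):
    # Stochastic gradient descent for LMS: update w after each example,
    # instead of accumulating the gradient over the whole data set.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for d in rng.permutation(len(y)):        # visit training items in random order
            fx = X[d] @ w
            w = w + alpha * (y[d] - fx) * X[d]   # per-example update
    return w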
Too much training data:
– We can't afford to iterate over everything.
Streaming scenario:
– New data will keep coming in.
– You can't assume you have seen everything.
– Also useful for adaptation (e.g. user-specific spam detectors).
Summary of today's key concepts:
– Linear classifiers
– Loss functions
– Gradient descent
– Stochastic gradient descent