CS446 Introduction to Machine Learning (Spring 2015) University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446
Prof. Julia Hockenmaier
juliahmr@illinois.edu
Lecture 10: Large Margin Classifiers
Today's class:
Large margin classifiers:
– Why do we care about the margin?
– Perceptron with margin
– Support Vector Machines
Dealing with outliers:
– Soft margins
[Figures: linearly separable training data with several candidate decision boundaries; the margin m marks the distance from a boundary to the closest points]
These decision boundaries are very close to some items in the training data: they have small margins. Minor changes in the data could lead to different decision boundaries. This decision boundary is as far away from any training item as possible: it has a large margin. Minor changes in the data result in (roughly) the same decision boundary.
Margin = the distance of the decision boundary to the closest items in the training data. We want to find a classifier whose decision boundary is furthest away from the nearest data points.
This additional requirement (bias) reduces the variance (i.e. reduces overfitting).
Decision boundary: the hyperplane with f(x) = 0, i.e. wx + b = 0
Distance of the hyperplane wx + b = 0 to the origin: −b / ||w||
Absolute distance of a point x to the hyperplane wx + b = 0: |wx + b| / ||w||
If the data are linearly separable, y(i)(wx(i) + b) > 0 for all i. Euclidean distance of x(i) to the decision boundary: y(i)(wx(i) + b) / ||w||
Geometric margin (Euclidean distance): y(i)(wx(i) + b) / ||w||
Functional margin: γ = y(i) f(x(i)), i.e. γ = y(i)(wx(i) + b)
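To make the two definitions concrete, here is a minimal NumPy sketch; the weight vector, bias, and data points are made up for illustration and are not from the lecture:

```python
import numpy as np

# Made-up weight vector, bias, and labeled points (one point per row).
w = np.array([2.0, 1.0])
b = -1.0
X = np.array([[1.0, 1.0],
              [0.0, -1.0]])
y = np.array([+1, -1])

functional_margin = y * (X @ w + b)                       # gamma = y(i) (w x(i) + b)
geometric_margin = functional_margin / np.linalg.norm(w)  # divide by ||w||

print(functional_margin)  # [2. 2.]
print(geometric_margin)   # Euclidean distances to the hyperplane wx + b = 0
```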
Rescaling w and b by a factor k to kw and kb does not change the geometric margin (Euclidean distance):

y(i)(wx(i) + b) / ||w||
= y(i)(Σn wn xn(i) + b) / √(Σn wn wn)            (spell out wx and ||w||)
= k · y(i)(Σn wn xn(i) + b) / (k · √(Σn wn wn))  (multiply by k/k)
= y(i)(Σn k·wn xn(i) + kb) / √(Σn k·wn k·wn)     (move k inside)
= y(i)(kwx(i) + kb) / ||kw||                      (geometric margin of x(i) to kwx + kb = 0)
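This invariance is easy to check numerically. A small sketch (w, b, x, y, and the factor k are all made-up values): the geometric margin is unchanged while the functional margin scales by k.

```python
import numpy as np

w, b = np.array([2.0, 1.0]), -1.0   # made-up parameters
x, y = np.array([1.0, 1.0]), +1     # made-up labeled point
k = 3.0                             # arbitrary rescaling factor

def functional_margin(w, b, x, y):
    return y * (w @ x + b)

def geometric_margin(w, b, x, y):
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# Geometric margin is invariant under rescaling; functional margin scales by k.
print(geometric_margin(w, b, x, y), geometric_margin(k * w, k * b, x, y))    # equal
print(functional_margin(w, b, x, y), functional_margin(k * w, k * b, x, y))  # 2.0 vs 6.0
```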
Rescaling w and b by a factor k does change the functional margin γ by a factor k:
γ = y(i)(wx(i) + b)
kγ = y(i)(kwx(i) + kb)
The point that is closest to the decision boundary has functional margin γmin.
– w and b can be rescaled so that γmin = 1
– When learning w and b, we can set γmin = 1 (and still get the same decision boundary)
[Figure: decision boundary wx = 0 with margin m, bounded by the parallel hyperplanes through the closest positive points (wxi = +1 = yi) and the closest negative points (wxj = −1 = yj)]
L(y, f(x)) = max(0, 1 − yf(x))
[Plot: hinge loss as a function of yf(x)]
Case 1: yf(x) ≥ 1 (x outside the margin): hinge loss = 0
Case 2: 0 < yf(x) < 1 (x inside the margin): hinge loss = 1 − yf(x)
Case 3: yf(x) ≤ 0 (x misclassified): hinge loss = 1 − yf(x)
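All three cases are captured by the single expression max(0, 1 − yf(x)). A minimal sketch (the scores yf(x) below are made up to hit each case):

```python
import numpy as np

def hinge_loss(yf):
    """Hinge loss as a function of the functional margin yf(x)."""
    return np.maximum(0.0, 1.0 - yf)

# One made-up score per case: outside the margin, inside the margin, misclassified.
print(hinge_loss(np.array([2.0, 0.5, -1.0])))  # [0.  0.5 2. ]
```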
Standard Perceptron update: update w if ym·(w·xm) < 0
Perceptron with margin update: define a functional margin γ > 0; update w if ym·(w·xm) < γ
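A sketch of a single update step covering both rules; the learning rate, the margin value, and the dummy bias feature are illustrative choices, not prescribed by the lecture:

```python
import numpy as np

def perceptron_update(w, x, y, margin=0.0, lr=1.0):
    """margin=0 gives the standard perceptron; margin=gamma > 0 gives the margin version.
    The update fires whenever y * (w . x) falls below the required functional margin."""
    if y * (w @ x) < margin:
        w = w + lr * y * x
    return w

# Made-up example; the last component is a dummy feature of value 1 acting as the bias.
w = np.zeros(3)
x, y = np.array([1.0, 2.0, 1.0]), +1
w = perceptron_update(w, x, y, margin=1.0)
print(w)  # [1. 2. 1.]
```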
The margin is defined by two parallel hyperplanes:
– one that goes through the positive data points (yj = +1) that are closest to the decision boundary, and
– one that goes through the negative data points (yj = −1) that are closest to the decision boundary.
We can express the separating hyperplane in terms of the data points xj closest to the decision boundary. These data points are called the support vectors.
Perceptrons:
– Weight vector has a bias term w0 (x0 = dummy value 1)
– Decision boundary: wx = 0
SVMs / large margin classifiers:
– Explicit bias term b; weight vector w = (w1…wn)
– Decision boundary: wx + b = 0
The functional margin of the data for (w, b) is determined by the points closest to the hyperplane.
Distance of x(n) to the hyperplane wx + b = 0:
γmin = min_n y(n)(wx(n) + b) / ||w||
Learn w in an SVM = maximize the margin:
argmax_{w,b} [ (1/||w||) min_n y(n)(wx(n) + b) ]
Learn w in an SVM = maximize the margin:
argmax_{w,b} [ (1/||w||) min_n y(n)(wx(n) + b) ]
This is difficult to optimize. Let's convert it to an equivalent problem that is easier.
Learn w in an SVM = maximize the margin:
argmax_{w,b} [ (1/||w||) min_n y(n)(wx(n) + b) ]
Easier equivalent problem:
– We can always rescale w and b without affecting Euclidean distances.
– This allows us to set the functional margin to 1: min_n y(n)(wx(n) + b) = 1
Learn w in an SVM = maximize the margin:
argmax_{w,b} [ (1/||w||) min_n y(n)(wx(n) + b) ]
Easier equivalent problem: a quadratic program
– Setting min_n y(n)(wx(n) + b) = 1 implies y(n)(wx(n) + b) ≥ 1 for all n
– argmax(1/||w||) = argmin ||w|| = argmin(½ w·w)
argmin_{w,b} ½ w·w subject to yi(w·xi + b) ≥ 1 ∀i
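To emphasize that this is an ordinary quadratic program, the sketch below hands it to cvxpy, a generic convex solver (the choice of cvxpy and the toy data are assumptions for illustration; the lecture does not prescribe a solver):

```python
import numpy as np
import cvxpy as cp

# Tiny, made-up, linearly separable training set.
X = np.array([[ 2.0,  2.0],
              [ 1.5,  2.5],
              [-1.0, -1.0],
              [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))    # argmin 1/2 w.w
constraints = [cp.multiply(y, X @ w + b) >= 1]      # y_i (w.x_i + b) >= 1 for all i
cp.Problem(objective, constraints).solve()

print(w.value, b.value)  # maximum-margin separator for the toy data
```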
The name “Support Vector Machine” stems from the fact that w* is supported by (i.e. lies in the linear span of) the examples that are exactly at a distance 1/||w*|| from the separating hyperplane. These vectors are therefore called support vectors.
Theorem: Let w* be the minimizer of the SVM optimization problem for S = {(xi, yi)}. Let I = {i: yi(w*·xi + b) = 1}. Then there exist coefficients αi > 0 such that: w* = Σi∈I αi yi xi
The data items x = (x1…xn) have n features. The weight vector w = (w1…wn) has n elements.
Learning: find a weight wj for each feature xj. Classification: evaluate wx.
Learning: find a weight αj (≥ 0) for each data point xj.
This requires computing the inner product xi·xj between all pairs of data items xi and xj.
Support vectors = the set of data points xj with non-zero weights αj.
w = Σj αj xj
In the primal: compute the inner product between the weight vector and the test item: wx = 〈w, x〉
In the dual: compute inner products between the support vectors and the test item: wx = 〈w, x〉 = 〈Σj αj xj, x〉 = Σj αj 〈xj, x〉
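A minimal sketch of the primal and dual ways of scoring a test item, assuming the support vectors and their weights αj (with the label folded in, as in the formula above) are already known; all numbers are made up:

```python
import numpy as np

# Made-up support vectors and dual weights (labels folded into alpha_j).
support_vectors = np.array([[ 2.0,  2.0],
                            [-1.0, -1.0]])
alpha = np.array([0.4, -0.4])
b = 0.0

def score_primal(x):
    w = alpha @ support_vectors            # w = sum_j alpha_j x_j
    return w @ x + b

def score_dual(x):
    # sum_j alpha_j <x_j, x>: only inner products with the support vectors are needed.
    return alpha @ (support_vectors @ x) + b

x_test = np.array([1.0, 0.5])
print(score_primal(x_test), score_dual(x_test))  # identical values (1.8 and 1.8)
```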
ξi measures by how much example (xi, yi) fails to achieve the required margin.
If xi is on the correct side of the margin: ξi = 0
Otherwise: ξi = |yi − wxi|
If ξi = 1: xi is on the decision boundary wxi = 0
If ξi > 1: xi is misclassified
Replace y(n)(wx(n) + b) ≥ 1 (hard margin) with y(n)(wx(n) + b) ≥ 1− ξ(n) (soft margin)
ξi (slack): how far off is xi from the margin?
C (cost): how much do we have to pay for misclassifying xi?
We want to minimize C·Σi ξi and maximize the margin.
C controls the tradeoff between the margin and the training error.
argmin_{w,b,ξ} ½ w·w + C Σi=1..n ξi
subject to ξi ≥ 0 ∀i and yi(w·xi + b) ≥ 1 − ξi ∀i
Now the optimization problem becomes
min_w ½ ||w||² + C Σ(x,y)∈S max(0, 1 − y·wx)
where the parameter C controls the tradeoff between choosing a large margin (small ||w||) and choosing a small hinge loss.
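Since this form is unconstrained, it can be minimized directly with (sub)gradient descent on the regularizer plus the hinge loss. The sketch below uses plain batch subgradient descent with illustrative hyperparameters and made-up toy data; the stochastic variants mentioned on the next slide work example by example instead:

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, iters=500):
    """Batch subgradient descent on  1/2 ||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i))."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        violators = margins < 1   # points with non-zero hinge loss
        grad = w - C * (y[violators][:, None] * X[violators]).sum(axis=0)
        w -= lr * grad
    return w

# Made-up toy data; a constant third feature plays the role of the bias b.
X = np.array([[ 2.0,  2.0, 1.0],
              [ 1.5,  2.5, 1.0],
              [-1.0, -1.0, 1.0],
              [-2.0, -0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(svm_subgradient_descent(X, y))
```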
Traditional approach: solve the quadratic program.
– This is very slow.
Current approaches: use variants of stochastic gradient descent.
More on Tuesday!