

SLIDE 1

IR&DM ’13/14, 16 January 2014

Chapter IX: Classification*

  • 1. Basic idea
  • 2. Decision trees
  • 3. Naïve Bayes classifier
  • 4. Support vector machines
  • 5. Ensemble methods


* Zaki & Meira: Ch. 18, 19, 21, 22; Tan, Steinbach & Kumar: Ch. 4, 5.3–5.6

SLIDE 2

IX.4 Support vector machines*

  • 1. Basic idea
  • 2. Linear, separable SVM
    2.1. Lagrange multipliers
  • 3. Linear, non-separable SVM
  • 4. Non-linear SVM
    4.1. Kernel method


* Zaki & Meira: Ch. 5 & 21; Tan, Steinbach & Kumar: Ch. 5.5; Bishop: Ch. 7.1

SLIDE 3

Basic idea


  • Find a linear hyperplane (decision boundary) that will separate the classes

(Figure: several candidate decision boundaries B1, B2, … drawn through the two classes.)

Which one is better? How do you define "better"? There are many possible answers.

SLIDE 4

Formal definitions

  • Let the class labels be –1 and +1
  • Let the classification function f be a linear function: f(x) = wTx + b
    – Here w and b are the parameters of the classifier
    – The class of x is sign(f(x))
    – The distance of x to the hyperplane is |f(x)|/||w||
  • The decision boundary of f is the hyperplane z for which f(z) = wTz + b = 0
  • The quality of the classifier is based on its margin
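To make these definitions concrete, here is a minimal NumPy sketch; the weight vector, bias, and data point are made-up values, not from the lecture:

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector (parameter of the classifier)
b = 0.5                     # hypothetical bias
x = np.array([1.0, 3.0])    # a data point to classify

f_x = w @ x + b                           # f(x) = w^T x + b
label = np.sign(f_x)                      # predicted class: sign(f(x))
distance = abs(f_x) / np.linalg.norm(w)   # distance of x to the hyperplane: |f(x)|/||w||
print(label, distance)
```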


SLIDE 5

The margin


(Figure: boundaries B1 and B2 with their margin hyperplanes b11, b12 and b21, b22.)

B1 has the bigger margin ⇒ it is better. The margin is twice the length of the shortest vector perpendicular to the decision boundary, from the decision boundary to a data point.

SLIDE 6

The margin in math


  • Around Bi we have two parallel hyperplanes bi1 and bi2
    – Scale w and b s.t.
      bi1 : wTz + b = 1
      bi2 : wTz + b = –1
  • Let x1 be on bi1 and x2 be on bi2
    – The margin d is the distance from x1 to the hyperplane plus the distance from x2 to the hyperplane: d = 2/||w||
      • (Each of these distances is |f(xi)|/||w|| = 1/||w||, hence d = 2/||w||.)

(Figure: boundaries B1 and B2 with their margin hyperplanes b11, b12, b21, b22.)

This is what we want to maximize!

SLIDE 7

Linear, separable SVM

  • Given the data, we want to find w and b s.t.
    – wTxi + b ≥ 1 if yi = 1
    – wTxi + b ≤ –1 if yi = –1
  • In addition, we want to maximize the margin 2/||w||
    – This equals minimizing f(w) = ||w||²/2

Linear, separable SVM.
  minw ||w||²/2
  subject to yi(wTxi + b) ≥ 1, i = 1, …, N
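For illustration only (the slides solve this via the dual further below), the primal problem can also be handed directly to a generic convex solver. A minimal sketch assuming the cvxpy library and a tiny made-up separable data set:

```python
import cvxpy as cp
import numpy as np

# Made-up separable toy data: rows of X are the x_i, y holds the labels y_i in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(X.shape[1])
b = cp.Variable()
# minimize ||w||^2 / 2  subject to  y_i (w^T x_i + b) >= 1
problem = cp.Problem(cp.Minimize(cp.sum_squares(w) / 2),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)
```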

SLIDE 8

Intermezzo: Lagrange multipliers

  • A method to find extrema of constrained functions via differentiation
  • Problem: minimize f(x) subject to g(x) = 0
    – Without the constraint we could just differentiate f(x)
      • But the extrema we obtain might be infeasible given the constraint
  • Solution: introduce a Lagrange multiplier λ
    – Minimize L(x, λ) = f(x) – λg(x)
    – ∇f(x) – λ∇g(x) = 0
      • ∂L/∂xi = ∂f/∂xi – λ·∂g/∂xi = 0 for all i
      • ∂L/∂λ = g(x) = 0   (this is the constraint!)

SLIDE 9

More on Lagrange multipliers

  • With multiple constraints, we add one multiplier per constraint
    – L(x, λ) = f(x) – ∑j λjgj(x)
    – The function L is known as the Lagrangian
  • Minimizing the unconstrained Lagrangian equals minimizing the constrained f
    – But not all solutions to ∇f(x) – ∑j λj∇gj(x) = 0 are extrema
    – The solution lies on the boundary of the constraint only if λj ≠ 0

SLIDE 10

Example

Minimize f(x, y) = x²y subject to g(x, y) = x² + y² = 3.

L(x, y, λ) = x²y + λ(x² + y² – 3)

∂L/∂x = 2xy + 2λx = 0
∂L/∂y = x² + 2λy = 0
∂L/∂λ = x² + y² – 3 = 0

Solution: x = ±√2, y = –1
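This example is easy to sanity-check with a computer algebra system; a small sketch assuming sympy:

```python
from sympy import symbols, diff, solve

x, y, lam = symbols('x y lambda', real=True)
L = x**2 * y + lam * (x**2 + y**2 - 3)        # the Lagrangian above

# Stationary points: all partial derivatives of L are zero
stationary = solve([diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
for s in stationary:
    print(s[x], s[y], (x**2 * y).subs(s))     # the minima are at x = ±sqrt(2), y = -1, f = -2
```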

SLIDE 11

Karush–Kuhn–Tucker conditions


  • Lagrange multipliers can only handle equality constraints
  • Simple Karush–Kuhn–Tucker (KKT) conditions:
    – the gi (for all i) are affine functions
    – λi ≥ 0 for all i
    – λigi(x) = 0 for all i and locally optimal x
  • If the KKT conditions are satisfied, then minimizing the Lagrangian minimizes f with inequality constraints

SLIDE 12

Solving the linear, separable SVM

Linear, separable SVM.
  minw ||w||²/2 subject to yi(wTxi + b) ≥ 1, i = 1, …, N

Primal Lagrangian (with KKT conditions λi ≥ 0 and λi(yi(wTxi + b) − 1) = 0 for all i):

  Lp = ||w||²/2 − ∑i=1..N λi(yi(wTxi + b) − 1)

Setting the partial derivatives to zero:

  ∂Lp/∂w = 0 ⇒ w = ∑i=1..N λiyixi      (w is a linear combination of the xi’s)
  ∂Lp/∂b = 0 ⇒ ∑i=1..N λiyi = 0        (the signed multipliers λiyi have to sum to 0)

SLIDE 13

From primal to dual to get λi

Primal Lagrangian:

  Lp = ||w||²/2 − ∑i=1..N λi(yi(wTxi + b) − 1)

Substitute ∂Lp/∂w = 0 ⇒ w = ∑i=1..N λiyixi and ∂Lp/∂b = 0 ⇒ ∑i=1..N λiyi = 0 into Lp to obtain the dual Lagrangian:

  Ld = ∑i=1..N λi − (1/2)∑i=1..N ∑j=1..N λiλjyiyjxiTxj

The dual is quadratic in the λi’s, and the training data enters only through the products xiTxj.

Linear, separable SVM, dual form.
  maxλ Ld = ∑i λi – (1/2)∑i,j λiλjyiyjxiTxj
  subject to λi ≥ 0, i = 1, …, N

Standard quadratic optimization methods are used to solve this.
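As an illustration of what such an optimization step might look like (not the course's reference implementation), the dual can be maximized with a generic constrained optimizer. A sketch assuming NumPy/SciPy, with X an N×d data matrix and y the ±1 labels:

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(X, y):
    """Maximize Ld for the linear, separable SVM (illustrative sketch, not optimized)."""
    N = len(y)
    H = (y[:, None] * X) @ (y[:, None] * X).T       # H[i, j] = y_i y_j x_i^T x_j

    def neg_Ld(lam):                                # maximize Ld  <=>  minimize -Ld
        return 0.5 * lam @ H @ lam - lam.sum()

    constraint = {'type': 'eq', 'fun': lambda lam: lam @ y}   # sum_i lambda_i y_i = 0
    bounds = [(0.0, None)] * N                                # lambda_i >= 0
    result = minimize(neg_Ld, np.zeros(N), bounds=bounds, constraints=[constraint])
    return result.x                                           # the multipliers lambda_i
```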

SLIDE 14

Getting the rest…


  • After solving for the λi’s, we can substitute to get w and b
    – w = ∑i=1..N λiyixi
    – For b, by KKT we have λi(yi(wTxi + b) – 1) = 0
    – We get one bi for each non-zero λi
      • Due to numerical problems the bi’s might not all be equal ⇒ take their average
  • With this, we can classify an unseen item x by sign(wTx + b)
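Continuing the sketch from the previous slide (reusing X, y, and the multipliers returned by the hypothetical solve_dual above):

```python
lam = solve_dual(X, y)
sv = lam > 1e-8                                  # support vectors: items with non-zero lambda_i
w = (lam[sv] * y[sv]) @ X[sv]                    # w = sum_i lambda_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)                   # average the b_i = y_i - w^T x_i
predict = lambda x_new: np.sign(x_new @ w + b)   # classify unseen x by sign(w^T x + b)
```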

SLIDE 15

Excuse me sir, but why…

  • …is it called a support vector machine?
  • Most λi’s will be 0
  • If λi > 0, then yi(wTxi + b) = 1
    ⇒ xi lies on a margin hyperplane
    – These xi’s are called support vectors
  • The support vectors define the decision boundary
    – The others have zero coefficients in the linear combination
  • The support vectors are the only things we need to care about!


SLIDE 16

The picture of a support vector


(Figure: boundary B1 with its margin hyperplanes b11 and b12 (and B2 with b21, b22); the data points lying on the margin hyperplanes are the support vectors.)

SLIDE 17

Linear, non-separable SVM


  • What if the data is not linearly separable?
    – Then no hyperplane classifies every training point correctly
SLIDE 18

The slack variables

  • Allow misclassification, but pay for it
  • The cost is defined by slack variables ξi ≥ 0
    – Change the optimization constraints to yi(wTxi + b) ≥ 1 – ξi
      • If ξi = 0, this is as before
      • If 0 < ξi < 1, the point xi is correctly classified but lies within the margin
      • If ξi ≥ 1, the point is on the decision boundary or on the wrong side of it
  • We want to maximize the margin and minimize the slack variables


SLIDE 19

Linear, non-separable SVM

  • The constants C and k define the cost of misclassification
    – If C = 0, slack is free and misclassification is not penalized at all
    – If C → ∞, the width of the margin doesn’t matter and no slack is tolerated
    – k is typically either 1 or 2
      • k = 1 gives the hinge loss
      • k = 2 gives the quadratic loss

Linear, non-separable SVM.
  minw,ξ (||w||²/2 + C ∑i(ξi)^k)
  subject to yi(wTxi + b) ≥ 1 – ξi, i = 1, …, N
             ξi ≥ 0, i = 1, …, N
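In practice this soft-margin problem is solved by library code; for instance, scikit-learn's SVC exposes exactly this C (with the hinge loss, k = 1). A minimal sketch with made-up, slightly overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, not linearly separable toy data
X = np.array([[1.0, 1.0], [2.0, 2.0], [1.5, 0.5],
              [-1.0, -1.0], [-2.0, -1.5], [0.2, 0.1]])
y = np.array([1, 1, -1, -1, -1, 1])

clf = SVC(kernel='linear', C=1.0)    # C is the cost of slack / misclassification
clf.fit(X, y)
print(clf.coef_, clf.intercept_)     # the learned w and b
print(clf.support_)                  # indices of the support vectors
```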

SLIDE 20

Lagrangian with slack variables and k = 1

  • The Lagrange multipliers are λi and µi
    – λi(yi(wTxi + b) – 1 + ξi) = 0 with λi ≥ 0
    – µi(ξi – 0) = 0 with µi ≥ 0
  • The primal Lagrangian is

  Lp = ||w||²/2 + C ∑i=1..N ξi − ∑i=1..N λi(yi(wTxi + b) − 1 + ξi) − ∑i=1..N µiξi

    – The first two terms are the objective function, the remaining terms encode the constraints

SLIDE 21

The dual

Partial derivatives of the primal Lagrangian:

  ∂Lp/∂w = w − ∑i=1..N λiyixi = 0 ⇒ w = ∑i=1..N λiyixi
  ∂Lp/∂b = −∑i=1..N λiyi = 0
  ∂Lp/∂ξi = C − λi − µi = 0 ⇒ λi + µi = C

Substituting these into the Lagrangian gives the dual Lagrangian

  Ld = ∑i=1..N λi − (1/2)∑i=1..N ∑j=1..N λiλjyiyjxiTxj

which is the same as before!

Linear, non-separable SVM, dual form.
  maxλ Ld = ∑i λi – (1/2)∑i,j λiλjyiyjxiTxj
  subject to 0 ≤ λi ≤ C, i = 1, …, N

SLIDE 22

Weight vector and bias


  • The support vectors are again those xi with λi > 0
    – A support vector xi can lie on the margin or have positive slack ξi
  • The weight vector w is as before: w = ∑i λiyixi
  • µi = C – λi ⇒ (C – λi)ξi = 0
    – The support vectors that lie on the margin are those with λi < C (then C – λi > 0 forces ξi = 0)
    – Therefore we can solve the bias b as the average of the bi’s: bi = yi – wTxi

SLIDE 23

Non-linear SVM (a.k.a. kernel SVM)


What if the decision boundary is not linear?

SLIDE 24

Transforming data


Transform the data into a higher-dimensional space

(Figure: the same data after a transformation; the transformed axis is labeled (x1 + x2)⁴.)

SLIDE 25

The kernel method


  • A non-linear decision boundary can be linear in a higher-dimensional space
  • How do we transform the data?
    – Non-linear transformation Φ : ℝn → ℝm, m > n
    – E.g. Φ(x1, x2) = (x1², x2², √2x1, √2x2, 1)
    – Now wTΦ(x) = w4x1² + w3x2² + w2√2x1 + w1√2x2 + w0
  • We want to work with the scalar product Φ(x)TΦ(y)
    – This helps if computing Φ(x) is expensive (or impossible)
    – Or if Φ(x) causes the curse of dimensionality

SLIDE 26

The kernel

  • We replace the scalar product Φ(x)TΦ(y) with a kernel K(x, y) = Φ(x)TΦ(y)
    – The kernel must be positive semidefinite:
      • K(x, y) = K(y, x) (symmetry)
      • ∑i=1..n ∑j=1..n aiajK(xi, xj) ≥ 0 for any non-empty {x1, …, xn} and any coefficients ai
  • Example:
    – Φ(x1, x2) = (1, √2x1, √2x2, √2x1x2, x1², x2²)T
    – K(x, y) = 1 + 2x1y1 + 2x2y2 + 2x1y1x2y2 + x1²y1² + x2²y2²
  • The kernel method is not limited to SVMs!
    – Any method that only requires scalar products of features can use kernels
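The example can be verified numerically: the explicit mapping Φ and the closed-form kernel (xTy + 1)² give the same scalar product. A small NumPy sketch with arbitrarily chosen vectors:

```python
import numpy as np

def phi(v):
    """The explicit feature map from the slide: (1, √2·x1, √2·x2, √2·x1x2, x1², x2²)."""
    x1, x2 = v
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2, x1**2, x2**2])

def K(x, y):
    """The corresponding kernel, computed without the mapping: (x^T y + 1)^2."""
    return (x @ y + 1.0) ** 2

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])
print(phi(x) @ phi(y), K(x, y))   # both print the same number
```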

SLIDE 27

Some kernel functions

  • The (inhomogeneous) quadratic kernel (in ℝ2):
    – K(x, y) = 1 + 2x1y1 + 2x2y2 + 2x1y1x2y2 + x1²y1² + x2²y2² = (xTy + 1)²
    – In general, K(x, y) = (xTy + 1)^p
  • The Gaussian kernel:
    – K(x, y) = exp(−||x − y||²/(2σ²))
    – The mapping Φ corresponding to the Gaussian kernel has infinite dimensionality
  • The sigmoid kernel:
    – K(x, y) = tanh(κxTy − δ)
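Each of these kernels is a one-liner; a sketch with hypothetical parameter values:

```python
import numpy as np

def polynomial_kernel(x, y, p=2):
    return (x @ y + 1.0) ** p                               # (x^T y + 1)^p

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))   # exp(-||x-y||^2 / (2 sigma^2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (x @ y) - delta)                 # tanh(kappa x^T y - delta)
```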

SLIDE 28

Kernels and non-linear SVM

Non-linear, non-separable SVM.
  minw,ξ (||w||²/2 + C ∑i(ξi)^k)
  subject to yi(wTΦ(xi) + b) ≥ 1 – ξi, i = 1, …, N
             ξi ≥ 0, i = 1, …, N

Dual Lagrangian:

  Ld = ∑i=1..N λi − (1/2)∑i=1..N ∑j=1..N λiλjyiyjΦ(xi)TΦ(xj)
     = ∑i=1..N λi − (1/2)∑i=1..N ∑j=1..N λiλjyiyjK(xi, xj)

The scalar products Φ(xi)TΦ(xj) appear only through the kernel K(xi, xj).

SLIDE 29

Solving weight and bias with kernel

  • The weight vector w = ∑i=1..N λiyiΦ(xi) still contains Φ, so we substitute it away wherever it appears
  • The bias (with n = number of support vectors) can then be written using only the kernel:
    b = (1/n)(∑i:λi>0 yi − ∑i:λi>0 wTΦ(xi))
      = (1/n)(∑i:λi>0 yi − ∑i:λi>0 ∑j:λj>0 λiyiK(xi, xj))
  • To classify a new point z, substitute w again:
    ŷ = sign(wTΦ(z) + b) = sign(∑i:λi>0 λiyiK(xi, z) + b)
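A sketch of this prediction step, assuming the multipliers have already been obtained from the kernelized dual; the argument names and the gaussian_kernel helper above are illustrative:

```python
import numpy as np

def kernel_bias(X_sv, y_sv, lam_sv, kernel):
    """b = (1/n) (sum_i y_i - sum_i sum_j lambda_j y_j K(x_j, x_i)), n = #support vectors."""
    n = len(y_sv)
    total = sum(yi - sum(lj * yj * kernel(xj, xi)
                         for lj, yj, xj in zip(lam_sv, y_sv, X_sv))
                for yi, xi in zip(y_sv, X_sv))
    return total / n

def kernel_predict(z, X_sv, y_sv, lam_sv, b, kernel):
    """Classify z as sign(sum_i lambda_i y_i K(x_i, z) + b), using only the support vectors."""
    s = sum(li * yi * kernel(xi, z) for li, yi, xi in zip(lam_sv, y_sv, X_sv))
    return np.sign(s + b)
```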

SLIDE 30

Summary of SVM


  • Can find a globally optimal solution to the loss function
  • Maximizing the margin helps against overfitting
    – But a wrong choice of the constants C and k will have adverse effects
  • Can handle non-linear data
    – A kernel function must be chosen
    – Kernels are applicable to other methods, too
  • Can be extended to categorical data and multiple classes

SLIDE 31

IX.5 Ensemble methods*

  • 1. Basic idea
  • 2. Bagging
  • 3. Boosting
    3.1. AdaBoost

* Zaki & Meira: Ch. 22; Tan, Steinbach & Kumar: Ch. 5.6; Bishop: Ch. 14.2–3

SLIDE 32

Basic idea

  • Suppose we have multiple classifiers for the data
    – Each is good in some parts of the data and bad in others
  • Can we get better results by combining these classifiers?
  • How can we combine them?
    – Simple committee solution: take the majority label
    – If we have confidence values for the classifiers, we can weight their votes and take the weighted majority label

SLIDE 33

Rationale of ensembles

  • 25 binary classifiers (base classifiers)
    – Each base classifier has error rate 0.35
    – Majority vote selects the class label
  • If the base classifiers are identical, the ensemble will have error rate 0.35
  • If the base classifiers are independent, the ensemble errs only when at least 13 of the 25 base classifiers err, so its error rate is Pr[X ≥ 13] with X ~ Binom(25, 0.35):

    Pr[X ≥ 13] = ∑i=13..25 (25 choose i) 0.35^i (1 − 0.35)^(25−i) ≈ 0.06

  • Two conditions:
    – Base classifiers must be (reasonably) independent
    – Base classifiers must do better than purely random guessing
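The 0.06 figure is quick to reproduce; a one-line check with SciPy:

```python
from scipy.stats import binom

# Probability that at least 13 of 25 independent base classifiers (error rate 0.35) err
ensemble_error = 1 - binom.cdf(12, n=25, p=0.35)
print(ensemble_error)   # about 0.060
```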
SLIDE 34

Ensemble error example

(Figure, n = 25, p = 0.35: ensemble classifier error plotted as a function of the base classifier error.)

SLIDE 35

How to make independent classifiers


  • Manipulate the training set
    – Bagging
    – Boosting
  • Manipulate the input features
    – Random forest
  • Manipulate the class labels
    – Different splits from multi-class to two-class
  • Manipulate the learning algorithm
    – Add randomness

SLIDE 36

Bagging (a.k.a. bootstrapping)

  • Sample the data uniformly at random with replacement
    – Each sample Di has the same size as the original data D
    – Each data point x ∈ D has Pr[x ∈ Di] = 1 – (1 – 1/|D|)^|D| → 1 – 1/e ≈ 0.632 as |D| → ∞
  • The final classifier usually uses majority voting over the classifiers trained on the samples

Image: http://www.lemen.com
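The 0.632 figure is easy to confirm empirically; a small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.arange(100_000)                          # indices of the original data set
Di = rng.choice(D, size=len(D), replace=True)   # one bootstrap sample of the same size
print(len(np.unique(Di)) / len(D), 1 - 1/np.e)  # both are close to 0.632
```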

SLIDE 37

Bagging example


Training data (10 points; a single one-split decision tree reaches at most 70% accuracy on it):

  x:  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
  y:  +1   +1   +1   –1   –1   –1   –1   +1   +1   +1

10 bagging samples are drawn and a one-split tree (e.g. a split at x ≤ 0.35) is fit to each, for example:

  x:  0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 1.0    y:  +1 +1 +1 +1 –1 –1 –1 –1 +1 +1
  x:  0.1 0.2 0.3 0.4 0.5 0.8 0.9 1.0 1.0 1.0    y:  +1 +1 +1 –1 –1 +1 +1 +1 +1 +1
  x:  0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9    y:  +1 +1 +1 +1 +1 +1 +1 +1 +1 +1

Summing the ten trees’ votes gives:

  x:  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
  Σ:   2    2    2   –6   –6   –6   –6    2    2    2

The sign of the sum recovers every label, so the bagged one-split trees together act like a two-level decision tree.
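A compact simulation of this example with hand-rolled decision stumps (the individual votes depend on the random seed, but the majority vote typically recovers all ten labels):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1, 11) / 10.0                          # the slide's x values 0.1 ... 1.0
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])     # the slide's labels

def fit_stump(xs, ys):
    """Best one-split tree: one label at or below a threshold, the other above."""
    best = None
    for thr in np.unique(xs):
        for left in (1, -1):
            acc = np.mean(np.where(xs <= thr, left, -left) == ys)
            if best is None or acc > best[0]:
                best = (acc, thr, left)
    return best[1], best[2]

votes = np.zeros_like(y, dtype=float)
for _ in range(10):                                  # 10 bagging rounds
    idx = rng.integers(0, len(x), len(x))            # bootstrap sample, with replacement
    thr, left = fit_stump(x[idx], y[idx])
    votes += np.where(x <= thr, left, -left)         # each stump votes on every point

print(votes, np.sign(votes) == y)
```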

SLIDE 38

Boosting


  • In bagging, each item gets sampled uniformly at random, independently of how hard it is to classify
    – Easy items don’t need ensembles
    – Shouldn’t we concentrate on the hard cases?
  • Boosting adds weights to the training items
    – Items that get misclassified more often receive bigger weights
    – The weights can be used to learn biased classifiers
      • Pay more for misclassifying heavier items
    – The weights can also be used to bias the bootstrap sampling
      • Heavier items get selected more often
SLIDE 39

Basic idea

  • 1. Initialize all weights to 1/N
    – Uniform distribution
  • 2. Perform classification
  • 3. Increase the weights of the misclassified items and reduce the weights of the correctly classified items
  • 4. Aggregate the predictions (steps 2–3 are repeated for each base classifier)
  • Methods differ in how the weights are changed and how the aggregation works

SLIDE 40

AdaBoost

  • Let Ci be a base classifier
    – The error rate of Ci is εi = (1/N) ∑j=1..N wj·1(Ci(xj) ≠ yj)
      • 1(p) = 1 if p is true and 0 otherwise
      • wj is the weight of training item (xj, yj)
    – The importance of Ci is αi = ln(1/εi − 1)/2
    – The weight of (xj, yj) for iteration i+1 is
      wj(i+1) = (wj(i)/Zi) · e^(−αi) if Ci(xj) = yj,  and (wj(i)/Zi) · e^(αi) if Ci(xj) ≠ yj
      • Zi is a normalization constant s.t. the wj’s sum to 1
    – If the error rate goes above 0.5, all weights are reset to 1/N
  • For aggregation, each classifier’s vote is weighted by αi
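A sketch of one AdaBoost round following these formulas (simplified: the weights are kept normalized to sum to 1, so the weighted error is computed directly from them):

```python
import numpy as np

def adaboost_round(w, y_true, y_pred):
    """One weight update: returns the new weights and the classifier's importance alpha."""
    miss = (y_pred != y_true)
    eps = max(np.sum(w * miss), 1e-12)        # weighted error (clamped to avoid division by zero)
    if eps > 0.5:                             # failed round: reset all weights to 1/N
        return np.full_like(w, 1.0 / len(w)), 0.0
    alpha = 0.5 * np.log(1.0 / eps - 1.0)     # importance: alpha = ln(1/eps - 1) / 2
    w = w * np.exp(np.where(miss, alpha, -alpha))   # up-weight misses, down-weight hits
    return w / w.sum(), alpha                 # dividing by the sum plays the role of Z_i
```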

SLIDE 41

Error rate

  • Let εi be the error rate of classifier i in boosting
    – Assume εi < 0.5 for all i and write εi = 0.5 – γi
  • The total error rate εE of the ensemble can then be bounded by

    εE ≤ ∏i √(εi(1 − εi)) = ∏i √(1/4 − γi²) = exp(−O(∑i γi²))

    – The error decreases exponentially if γi > γ* > 0 for all i ⇒ fast convergence
SLIDE 42

Summary of ensemble methods

  • Combining many classifiers usually helps
    – The base classifiers must be reasonably independent
  • Bagging is simple, but doesn’t improve bad classifiers much
  • Boosting helps in particular with almost-random classifiers
    – Boosting is prone to overfitting
  • How many classifiers can be combined depends on how much time we can spend