Lecture 9: Naïve Bayes Classifier (cont'd.), Logistic Regression



slide-1
SLIDE 1

Lecture 9:

− Naïve Bayes Classifier (cont'd.)
− Logistic Regression
− Discriminative vs. Generative Classification
− Linear Discriminant Functions

Aykut Erdem

October 2016 Hacettepe University

slide-2
SLIDE 2

Last time… Naïve Bayes Classifier

Given:
− Class prior P(Y)
− d conditionally independent features X1, …, Xd given the class label Y
− For each feature Xi, the conditional likelihood P(Xi|Y)

Naïve Bayes decision rule:

$$y^* = h_{NB}(x) = \arg\max_y P(y) \prod_{i=1}^{d} P(x_i \mid y)$$

slide by Barnabás Póczos & Aarti Singh
slide-3
SLIDE 3

Last time… Naïve Bayes Algorithm for discrete features

NB prediction for test data:

$$y^* = \arg\max_y \hat P(y) \prod_{i=1}^{d} \hat P(x_i \mid y)$$

We need to estimate these probabilities!

Estimators (MLE):
− For the class prior: $\hat P(Y = y) = \dfrac{\#\{j : y^j = y\}}{n}$
− For the likelihood: $\hat P(X_i = x \mid Y = y) = \dfrac{\#\{j : x_i^j = x,\ y^j = y\}}{\#\{j : y^j = y\}}$

slide by Barnabás Póczos & Aarti Singh
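To make the estimators concrete, here is a minimal sketch in Python/NumPy of discrete-feature Naïve Bayes built on the counts above; all names are illustrative. With alpha=0 it is the plain MLE; alpha=1 gives Laplace (add-one) smoothing, which avoids zero probabilities.

```python
import numpy as np

def train_nb(X, y, n_values, n_classes, alpha=1.0):
    """Estimate class prior P(Y) and likelihoods P(X_i | Y) by counting.

    X: (n, d) integer feature matrix; y: (n,) integer labels.
    alpha=0 gives the plain MLE; alpha=1 gives Laplace smoothing.
    """
    n, d = X.shape
    prior = np.bincount(y, minlength=n_classes) / n
    likelihood = np.zeros((n_classes, d, n_values))  # [k, i, v] = P(X_i=v | Y=k)
    for k in range(n_classes):
        Xk = X[y == k]
        for i in range(d):
            counts = np.bincount(Xk[:, i], minlength=n_values)
            likelihood[k, i] = (counts + alpha) / (len(Xk) + alpha * n_values)
    return prior, likelihood

def predict_nb(x, prior, likelihood):
    """Decision rule: argmax_y P(y) * prod_i P(x_i | y), computed in log space."""
    d = len(x)
    log_post = [np.log(prior[k]) + np.log(likelihood[k, np.arange(d), x]).sum()
                for k in range(len(prior))]
    return int(np.argmax(log_post))
```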
slide-4
SLIDE 4

Last time… Text Classification

Example: assign a MEDLINE article to a category in the MeSH subject category hierarchy, e.g.:
  • Antagonists and Inhibitors
  • Blood Supply
  • Chemistry
  • Drug Therapy
  • Embryology
  • Epidemiology

slide by Dan Jurafsky
slide-5
SLIDE 5

Last time… Bag of words model

Typical additional assumption: position in the document doesn't matter: P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
− "Bag of words" model: the order of words on the page is ignored; the document is just a bag of i.i.d. words
− Sounds really silly, but often works very well!

The probability of a document with words x1, x2, …:

$$P(x_1, \dots, x_{\text{LengthDoc}} \mid y) = \prod_{i=1}^{\text{LengthDoc}} P(x_i \mid y)$$

With a 50,000-word vocabulary, this leaves K(50000−1) parameters to estimate.

slide by Barnabás Póczos & Aarti Singh
slide-6
SLIDE 6

The bag of words representation

"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet."

γ( document ) = c

slide by Dan Jurafsky
slide-8
SLIDE 8

x love xxxxxxxxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxxx great xxxxxxx xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxx recommend xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx x several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx

γ( document ) = c

The bag of words representation: using a subset of words

slide by Dan Jurafsky
slide-9
SLIDE 9

The bag of words representation

γ( document ) = c, with the document reduced to word counts:

great: 2
love: 2
recommend: 1
laugh: 1
happy: 1
…

slide by Dan Jurafsky
slide-10
SLIDE 10

Training and test data:

Doc | Words | Class
Training 1 | Chinese Beijing Chinese | c
Training 2 | Chinese Chinese Shanghai | c
Training 3 | Chinese Macao | c
Training 4 | Tokyo Japan Chinese | j
Test 5 | Chinese Chinese Chinese Tokyo Japan | ?

Estimators (with add-one smoothing):

$$\hat P(c) = \frac{N_c}{N} \qquad \hat P(w \mid c) = \frac{\mathrm{count}(w,c)+1}{\mathrm{count}(c)+|V|}$$

Priors: P(c) = 3/4, P(j) = 1/4

Conditional probabilities:
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c) = (0+1) / (8+6) = 1/14
P(Japan|c) = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j) = (1+1) / (3+6) = 2/9
P(Japan|j) = (1+1) / (3+6) = 2/9

Choosing a class:
P(c|d5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001

slide by Dan Jurafsky
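The worked example above is easy to check in code. A minimal sketch in plain Python (names are illustrative) that reproduces the add-one-smoothed estimates and both scores:

```python
from collections import Counter

train = [("c", "Chinese Beijing Chinese"),
         ("c", "Chinese Chinese Shanghai"),
         ("c", "Chinese Macao"),
         ("j", "Tokyo Japan Chinese")]
test = "Chinese Chinese Chinese Tokyo Japan".split()

tokens = {"c": [], "j": []}            # all training tokens per class
for label, text in train:
    tokens[label] += text.split()
vocab_size = len({w for ws in tokens.values() for w in ws})   # |V| = 6

for c in ("c", "j"):
    prior = sum(1 for label, _ in train if label == c) / len(train)
    counts = Counter(tokens[c])
    score = prior
    for w in test:
        # add-one smoothing: (count(w,c) + 1) / (count(c) + |V|)
        score *= (counts[w] + 1) / (len(tokens[c]) + vocab_size)
    print(c, score)   # c: ~0.0003, j: ~0.0001, so class c wins
```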
slide-11
SLIDE 11

20 Newsgroups results

Naïve Bayes: 89% accuracy

slide by Barnabás Póczos & Aarti Singh
slide-12
SLIDE 12

What if features are continuous? e.g., character recognition: Xi is the intensity at the i-th pixel.

Gaussian Naïve Bayes (GNB):

$$P(X_i = x \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu_{ik})^2}{2\sigma_{ik}^2}\right)$$

Different mean and variance for each class k and each pixel i. Sometimes we assume the variance is
  • independent of Y (i.e., σi),
  • or independent of Xi (i.e., σk),
  • or both (i.e., σ).

slide by Barnabás Póczos & Aarti Singh

slide-14
SLIDE 14

Estimating parameters: Y discrete, Xi continuous

Maximum likelihood estimates:

$$\hat\mu_{ik} = \frac{1}{\sum_j \delta(Y^j = y_k)} \sum_j X_i^j\,\delta(Y^j = y_k)$$

$$\hat\sigma_{ik}^2 = \frac{1}{\sum_j \delta(Y^j = y_k)} \sum_j \big(X_i^j - \hat\mu_{ik}\big)^2\,\delta(Y^j = y_k)$$

where j indexes the training images, $X_i^j$ is the i-th pixel in the j-th training image, and $y_k$ is the k-th class.

slide by Barnabás Póczos & Aarti Singh
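A quick sketch of these MLEs in Python/NumPy (illustrative names):

```python
import numpy as np

def gnb_mle(X, y, n_classes):
    """Per-class, per-feature Gaussian MLEs.

    mu[k, i]  = mean of feature i over training points with label k
    var[k, i] = MLE variance (divides by N_k, matching the ML estimate)
    """
    d = X.shape[1]
    mu = np.zeros((n_classes, d))
    var = np.zeros((n_classes, d))
    for k in range(n_classes):
        Xk = X[y == k]            # the delta(Y^j = y_k) selection
        mu[k] = Xk.mean(axis=0)
        var[k] = Xk.var(axis=0)   # np.var defaults to ddof=0, i.e. the MLE
    return mu, var
```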
slide-15
SLIDE 15

Case Study: 
 Classifying Mental States

slide-16
SLIDE 16

Example: GNB for classifying mental states

16

[Mitchell et al.]

  • ~1 mm resolution
  • ~2 images per sec.
  • 15,000 voxels/image
  • non-invasive, safe
  • measures Blood Oxygen Level Dependent (BOLD) response

slide by Barnabás Póczos & Aarti Singh
slide-17
SLIDE 17
  • Brain scans can track activation with precision and sensitivity

slide by Barnabás Póczos & Aarti Singh
slide-18
SLIDE 18

Learned Naïve Bayes Models 
 – Means for P(BrainActivity | WordCategory)

18

Pairwise classification accuracy: 78-99%, 12 participants

[Figure: learned mean activation maps P(BrainActivity | WordCategory) for "Tool words" vs. "Building words"]

[Mitchell et al.]

slide by Barnabás Póczos & Aarti Singh
slide-19
SLIDE 19

What you should know…

Naïve Bayes classifier

  • What’s the assumption
  • Why we use it
  • How do we learn it
  • Why is Bayesian (MAP) estimation important?


Text classification

  • Bag of words model

Gaussian NB

  • Features are still conditionally independent
  • Each feature has a Gaussian distribution given class
slide by Barnabás Póczos & Aarti Singh
slide-20
SLIDE 20

Logistic Regression

20
slide-21
SLIDE 21


Last time… Naïve Bayes

  • NB Assumption:

$$P(X_1,\dots,X_d \mid Y) = \prod_{i=1}^{d} P(X_i \mid Y)$$

  • NB Classifier:

$$f_{NB}(x) = \arg\max_y P(y) \prod_{i=1}^{d} P(x_i \mid y)$$

  • Assume a parametric form for P(Xi|Y) and P(Y)
  • Estimate the parameters using MLE/MAP and plug in

slide by Aarti Singh & Barnabás Póczos
slide-22
SLIDE 22

Gaussian Naïve Bayes (GNB)

  • There are several distributions that can lead to a linear boundary.
  • As an example, consider Gaussian Naïve Bayes with Gaussian class conditional densities:

$$P(X_i = x \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu_{ik})^2}{2\sigma_{ik}^2}\right)$$

  • What if we assume the variance is independent of the class, i.e. $\sigma_{ik} = \sigma_i$?

slide by Aarti Singh & Barnabás Póczos
slide-23
SLIDE 23

GNB with equal variance is a Linear Classifier!

Decision boundary:

$$\prod_{i=1}^{d} P(X_i \mid Y=0)\, P(Y=0) \;=\; \prod_{i=1}^{d} P(X_i \mid Y=1)\, P(Y=1)$$

slide by Aarti Singh & Barnabás Póczos

slide-24
SLIDE 24

GNB with equal variance is a Linear Classifier!

Decision boundary:

$$\prod_{i=1}^{d} P(X_i \mid Y=0)\, P(Y=0) = \prod_{i=1}^{d} P(X_i \mid Y=1)\, P(Y=1)$$

Taking the log of the ratio (with π = P(Y = 1)):

$$\log \frac{P(Y=0)\prod_{i=1}^{d} P(X_i \mid Y=0)}{P(Y=1)\prod_{i=1}^{d} P(X_i \mid Y=1)} = \log\frac{1-\pi}{\pi} + \sum_{i=1}^{d} \log\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}$$

slide by Aarti Singh & Barnabás Póczos

slide-25
SLIDE 25

GNB with equal variance is a Linear Classifier!

Decision boundary:

$$\prod_{i=1}^{d} P(X_i \mid Y=0)\, P(Y=0) = \prod_{i=1}^{d} P(X_i \mid Y=1)\, P(Y=1)$$

Taking the log of the ratio (with π = P(Y = 1)):

$$\log \frac{P(Y=0)\prod_{i=1}^{d} P(X_i \mid Y=0)}{P(Y=1)\prod_{i=1}^{d} P(X_i \mid Y=1)} = \underbrace{\log\frac{1-\pi}{\pi}}_{\text{constant term}} + \underbrace{\sum_{i=1}^{d} \log\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}}_{\text{first-order terms}}$$

slide by Aarti Singh & Barnabás Póczos
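To make the two labelled terms explicit, here is the per-feature log ratio written out for Gaussian class conditionals with class-independent variance σi (a standard computation, included here for completeness):

```latex
\log\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
  = \frac{(X_i-\mu_{i1})^2-(X_i-\mu_{i0})^2}{2\sigma_i^2}
  = \underbrace{\frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}}_{\text{constant term}}
  + \underbrace{\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}}_{\text{first-order term}}\, X_i
```

Each summand is linear in Xi, so the decision boundary is a hyperplane.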

slide-26
SLIDE 26

Gaussian Naïve Bayes (GNB) Decision Boundary

$$X = (x_1, x_2), \qquad P_1 = P(Y=0), \qquad P_2 = P(Y=1)$$

$$p_1(X) = p(X \mid Y=0) \sim \mathcal N(M_1, \Sigma_1), \qquad p_2(X) = p(X \mid Y=1) \sim \mathcal N(M_2, \Sigma_2)$$

slide by Aarti Singh & Barnabás Póczos

slide-27
SLIDE 27

Generative vs. Discriminative Classifiers

  • Generative classifiers (e.g. Naïve Bayes)
    − Assume some functional form for P(X,Y) (or for P(X|Y) and P(Y))
    − Estimate the parameters of P(X|Y), P(Y) directly from training data
    − But arg max_Y P(X|Y) P(Y) = arg max_Y P(Y|X)
    − Why not learn P(Y|X) directly? Or better yet, why not learn the decision boundary directly?

  • Discriminative classifiers (e.g. Logistic Regression)
    − Assume some functional form for P(Y|X) or for the decision boundary
    − Estimate the parameters of P(Y|X) directly from training data

slide by Aarti Singh & Barnabás Póczos
slide-28
SLIDE 28

Logistic Regression

Assumes the following functional form for P(Y|X), the logistic function applied to a linear function of the data:

$$P(Y=1 \mid X) = \frac{1}{1+\exp\!\big({-(w_0 + \sum_{i=1}^{d} w_i X_i)}\big)}$$

Logistic function (or Sigmoid): $\dfrac{1}{1+e^{-z}}$, an S-shaped curve in z.

Features can be discrete or continuous!

slide by Aarti Singh & Barnabás Póczos
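As a minimal sketch in Python/NumPy (names illustrative), the functional form is just a sigmoid of a linear score:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w, w0):
    """P(Y=1 | X=x): the logistic function applied to a linear function of x."""
    return sigmoid(w0 + np.dot(w, x))
```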

slide-29
SLIDE 29

Logistic Regression is a Linear Classifier!

Assumes the functional form for P(Y|X) above. Decision boundary:

$$P(Y=1 \mid X) \gtrless P(Y=0 \mid X) \iff w_0 + \sum_{i=1}^{d} w_i X_i \gtrless 0$$

so $w_0 + \sum_i w_i X_i = 0$ is a linear decision boundary.

slide by Aarti Singh & Barnabás Póczos

slide-30
SLIDE 30

Logistic Regression is a Linear Classifier!

Assumes the following functional form for P(Y|X):

$$P(Y=1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}{1+\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}, \qquad P(Y=0 \mid X) = \frac{1}{1+\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}$$

slide by Aarti Singh & Barnabás Póczos

slide-31
SLIDE 31

Logistic Regression for more than 2 classes

  • Logistic regression in the more general case, where Y ∈ {y1, …, yK}:

for k < K:

$$P(Y=y_k \mid X) = \frac{\exp\big(w_{k0} + \sum_{i=1}^{d} w_{ki} X_i\big)}{1 + \sum_{j=1}^{K-1}\exp\big(w_{j0} + \sum_{i=1}^{d} w_{ji} X_i\big)}$$

for k = K (normalization, so no weights for this class):

$$P(Y=y_K \mid X) = \frac{1}{1 + \sum_{j=1}^{K-1}\exp\big(w_{j0} + \sum_{i=1}^{d} w_{ji} X_i\big)}$$

slide by Aarti Singh & Barnabás Póczos
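A sketch of the K-class form in Python/NumPy (illustrative names); class K serves as the reference class with no weights of its own:

```python
import numpy as np

def multiclass_lr_probs(x, W, w0):
    """P(Y=y_k | X=x) for k = 1..K, with weights only for classes 1..K-1.

    W: (K-1, d) weight matrix; w0: (K-1,) biases; class K is the reference.
    """
    scores = np.exp(W @ x + w0)          # exp(w_k0 + sum_i w_ki x_i), k < K
    Z = 1.0 + scores.sum()               # the "1" is exp(0) for class K
    return np.append(scores, 1.0) / Z    # probabilities sum to 1
```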

slide-32
SLIDE 32

Training Logistic Regression

We'll focus on binary classification:

$$P(Y=1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}{1+\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}$$

How to learn the parameters w0, w1, …, wd? Given training data $\{(X^l, Y^l)\}_{l=1}^{n}$, take Maximum Likelihood Estimates:

$$\hat W_{MLE} = \arg\max_W \prod_{l=1}^{n} P(X^l, Y^l \mid W)$$

But there is a problem… we don't have a model for P(X) or P(X|Y), only for P(Y|X).

slide by Aarti Singh & Barnabás Póczos


slide-34
SLIDE 34

Training Logistic Regression

How to learn the parameters w0, w1, …, wd? From the training data, take Maximum (Conditional) Likelihood Estimates:

$$\hat W_{MCLE} = \arg\max_W \prod_{l=1}^{n} P(Y^l \mid X^l, W)$$

Discriminative philosophy: don't waste effort learning P(X); focus on P(Y|X), that's all that matters for classification!

slide by Aarti Singh & Barnabás Póczos

slide-35
SLIDE 35

Expressing Conditional Log Likelihood

$$l(W) = \sum_l Y^l \ln P(Y^l = 1 \mid X^l, W) + (1 - Y^l)\ln P(Y^l = 0 \mid X^l, W)$$

where

$$P(Y=0 \mid X) = \frac{1}{1+\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}, \qquad P(Y=1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}{1+\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}$$

We can re-express the log of the conditional likelihood this way because Y can take only the values 0 or 1, so only one of the two terms in the expression is non-zero for any given Y^l.

slide by Aarti Singh & Barnabás Póczos
slide-37
SLIDE 37

Expressing Conditional Log Likelihood

$$\begin{aligned} l(W) &= \sum_l Y^l \ln P(Y^l=1 \mid X^l, W) + (1-Y^l)\ln P(Y^l=0 \mid X^l, W) \\ &= \sum_l Y^l \ln\frac{P(Y^l=1 \mid X^l, W)}{P(Y^l=0 \mid X^l, W)} + \ln P(Y^l=0 \mid X^l, W) \end{aligned}$$

using

$$P(Y=0 \mid X) = \frac{1}{1+\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}, \qquad P(Y=1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}{1+\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}$$

slide by Aarti Singh & Barnabás Póczos

slide-38
SLIDE 38

Expressing Conditional Log Likelihood

$$\begin{aligned} l(W) &= \sum_l Y^l \ln P(Y^l=1 \mid X^l, W) + (1-Y^l)\ln P(Y^l=0 \mid X^l, W) \\ &= \sum_l Y^l \ln\frac{P(Y^l=1 \mid X^l, W)}{P(Y^l=0 \mid X^l, W)} + \ln P(Y^l=0 \mid X^l, W) \\ &= \sum_l Y^l \Big(w_0 + \sum_{i=1}^{d} w_i X_i^l\Big) - \ln\Big(1+\exp\big(w_0 + \sum_{i=1}^{d} w_i X_i^l\big)\Big) \end{aligned}$$

using

$$P(Y=0 \mid X) = \frac{1}{1+\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}, \qquad P(Y=1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}{1+\exp(w_0 + \sum_{i=1}^{d} w_i X_i)}$$

slide by Aarti Singh & Barnabás Póczos
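The final line is easy to evaluate numerically. A minimal sketch in Python/NumPy (names illustrative) that uses np.logaddexp(0, z) as a numerically stable ln(1 + exp(z)):

```python
import numpy as np

def cond_log_likelihood(w0, w, X, Y):
    """l(W) = sum_l [ Y^l * z^l - ln(1 + exp(z^l)) ], with z^l = w0 + w . x^l."""
    z = w0 + X @ w                        # X: (n, d); Y: (n,) array of 0/1 labels
    return float(np.sum(Y * z - np.logaddexp(0.0, z)))
```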

slide-39
SLIDE 39

Maximizing Conditional Log Likelihood

Bad news: there is no closed-form solution maximizing l(W).
Good news: l(W) is a concave function of W, and concave functions are easy to optimize (unique maximum).

slide by Aarti Singh & Barnabás Póczos

slide-40
SLIDE 40

Optimizing concave/convex functions

  • Conditional likelihood for Logistic Regression is concave
  • Maximum of a concave function = minimum of a convex function

Gradient Ascent (concave) / Gradient Descent (convex)

Gradient: $\nabla_W l(W) = \Big[\frac{\partial l(W)}{\partial w_0}, \dots, \frac{\partial l(W)}{\partial w_d}\Big]$

Update rule (learning rate η > 0):

$$W^{(t+1)} = W^{(t)} + \eta\, \nabla_W l(W)\big|_{W^{(t)}}$$

slide by Aarti Singh & Barnabás Póczos

slide-41
SLIDE 41

Gradient Ascent for Logistic Regression

Gradient ascent algorithm: iterate until change < ε

$$w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_l \Big[ Y^l - \hat P(Y^l = 1 \mid X^l, W^{(t)}) \Big]$$

For i = 1, …, d, repeat:

$$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_l X_i^l \Big[ Y^l - \hat P(Y^l = 1 \mid X^l, W^{(t)}) \Big]$$

Here $\hat P(Y^l = 1 \mid X^l, W^{(t)})$ predicts what the current weights think the label Y should be.

  • Gradient ascent is the simplest of optimization approaches
    − e.g. Newton's method, conjugate gradient ascent, IRLS (see Bishop 4.3.3)

slide by Aarti Singh & Barnabás Póczos
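Putting the update rule into code, here is a minimal gradient-ascent sketch in Python/NumPy (names and stopping constants are illustrative). The gradient of l(W) is the sum over examples of the feature vector times the prediction error:

```python
import numpy as np

def train_lr(X, Y, eta=0.1, eps=1e-6, max_iter=100000):
    """Gradient ascent on the conditional log likelihood of logistic regression."""
    n, d = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])       # prepend 1 so w[0] plays the role of w0
    w = np.zeros(d + 1)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(Xa @ w)))    # what current W thinks P(Y=1|x) is
        w_new = w + eta * (Xa.T @ (Y - p))     # dl/dw_i = sum_l X_i^l (Y^l - p^l)
        if np.max(np.abs(w_new - w)) < eps:    # iterate until change < eps
            return w_new
        w = w_new
    return w
```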

slide-42
SLIDE 42

Effect of step-size η

Large η → fast convergence, but larger residual error and possible oscillations.
Small η → slow convergence, but small residual error.

slide by Aarti Singh & Barnabás Póczos


slide-44
SLIDE 44

Naïve Bayes vs. Logistic Regression

Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) ↔ Set of Logistic Regression parameters

  • Representation equivalence
    − But only in a special case!!! (GNB with class-independent variances)
  • But what's the difference???
  • LR makes no assumption about P(X|Y) in learning!!!
  • Loss function!!!
    − They optimize different functions and obtain different solutions

slide by Aarti Singh & Barnabás Póczos

slide-45
SLIDE 45

Naïve Bayes vs. Logistic Regression

Consider Y Boolean and Xi continuous, X = <X1 … Xd>.

Number of parameters:
  • NB: 4d + 1, namely π and, for y = 0, 1: (µ1,y, µ2,y, …, µd,y), (σ²1,y, σ²2,y, …, σ²d,y)
  • LR: d + 1, namely w0, w1, …, wd

Estimation method:
  • NB parameter estimates are uncoupled
  • LR parameter estimates are coupled

slide by Aarti Singh & Barnabás Póczos

slide-46
SLIDE 46

Generative vs. Discriminative

Given infinite data (asymptotically):
  • If the conditional independence assumption holds, discriminative and generative NB perform similarly.
  • If the conditional independence assumption does NOT hold, discriminative outperforms generative NB.

[Ng & Jordan, NIPS 2001]

slide by Aarti Singh & Barnabás Póczos

slide-47
SLIDE 47

Generative vs. Discriminative

Given finite data (n data points, d features):

Naïve Bayes (generative) requires n = O(log d) samples to converge to its asymptotic error, whereas Logistic Regression (discriminative) requires n = O(d).

Why? "Independent class conditional densities":
  • parameter estimates are not coupled; each parameter is learnt independently, not jointly, from the training data.

[Ng & Jordan, NIPS 2001]

slide by Aarti Singh & Barnabás Póczos

slide-48
SLIDE 48

Naïve Bayes vs. Logistic Regression

Verdict: both learn a linear decision boundary. Naïve Bayes makes more restrictive assumptions and has a higher asymptotic error, BUT converges faster to its (less accurate) asymptotic error.

slide by Aarti Singh & Barnabás Póczos

slide-49
SLIDE 49

Experimental Comparison (Ng-Jordan '01)

[Figure: classification error vs. training-set size m for Naïve Bayes and Logistic Regression on six UCI datasets: pima (continuous), adult (continuous), boston (predict if > median price, continuous), optdigits (0's and 1's, continuous), optdigits (2's and 3's, continuous), ionosphere (continuous)]

UCI Machine Learning Repository: 15 datasets, 8 with continuous features, 7 with discrete features. More in the paper…

slide by Aarti Singh & Barnabás Póczos
slide-50
SLIDE 50

What you should know

  • LR is a linear classifier
    − the decision rule is a hyperplane
  • LR is optimized by maximizing the conditional likelihood
    − no closed-form solution
    − concave → global optimum with gradient ascent
  • Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR
    − the solutions differ because of the objective (loss) function
  • In general, NB and LR make different assumptions
    − NB: features independent given class → assumption on P(X|Y)
    − LR: functional form of P(Y|X), no assumption on P(X|Y)
  • Convergence rates
    − GNB (usually) needs less data
    − LR (usually) gets to better solutions in the limit

slide by Aarti Singh & Barnabás Póczos

slide-51
SLIDE 51

Linear Discriminant Functions

51
slide-52
SLIDE 52

Linear Discriminant Function

  • Linear discriminant function for a vector x:

$$y(x) = w^T x + w_0$$

where w is called the weight vector, and w0 is a bias.

  • The classification function is

$$C(x) = \mathrm{sign}(w^T x + w_0)$$

where the step function sign(·) is defined as

$$\mathrm{sign}(a) = \begin{cases} +1, & a > 0 \\ -1, & a < 0 \end{cases}$$

slide by Ce Liu
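As a minimal sketch in Python/NumPy (names illustrative):

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """y(x) = w^T x + w0."""
    return np.dot(w, x) + w0

def classify(x, w, w0):
    """C(x) = sign(w^T x + w0), with sign(a) = +1 if a > 0 else -1."""
    return 1 if linear_discriminant(x, w, w0) > 0 else -1
```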
slide-53
SLIDE 53

Properties of Linear Discriminant Functions

  • y(x) = 0 for x on the decision surface. The normal distance from the origin to the decision surface is

$$\frac{w^T x}{\|w\|} = -\frac{w_0}{\|w\|}$$

  • So w0 determines the location of the decision surface.

[Figure: geometry of a linear discriminant in 2D: the decision surface y = 0 separates regions R1 (y > 0) and R2 (y < 0); w is normal to the surface; a point x decomposes into its projection x⊥ on the surface plus a displacement y(x)/‖w‖ along w; the surface lies at distance −w0/‖w‖ from the origin]

  • The decision surface, shown in red, is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w0.
  • The signed orthogonal distance of a general point x from the decision surface is given by y(x)/‖w‖.
  • y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface.

slide by Ce Liu
slide-54
SLIDE 54

Properties of Linear Discriminant Functions

  • Let

$$x = x_\perp + r\,\frac{w}{\|w\|}$$

where x⊥ is the projection of x on the decision surface. Then

$$w^T x = w^T x_\perp + r\,\frac{w^T w}{\|w\|}$$

$$w^T x + w_0 = w^T x_\perp + w_0 + r\|w\|$$

and since $y(x_\perp) = 0$,

$$y(x) = r\|w\| \quad\Rightarrow\quad r = \frac{y(x)}{\|w\|}$$

  • Simpler notation: define $\tilde w = (w_0, w)$ and $\tilde x = (1, x)$ so that $\tilde y(x) = \tilde w^T \tilde x$.

slide by Ce Liu
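Both identities are one-liners. A small sketch in Python/NumPy (names illustrative) of the signed distance r = y(x)/‖w‖ and the augmented-vector notation:

```python
import numpy as np

def signed_distance(x, w, w0):
    """r = y(x) / ||w||: signed orthogonal distance of x from the decision surface."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

def y_augmented(x, w, w0):
    """y(x) via the augmented vectors w~ = (w0, w) and x~ = (1, x)."""
    w_tilde = np.concatenate(([w0], w))
    x_tilde = np.concatenate(([1.0], x))
    return np.dot(w_tilde, x_tilde)
```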
slide-55
SLIDE 55

Multiple Classes: Simple Extension

[Figure: ambiguous regions ("?") left by combining binary classifiers: one-versus-the-rest (C1 / not C1, C2 / not C2) and one-versus-one (C1 vs C2, C1 vs C3, C2 vs C3)]

  • One-versus-the-rest classifier: classify Ck vs. samples not in Ck.
  • One-versus-one classifier: classify every pair of classes.
slide by Ce Liu
slide-56
SLIDE 56

Multiple Classes: K-Class Discriminant

  • A single K-class discriminant comprising K linear functions:

$$y_k(x) = w_k^T x + w_{k0}$$

  • Decision function:

$$C(x) = k, \quad \text{if } y_k(x) > y_j(x)\ \ \forall j \neq k$$

  • The decision boundary between classes Ck and Cj is given by yk(x) = yj(x), i.e. the hyperplane

$$(w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0$$
slide by Ce Liu
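A sketch of the K-class decision function in Python/NumPy (names illustrative):

```python
import numpy as np

def k_class_classify(x, W, w0):
    """C(x) = argmax_k y_k(x), where y_k(x) = w_k^T x + w_k0.

    W: (K, d) matrix whose rows are the w_k; w0: (K,) biases.
    """
    return int(np.argmax(W @ x + w0))
```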
slide-57
SLIDE 57

Property of the Decision Regions

Theorem. The decision regions of the K-class discriminant $y_k(x) = w_k^T x + w_{k0}$ are singly connected and convex.

Proof. Suppose two points xA and xB both lie inside decision region Rk. Any point x̂ on the line between xA and xB can be expressed as

$$\hat x = \lambda x_A + (1-\lambda) x_B$$

So

$$y_k(\hat x) = \lambda y_k(x_A) + (1-\lambda) y_k(x_B) > \lambda y_j(x_A) + (1-\lambda) y_j(x_B) = y_j(\hat x) \quad (\forall j \neq k)$$

Therefore the region Rk is singly connected and convex. ∎

slide by Ce Liu
slide-59
SLIDE 59

Property of the Decision Regions

[Figure: decision regions Ri, Rj, Rk, with points xA and xB inside Rk and x̂ on the line between them]

Theorem. The decision regions of the K-class discriminant $y_k(x) = w_k^T x + w_{k0}$ are singly connected and convex.

If two points xA and xB both lie inside the same decision region Rk, then any point x̂ that lies on the line connecting these two points must also lie in Rk, and hence the decision region must be singly connected and convex.
slide by Ce Liu
slide-60
SLIDE 60

Fisher's Linear Discriminant

A way to view a linear classification model is in terms of dimensionality reduction: project x onto one dimension via

$$y = w^T x$$

  • Pursue the optimal linear projection on which the two classes can be maximally separated.
  • The mean vectors of the two classes:

$$m_1 = \frac{1}{N_1}\sum_{n\in C_1} x_n, \qquad m_2 = \frac{1}{N_2}\sum_{n\in C_2} x_n$$

[Figure: left panel "Difference of means": samples projected onto the line joining the class means; right panel "Fisher's Linear Discriminant": the Fisher projection, which separates the classes better]

slide by Ce Liu
slide-61
SLIDE 61

What's a Good Projection?

  • After projection, the two classes should be separated as much as possible, measured by the distance between the projected centers:

$$\big(w^T(m_1-m_2)\big)^2 = w^T(m_1-m_2)(m_1-m_2)^T w = w^T S_B w$$

where $S_B = (m_1-m_2)(m_1-m_2)^T$ is called the between-class covariance matrix.

  • After projection, the variances of the two classes should be as small as possible, measured by the within-class covariance:

$$w^T S_W w$$

where

$$S_W = \sum_{n\in C_1}(x_n-m_1)(x_n-m_1)^T + \sum_{n\in C_2}(x_n-m_2)(x_n-m_2)^T$$

slide by Ce Liu
slide-62
SLIDE 62

Fisher's Linear Discriminant

  • Fisher criterion: maximize the ratio w.r.t. w

$$J(w) = \frac{\text{between-class variance}}{\text{within-class variance}} = \frac{w^T S_B w}{w^T S_W w}$$

  • Recall the quotient rule: for $f(x) = \frac{g(x)}{h(x)}$,

$$f'(x) = \frac{g'(x)h(x) - g(x)h'(x)}{h^2(x)}$$

  • Setting ∇J(w) = 0, we obtain

$$(w^T S_B w)\, S_W w = (w^T S_W w)\, S_B w$$

$$(w^T S_B w)\, S_W w = (w^T S_W w)\,(m_2-m_1)\big((m_2-m_1)^T w\big)$$

  • The terms $w^T S_B w$, $w^T S_W w$ and $(m_2-m_1)^T w$ are scalars, and we only care about directions, so the scalars are dropped. Therefore

$$w \propto S_W^{-1}(m_2 - m_1)$$

slide by Ce Liu
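The closed-form direction is a one-line solve. A minimal sketch in Python/NumPy (names illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    """w proportional to S_W^{-1} (m2 - m1).

    X1, X2: (N_k, d) arrays of samples from class 1 and class 2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)    # solve S_W w = m2 - m1
    return w / np.linalg.norm(w)         # only the direction matters
```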
slide-63
SLIDE 63

From Fisher's Linear Discriminant to Classifiers

  • Fisher's Linear Discriminant is not a classifier; it only decides on an optimal projection to convert a high-dimensional classification problem to 1D.
  • A bias (threshold) is needed to form a linear classifier (multiple thresholds lead to nonlinear classifiers). The final classifier has the form

$$y(x) = \mathrm{sign}(w^T x + w_0)$$

where the nonlinear activation function sign(·) is a step function:

$$\mathrm{sign}(a) = \begin{cases} +1, & a > 0 \\ -1, & a < 0 \end{cases}$$

  • How to decide the bias w0?

slide by Ce Liu