Lecture 9: Logistic Regression, Discriminative vs. Generative Classification, Linear Discriminant Functions


SLIDE 1

Lecture 9:

− Logistic Regression
− Discriminative vs. Generative Classification
− Linear Discriminant Functions

Aykut Erdem

October 2017, Hacettepe University

SLIDE 2

Administrative

  • Midterm exam will be held on November 6.
  • Project proposals are due today!
  • No lecture next Thursday, but we will talk about your proposals.
  • Assignment 3 will be out next week.
  • Make-up lecture will be on November 11.
    − We will check the availability of the classrooms.

2

SLIDE 3

Last time… Naïve Bayes Classifier

3

Given:

– Class prior P(Y)
– d conditionally independent features X1, …, Xd given the class label Y
– For each feature Xi, the conditional likelihood P(Xi|Y)

Naïve Bayes decision rule:

  y* = arg max_y P(Y = y) ∏_{i=1}^{d} P(Xi = xi | Y = y)

slide by Barnabás Póczos & Aarti Singh

SLIDE 4

Last time… Naïve Bayes Algorithm for discrete features

NB prediction for test data:

  y* = arg max_y P(Y = y) ∏_{i=1}^{d} P(Xi = xi | Y = y)

We need to estimate these probabilities! Estimators:

– For the class prior: P̂(Y = y) = (# examples with Y = y) / N
– For the likelihood: P̂(Xi = x | Y = y) = (# examples with Xi = x and Y = y) / (# examples with Y = y)

4

slide by Barnabás Póczos & Aarti Singh
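To make these estimators concrete, here is a minimal counting-based sketch of mine (not from the slides) of Naïve Bayes for discrete features; `fit_nb` and `predict_nb` are illustrative names, and no smoothing is applied:

```python
from collections import Counter, defaultdict

def fit_nb(X, y):
    """Estimate the class prior and per-feature likelihoods by counting.
    X: list of tuples of discrete feature values, y: list of class labels."""
    n = len(y)
    class_n = Counter(y)
    prior = {c: cnt / n for c, cnt in class_n.items()}
    # counts[c][i][v] = number of class-c examples whose i-th feature equals v
    counts = defaultdict(lambda: defaultdict(Counter))
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[c][i][v] += 1
    likelihood = {c: {i: {v: cnt / class_n[c] for v, cnt in vs.items()}
                      for i, vs in feats.items()}
                  for c, feats in counts.items()}
    return prior, likelihood

def predict_nb(x, prior, likelihood):
    """NB decision rule: arg max_y P(y) * prod_i P(x_i | y)."""
    def score(c):
        p = prior[c]
        for i, v in enumerate(x):
            p *= likelihood[c][i].get(v, 0.0)  # unseen value -> probability 0
        return p
    return max(prior, key=score)
```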

SLIDE 5

Last time… Text Classification

[Figure: a MEDLINE article is assigned a category from the MeSH Subject Category Hierarchy, e.g.:]

  • Antagonists and Inhibitors
  • Blood Supply
  • Chemistry
  • Drug Therapy
  • Embryology
  • Epidemiology

How to represent a text document?

5

slide by Dan Jurafsky

SLIDE 6

Last time… Bag of words model

Typical additional assumption: position in the document doesn't matter:

  P(Xi = xi | Y = y) = P(Xk = xi | Y = y)

– "Bag of words" model: the order of words on the page is ignored; the document is just a bag of i.i.d. words.
– Sounds really silly, but often works very well!

The probability of a document with words x1, x2, … then factorizes over word positions, leaving K(50000 − 1) parameters to estimate for K classes and a 50,000-word vocabulary.

6

slide by Barnabás Póczos & Aarti Singh

SLIDE 7

Doc | Words | Class
Training:
  1 | Chinese Beijing Chinese | c
  2 | Chinese Chinese Shanghai | c
  3 | Chinese Macao | c
  4 | Tokyo Japan Chinese | j
Test:
  5 | Chinese Chinese Chinese Tokyo Japan | ?

Estimators with add-1 smoothing:

  P̂(c) = Nc / N
  P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|)

Priors: P(c) = 3/4, P(j) = 1/4

Conditional probabilities:

  P(Chinese | c) = (5+1)/(8+6) = 6/14 = 3/7
  P(Tokyo | c)   = (0+1)/(8+6) = 1/14
  P(Japan | c)   = (0+1)/(8+6) = 1/14
  P(Chinese | j) = (1+1)/(3+6) = 2/9
  P(Tokyo | j)   = (1+1)/(3+6) = 2/9
  P(Japan | j)   = (1+1)/(3+6) = 2/9

Choosing a class:

  P(c | d5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
  P(j | d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001

7

slide by Dan Jurafsky
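As a sanity check, the numbers on this slide can be reproduced with exact fractions (a sketch of mine, not slide code; the `laplace` helper is an illustrative name):

```python
from fractions import Fraction as F

def laplace(count_wc, count_c, vocab_size):
    """Add-1 smoothed estimate of P(w | c)."""
    return F(count_wc + 1, count_c + vocab_size)

V = 6  # vocabulary: Chinese, Beijing, Shanghai, Macao, Tokyo, Japan
p_chinese_c = laplace(5, 8, V)   # 6/14 = 3/7
p_tokyo_c   = laplace(0, 8, V)   # 1/14
p_japan_c   = laplace(0, 8, V)   # 1/14
p_chinese_j = laplace(1, 3, V)   # 2/9
p_tokyo_j   = laplace(1, 3, V)   # 2/9
p_japan_j   = laplace(1, 3, V)   # 2/9

score_c = F(3, 4) * p_chinese_c**3 * p_tokyo_c * p_japan_c
score_j = F(1, 4) * p_chinese_j**3 * p_tokyo_j * p_japan_j
print(float(score_c), float(score_j))  # ~0.0003 vs ~0.0001 -> choose class c
```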

SLIDE 8

Last time… What if features are continuous?

  • e.g., character recognition: Xi is intensity at the ith pixel
  • Gaussian Naïve Bayes (GNB):
      P(Xi = x | Y = yk) = (1 / (σik √(2π))) exp(−(x − μik)² / (2σik²))
    with a different mean and variance for each class k and each pixel i.
  • Sometimes we assume the variance is
    – independent of Y (i.e., σi),
    – or independent of Xi (i.e., σk),
    – or both (i.e., σ)

8

slide by Barnabás Póczos & Aarti Singh
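Estimating these parameters is just per-class, per-feature means and variances; a minimal numpy sketch of mine (not from the slides), with `fit_gnb` and `gnb_log_score` as illustrative names:

```python
import numpy as np

def fit_gnb(X, y):
    """Per-class, per-feature Gaussian estimates for GNB.
    X: (n, d) array of continuous features, y: (n,) array of class labels."""
    classes = np.unique(y)
    prior = {k: np.mean(y == k) for k in classes}
    mu    = {k: X[y == k].mean(axis=0) for k in classes}  # mu_{ik}
    var   = {k: X[y == k].var(axis=0)  for k in classes}  # sigma_{ik}^2
    return prior, mu, var

def gnb_log_score(x, prior, mu, var):
    """log P(y_k) + sum_i log N(x_i; mu_ik, sigma_ik^2) for each class k."""
    return {k: np.log(prior[k])
               - 0.5 * np.sum(np.log(2 * np.pi * var[k]) + (x - mu[k])**2 / var[k])
            for k in prior}
```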

SLIDE 9

Logistic Regression

9

SLIDE 10

Recap: Naïve Bayes

  • NB Assumption:
      P(X1, …, Xd | Y) = ∏_{i=1}^{d} P(Xi | Y)
  • NB Classifier:
      y* = arg max_y P(y) ∏_{i=1}^{d} P(xi | y)
  • Assume a parametric form for P(Xi|Y) and P(Y)
  • Estimate the parameters using MLE/MAP and plug in

10

slide by Aarti Singh & Barnabás Póczos

SLIDE 11

Gaussian Naïve Bayes (GNB)

  • There are several distributions that can lead to a linear boundary.
  • As an example, consider Gaussian Naïve Bayes with Gaussian class conditional densities:
      P(Xi = x | Y = yk) = (1 / (σik √(2π))) exp(−(x − μik)² / (2σik²))
  • What if we assume the variance is independent of the class, i.e. σik = σi?

11

slide by Aarti Singh & Barnabás Póczos

SLIDE 12

GNB with equal variance is a Linear Classifier!

Decision boundary:

  P(Y = 0) ∏_{i=1}^{d} P(Xi | Y = 0) = P(Y = 1) ∏_{i=1}^{d} P(Xi | Y = 1)

12

slide by Aarti Singh & Barnabás Póczos

SLIDE 13

GNB with equal variance is a Linear Classifier!

Decision boundary:

  P(Y = 0) ∏_{i=1}^{d} P(Xi | Y = 0) = P(Y = 1) ∏_{i=1}^{d} P(Xi | Y = 1)

Equivalently, taking logs (with π = P(Y = 1)):

  log [ P(Y = 0) ∏_{i=1}^{d} P(Xi | Y = 0) / ( P(Y = 1) ∏_{i=1}^{d} P(Xi | Y = 1) ) ]
    = log((1 − π)/π) + Σ_{i=1}^{d} log [ P(Xi | Y = 0) / P(Xi | Y = 1) ]

13

slide by Aarti Singh & Barnabás Póczos

SLIDE 14

GNB with equal variance is a Linear Classifier!

Decision boundary:

  P(Y = 0) ∏_{i=1}^{d} P(Xi | Y = 0) = P(Y = 1) ∏_{i=1}^{d} P(Xi | Y = 1)

Taking logs (with π = P(Y = 1)):

  log [ P(Y = 0) ∏_{i=1}^{d} P(Xi | Y = 0) / ( P(Y = 1) ∏_{i=1}^{d} P(Xi | Y = 1) ) ]
    = log((1 − π)/π) + Σ_{i=1}^{d} log [ P(Xi | Y = 0) / P(Xi | Y = 1) ]

where log((1 − π)/π) is a constant term and, for equal-variance Gaussians, each log-ratio contributes a first-order (linear) term in Xi.

14

slide by Aarti Singh & Barnabás Póczos
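Spelling out why the right-hand side splits into a constant and first-order terms (a standard derivation, reconstructed by me rather than taken verbatim from the slide): with π = P(Y = 1) and shared variances σi², plugging the Gaussians into the log-ratio gives

```latex
\[
\log\frac{P(Y=0)\prod_i P(X_i\mid Y=0)}{P(Y=1)\prod_i P(X_i\mid Y=1)}
= \underbrace{\log\frac{1-\pi}{\pi}
  + \sum_{i=1}^{d}\frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}}_{w_0}
+ \sum_{i=1}^{d}\underbrace{\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}}_{w_i} X_i ,
\]
% so the decision boundary w_0 + sum_i w_i X_i = 0 is a hyperplane: a linear classifier.
```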

SLIDE 15

Gaussian Naïve Bayes (GNB) Decision Boundary

  X = (x1, x2)
  P1 = P(Y = 0), P2 = P(Y = 1)
  p1(X) = p(X | Y = 0) ∼ N(M1, Σ1)
  p2(X) = p(X | Y = 1) ∼ N(M2, Σ2)

15

slide by Aarti Singh & Barnabás Póczos

SLIDE 16

Generative vs. Discriminative Classifiers

  • Generative classifiers (e.g. Naïve Bayes)
    − Assume some functional form for P(X,Y) (or P(X|Y) and P(Y))
    − Estimate parameters of P(X|Y), P(Y) directly from training data
    − But arg max_Y P(X|Y) P(Y) = arg max_Y P(Y|X)
    − Why not learn P(Y|X) directly? Or better yet, why not learn the decision boundary directly?
  • Discriminative classifiers (e.g. Logistic Regression)
    − Assume some functional form for P(Y|X) or for the decision boundary
    − Estimate parameters of P(Y|X) directly from training data

16

slide by Aarti Singh & Barnabás Póczos

SLIDE 17

Logistic Regression

17

Assumes the following functional form for P(Y|X): the logistic function applied to a linear function of the data,

  P(Y = 1 | X) = exp(w0 + Σ_{i=1}^{d} wi Xi) / (1 + exp(w0 + Σ_{i=1}^{d} wi Xi))

Logistic function (or Sigmoid): σ(z) = 1 / (1 + exp(−z))

[Figure: the sigmoid curve logit(z) plotted against z]

Features can be discrete or continuous!

slide by Aarti Singh & Barnabás Póczos
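Numerically, the sigmoid and the resulting conditional probability look like this (a sketch of mine, not slide code; `p_y1_given_x` is an illustrative name):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w, w0):
    """P(Y=1|X) under logistic regression: sigmoid of a linear function of x."""
    return sigmoid(w0 + np.dot(w, x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # -> [0.0067, 0.5, 0.9933]
```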

SLIDE 18

Logistic Regression is a Linear Classifier!

18

Assumes the following functional form for P(Y|X). Decision boundary: predict Y = 1 when

  P(Y = 1 | X) / P(Y = 0 | X) = exp(w0 + Σ_{i=1}^{d} wi Xi) ≥ 1, i.e. when w0 + Σ_{i=1}^{d} wi Xi ≥ 0

(Linear Decision Boundary)

slide by Aarti Singh & Barnabás Póczos

SLIDE 19

19


slide by Aarti Singh & Barnabás Póczos

Logistic Regression is a Linear Classifier!

Assumes the following functional form for P(Y∣X):

SLIDE 20

Logistic Regression for more than 2 classes

20

Logistic regression in the more general case, where Y ∈ {y1, …, yK}:

for k < K:
  P(Y = yk | X) = exp(wk0 + Σ_{i=1}^{d} wki Xi) / (1 + Σ_{j=1}^{K−1} exp(wj0 + Σ_{i=1}^{d} wji Xi))

for k = K (normalization, so no weights for this class):
  P(Y = yK | X) = 1 / (1 + Σ_{j=1}^{K−1} exp(wj0 + Σ_{i=1}^{d} wji Xi))

slide by Aarti Singh & Barnabás Póczos
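A sketch of mine (not slide code) of these class probabilities, treating class K as the reference that carries no weights:

```python
import numpy as np

def multiclass_lr_probs(x, W, b):
    """P(Y = y_k | x) for k = 1..K, where only classes 1..K-1 carry weights.
    W: (K-1, d) weight matrix, b: (K-1,) biases, x: (d,) feature vector."""
    scores = np.exp(b + W @ x)              # exp(w_k0 + sum_i w_ki x_i), k < K
    denom = 1.0 + scores.sum()              # class K contributes exp(0) = 1
    return np.append(scores, 1.0) / denom   # probabilities sum to 1
```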

SLIDE 21

Training Logistic Regression

21

We'll focus on binary classification.

How to learn the parameters w0, w1, …, wd?
Training Data → Maximum Likelihood Estimates

But there is a problem… we don't have a model for P(X) or P(X|Y), only for P(Y|X).

slide by Aarti Singh & Barnabás Póczos

SLIDE 22

Training Logistic Regression

22

We'll focus on binary classification.

How to learn the parameters w0, w1, …, wd?
Training Data → Maximum Likelihood Estimates

But there is a problem… we don't have a model for P(X) or P(X|Y), only for P(Y|X).

slide by Aarti Singh & Barnabás Póczos

SLIDE 23

Training Logistic Regression

23

How to learn the parameters w0, w1, …, wd?

Training Data → Maximum (Conditional) Likelihood Estimates

slide by Aarti Singh & Barnabás Póczos

Discriminative philosophy — Don’t waste effort learning P(X),
 focus on P(Y|X) — that’s all that matters for classification!

SLIDE 24

Expressing Conditional log Likelihood

24

l(W) = Σ_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ]

where

  P(Y = 0 | X) = 1 / (1 + exp(w0 + Σ_{i=1}^{n} wi Xi))
  P(Y = 1 | X) = exp(w0 + Σ_{i=1}^{n} wi Xi) / (1 + exp(w0 + Σ_{i=1}^{n} wi Xi))

We can re-express the log of the conditional likelihood. Y can take only values 0 or 1, so only one of the two terms in the expression will be non-zero for any given Y^l.

slide by Aarti Singh & Barnabás Póczos

SLIDE 25

Expressing Conditional log Likelihood

25

l(W) = Σ_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ]

slide by Aarti Singh & Barnabás Póczos

SLIDE 26

Expressing Conditional log Likelihood

26

l(W) = Σ_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ]
     = Σ_l [ Y^l ln ( P(Y^l = 1 | X^l, W) / P(Y^l = 0 | X^l, W) ) + ln P(Y^l = 0 | X^l, W) ]

using

  P(Y = 0 | X) = 1 / (1 + exp(w0 + Σ_{i=1}^{n} wi Xi))
  P(Y = 1 | X) = exp(w0 + Σ_{i=1}^{n} wi Xi) / (1 + exp(w0 + Σ_{i=1}^{n} wi Xi))

to re-express the log of the conditional likelihood.

slide by Aarti Singh & Barnabás Póczos

SLIDE 27

Expressing Conditional log Likelihood

27

l(W) = Σ_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ]
     = Σ_l [ Y^l ln ( P(Y^l = 1 | X^l, W) / P(Y^l = 0 | X^l, W) ) + ln P(Y^l = 0 | X^l, W) ]
     = Σ_l [ Y^l (w0 + Σ_{i=1}^{n} wi X^l_i) − ln (1 + exp(w0 + Σ_{i=1}^{n} wi X^l_i)) ]

using

  P(Y = 0 | X) = 1 / (1 + exp(w0 + Σ_{i=1}^{n} wi Xi))
  P(Y = 1 | X) = exp(w0 + Σ_{i=1}^{n} wi Xi) / (1 + exp(w0 + Σ_{i=1}^{n} wi Xi))

to re-express the log of the conditional likelihood.

slide by Aarti Singh & Barnabás Póczos
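The final form translates directly into numpy (a sketch of mine, not slide code; np.logaddexp(0, z) computes ln(1 + e^z) in a numerically stable way):

```python
import numpy as np

def conditional_log_likelihood(w0, w, X, y):
    """l(W) = sum_l [ y_l * z_l - ln(1 + exp(z_l)) ], with z_l = w0 + w . x_l.
    X: (n_samples, d) feature array, y: (n_samples,) array of 0/1 labels."""
    z = w0 + X @ w
    return np.sum(y * z - np.logaddexp(0.0, z))
```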

SLIDE 28

Maximizing Conditional log Likelihood

28

Bad news: no closed-form solution to maximize l(W).
Good news: l(W) is a concave function of W! Concave functions are easy to optimize (unique maximum).

slide by Aarti Singh & Barnabás Póczos

SLIDE 29

Optimizing concave/convex functions

29


slide by Aarti Singh & Barnabás Póczos

  • Conditional likelihood for Logistic Regression is concave
  • Maximum of a concave function = minimum of a convex function

Gradient Ascent (concave)/ Gradient Descent (convex)

Gradient: ∇_w l(w). Learning rate: η > 0. Update rule: w ← w + η ∇_w l(w)

SLIDE 30

Gradient Ascent for Logistic Regression

30

Gradient ascent algorithm: iterate until change < ε
  For i = 1, …, d, repeat:
    w0 ← w0 + η Σ_l (Y^l − P̂(Y^l = 1 | X^l, w))
    wi ← wi + η Σ_l X^l_i (Y^l − P̂(Y^l = 1 | X^l, w))

(predict what the current weights think label Y should be)

  • Gradient ascent is the simplest of optimization approaches
    − e.g. Newton's method, conjugate gradient ascent, IRLS (see Bishop 4.3.3)

slide by Aarti Singh & Barnabás Póczos
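A runnable sketch of mine of this loop, vectorized over the weights rather than updating one coordinate at a time; `fit_logistic_gd` and its default arguments are illustrative:

```python
import numpy as np

def fit_logistic_gd(X, y, eta=0.1, tol=1e-6, max_iter=10000):
    """Gradient ascent on the conditional log likelihood of logistic regression.
    X: (n, d) features, y: (n,) 0/1 labels. Returns (w0, w)."""
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # P(Y=1 | x, w)
        err = y - p                                # residual: Y^l - P(Y=1 | X^l)
        step0, step = eta * err.sum(), eta * (X.T @ err)
        w0 += step0
        w += step
        if max(abs(step0), np.abs(step).max()) < tol:  # change < epsilon
            break
    return w0, w
```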

SLIDE 31

Effect of step-size η

31

slide by Aarti Singh & Barnabás Póczos

Large η → fast convergence, but larger residual error; oscillations are also possible.
Small η → slow convergence, but small residual error.

SLIDE 32

32

Naïve Bayes vs. Logistic Regression

(The set of Gaussian Naïve Bayes parameters, with feature variance independent of the class label, maps onto a set of Logistic Regression parameters.)

  • Representation equivalence
    − But only in a special case!!! (GNB with class-independent variances)
  • But what's the difference???

slide by Aarti Singh & Barnabás Póczos

SLIDE 33

33

Naïve Bayes vs. Logistic Regression

(The set of Gaussian Naïve Bayes parameters, with feature variance independent of the class label, maps onto a set of Logistic Regression parameters.)

  • Representation equivalence
    − But only in a special case!!! (GNB with class-independent variances)
  • But what's the difference???
  • LR makes no assumption about P(X|Y) in learning!!!
  • Loss function!!!
    − They optimize different functions and obtain different solutions.

slide by Aarti Singh & Barnabás Póczos

SLIDE 34

Naïve Bayes vs. Logistic Regression

34

slide by Aarti Singh & Barnabás Póczos

Consider Y Boolean and Xi continuous, X = <X1, …, Xd>.

Number of parameters:

  • NB: 4d + 1 — π, (µ1,y, µ2,y, …, µd,y) and (σ²1,y, σ²2,y, …, σ²d,y) for y = 0, 1
  • LR: d + 1 — w0, w1, …, wd

Estimation method:

  • NB parameter estimates are uncoupled
  • LR parameter estimates are coupled

SLIDE 35

Generative vs. Discriminative

Given infinite data (asymptotically):

  • If the conditional independence assumption holds, discriminative and generative NB perform similarly.
  • If the conditional independence assumption does NOT hold, discriminative outperforms generative NB.

35

slide by Aarti Singh & Barnabás Póczos

[Ng & Jordan, NIPS 2001]

SLIDE 36

Generative vs. Discriminative

36

slide by Aarti Singh & Barnabás Póczos

[Ng & Jordan, NIPS 2001]

Given finite data (n data points, d features),

Naïve Bayes (generative) requires n = O(log d) to converge to its asymptotic error, whereas Logistic regression (discriminative) requires n = O(d).

Why? “Independent class conditional densities”:

  • Parameter estimates are not coupled: each parameter is learnt independently, not jointly, from the training data.

SLIDE 37

37

slide by Aarti Singh & Barnabás Póczos

Naïve Bayes vs. Logistic Regression: Verdict

Both learn a linear decision boundary. Naïve Bayes makes more restrictive assumptions and has a higher asymptotic error, BUT it converges faster to its (less accurate) asymptotic error.

SLIDE 38

Experimental Comparison (Ng-Jordan’01)

38

[Figure: test error vs. training-set size m for Naïve Bayes and Logistic Regression on six UCI datasets — pima (continuous), adult (continuous), boston (predict if > median price, continuous), optdigits (0's and 1's, continuous), optdigits (2's and 3's, continuous), ionosphere (continuous)]

UCI Machine Learning Repository: 15 datasets, 8 with continuous features, 7 with discrete features. More in the paper…

slide by Aarti Singh & Barnabás Póczos

SLIDE 39

What you should know

39

slide by Aarti Singh & Barnabás Póczos

  • LR is a linear classifier
    − decision rule is a hyperplane
  • LR is optimized by maximizing the conditional likelihood
    − no closed-form solution
    − concave → global optimum with gradient ascent
  • Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR
    − solutions differ because of the objective (loss) function
  • In general, NB and LR make different assumptions
    − NB: features independent given class → assumption on P(X|Y)
    − LR: functional form of P(Y|X), no assumption on P(X|Y)
  • Convergence rates
    − GNB (usually) needs less data
    − LR (usually) gets to better solutions in the limit

SLIDE 40

Linear Discriminant Functions

40

SLIDE 41

Linear Discriminant Function

  • Linear discriminant function for a vector x:
      y(x) = wᵀx + w0
    where w is called the weight vector, and w0 is a bias.
  • The classification function is
      C(x) = sign(wᵀx + w0)
    where the step function sign(·) is defined as
      sign(a) = +1 if a > 0, −1 if a < 0

41

slide by Ce Liu

SLIDE 42

Properties of Linear Discriminant Functions

  • y(x) = 0 for x on the decision surface. The normal distance from the origin to the decision surface is
      wᵀx / ‖w‖ = −w0 / ‖w‖
  • So w0 determines the location of the decision surface.

[Figure (cf. Bishop, Fig. 4.1): geometry of y(x) = wᵀx + w0 in two dimensions, showing w, the regions R1 (y > 0) and R2 (y < 0), the surface y = 0 at distance −w0/‖w‖ from the origin, and the distance y(x)/‖w‖ from a point x to the surface]

  • The decision surface, shown in red, is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w0.
  • The signed orthogonal distance of a general point x from the decision surface is given by y(x)/‖w‖.
  • y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface.

42

slide by Ce Liu
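A small numeric check of these distance formulas (my sketch, with an arbitrary example w and w0):

```python
import numpy as np

w = np.array([3.0, 4.0])   # weight vector, ||w|| = 5
w0 = -10.0                 # bias

def y(x):
    """Linear discriminant y(x) = w^T x + w0."""
    return w @ x + w0

norm_w = np.linalg.norm(w)
print(-w0 / norm_w)        # distance from the origin to the surface: 2.0
x = np.array([2.0, 3.0])
print(y(x) / norm_w)       # signed distance of x from the surface: 1.6
```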

SLIDE 43

[Figure (cf. Bishop, Fig. 4.1): same geometry as on the previous slide]

Properties of Linear Discriminant Functions

  • Write
      x = x⊥ + r · w/‖w‖
    where x⊥ is the projection of x onto the decision surface. Then
      wᵀx = wᵀx⊥ + r wᵀw/‖w‖
      wᵀx + w0 = wᵀx⊥ + w0 + r‖w‖
      y(x) = r‖w‖, i.e. r = y(x)/‖w‖
  • Simpler notation: define w̃ = (w0, w) and x̃ = (1, x), so that ỹ(x) = w̃ᵀx̃.

43

slide by Ce Liu

SLIDE 44

Multiple Classes: Simple Extension

44

[Figure: ambiguous regions (marked “?”) when combining binary classifiers – left: one-versus-the-rest (C1 / not C1, C2 / not C2); right: one-versus-one (C1 vs. C2, C1 vs. C3, C2 vs. C3)]

  • One-versus-the-rest classifier: classify Ck vs. the samples not in Ck.
  • One-versus-one classifier: classify every pair of classes.

slide by Ce Liu

SLIDE 45

Multiple Classes: K-Class Discriminant

  • A single K-class discriminant comprising K linear functions:
      yk(x) = wkᵀx + wk0
  • Decision function:
      C(x) = k  if  yk(x) > yj(x)  ∀ j ≠ k
  • The decision boundary between classes Ck and Cj is given by yk(x) = yj(x), i.e. the hyperplane
      (wk − wj)ᵀx + (wk0 − wj0) = 0

45

slide by Ce Liu
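In code, the decision function is a one-liner (my sketch; `k_class_decision` is an illustrative name):

```python
import numpy as np

def k_class_decision(x, W, w0):
    """C(x) = arg max_k y_k(x), with y_k(x) = w_k^T x + w_k0.
    W: (K, d) stacked weight vectors, w0: (K,) biases, x: (d,) vector."""
    return int(np.argmax(W @ x + w0))
```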

SLIDE 46

Property of the Decision Regions

46

Theorem.
The decision regions of the K-class discriminant yk(x) = wkᵀx + wk0 are singly connected and convex.

Proof.
Suppose two points xA and xB both lie inside decision region Rk. Any point x̂ on the line between xA and xB can be expressed as
  x̂ = λxA + (1 − λ)xB
So
  yk(x̂) = λyk(xA) + (1 − λ)yk(xB) > λyj(xA) + (1 − λ)yj(xB) = yj(x̂)   (∀ j ≠ k)
Therefore, the region Rk is singly connected and convex. ∎

slide by Ce Liu

SLIDE 47

Property of the Decision Regions

47

Theorem.
The decision regions of the K-class discriminant yk(x) = wkᵀx + wk0 are singly connected and convex.

Proof.
Suppose two points xA and xB both lie inside decision region Rk. Any point x̂ on the line between xA and xB can be expressed as
  x̂ = λxA + (1 − λ)xB
So
  yk(x̂) = λyk(xA) + (1 − λ)yk(xB) > λyj(xA) + (1 − λ)yj(xB) = yj(x̂)   (∀ j ≠ k)
Therefore, the region Rk is singly connected and convex. ∎

slide by Ce Liu

SLIDE 48

Property of the Decision Regions

48

[Figure (cf. Bishop, Fig. 4.3): decision regions Ri, Rj, Rk, with points xA and xB in Rk and x̂ on the line between them]

Theorem.
The decision regions of the K-class discriminant yk(x) = wkᵀx + wk0 are singly connected and convex.

If two points xA and xB both lie inside the same decision region Rk, then any point x̂ that lies on the line connecting these two points must also lie in Rk, and hence the decision region must be singly connected and convex.

slide by Ce Liu

SLIDE 49

Fisher’s Linear Discriminant

A way to view a linear classification model is in terms of dimensionality reduction: project the data onto one dimension via

  y = wᵀx

  • Pursue the optimal linear projection on which the two classes can be maximally separated.
  • The mean vectors of the two classes:
      m1 = (1/N1) Σ_{n∈C1} xn,   m2 = (1/N2) Σ_{n∈C2} xn

[Figure (cf. Bishop, Fig. 4.6): the same 2-D data projected onto the line joining the class means (“difference of means”, left) and onto the Fisher direction (right)]

49

slide by Ce Liu

SLIDE 50

What’s a Good Projection?

  • After projection, the two classes should be separated as much as possible, measured by the distance between the projected centers:
      (wᵀ(m1 − m2))² = wᵀ(m1 − m2)(m1 − m2)ᵀw = wᵀSBw
    where SB = (m1 − m2)(m1 − m2)ᵀ is called the between-class covariance matrix.
  • After projection, the variances of the two classes should be as small as possible, measured by the within-class covariance:
      wᵀSWw
    where
      SW = Σ_{n∈C1} (xn − m1)(xn − m1)ᵀ + Σ_{n∈C2} (xn − m2)(xn − m2)ᵀ

50

slide by Ce Liu

SLIDE 51

Fisher’s Linear Discriminant

  • Fisher criterion: maximize, with respect to w, the ratio
      J(w) = (between-class variance) / (within-class variance) = wᵀSBw / wᵀSWw
  • Recall the quotient rule: for f(x) = g(x)/h(x),
      f′(x) = (g′(x)h(x) − g(x)h′(x)) / h²(x)
  • Setting ∇J(w) = 0, we obtain
      (wᵀSBw) SWw = (wᵀSWw) SBw
      (wᵀSBw) SWw = (wᵀSWw) (m2 − m1) ((m2 − m1)ᵀw)
  • The terms wᵀSBw, wᵀSWw and (m2 − m1)ᵀw are scalars, and we only care about directions, so the scalars can be dropped. Therefore
      w ∝ SW⁻¹ (m2 − m1)

51

slide by Ce Liu
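The closed-form direction translates to a few lines of numpy (my sketch, assuming the two classes are given as row matrices; `fisher_direction` is an illustrative name):

```python
import numpy as np

def fisher_direction(X1, X2):
    """w ∝ S_W^{-1} (m2 - m1) for two classes of points (one example per row)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)   # solve instead of forming an explicit inverse
    return w / np.linalg.norm(w)        # only the direction matters; scale is irrelevant
```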

SLIDE 52

From Fisher’s Linear Discriminant to Classifiers

  • Fisher’s Linear Discriminant is not a classifier; it only decides on an optimal projection that converts a high-dimensional classification problem into a 1D one.
  • A bias (threshold) is needed to form a linear classifier (multiple thresholds lead to nonlinear classifiers). The final classifier has the form
      y(x) = sign(wᵀx + w0)
    where the nonlinear activation function sign(·) is the step function
      sign(a) = +1 if a > 0, −1 if a < 0
  • How to decide the bias w0?

52

slide by Ce Liu
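The slide leaves the question open; one common heuristic (my addition, not the slide's prescription) is to put the threshold halfway between the projected class means:

```python
import numpy as np

def bias_from_projected_means(w, X1, X2):
    """Place the threshold halfway between the projected class means, so the
    classifier is sign(w^T x + w0). A simple heuristic, not the only choice."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    return -0.5 * (w @ m1 + w @ m2)
```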