SLIDE 1

Linear Discrimination

Steven J Zeil

Old Dominion Univ.

Fall 2010

SLIDE 2

Linear Discrimination

1. Discriminant-Based Classification
   - Linearly Separable Systems
   - Pairwise Separation
2. Posteriors
3. Logistic Discrimination

SLIDE 3

Discriminant-Based Classification

Likelihood-based: Assume a model for p(x|C_i); use Bayes' rule to calculate P(C_i|x), and take g_i(x) = \log P(C_i|x).

Discriminant-based: Assume a model directly for g_i(x|\phi_i).

Vapnik: Estimating the class densities is a harder problem than estimating the class discriminants, and it does not make sense to solve a hard problem as an intermediate step to solving an easier one.

SLIDE 7

Linear Discrimination

Linear discriminant:

    g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}

Advantages:

- Simple: O(d) space/computation
- Knowledge extraction: the weight magnitudes indicate how much each attribute contributes
- Optimal when the p(x|C_i) are Gaussian with a shared covariance matrix
- Useful when classes are (almost) linearly separable
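As a small illustration (mine, not the slides'), a NumPy sketch of evaluating such a discriminant; the weight values below are invented:

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Evaluate g(x | w, w0) = w^T x + w0."""
    return np.dot(w, x) + w0

# Hypothetical weights for a 2-attribute problem.
w = np.array([0.8, -0.3])
w0 = 0.5
x = np.array([1.0, 2.0])
print(linear_discriminant(x, w, w0))  # 0.8*1.0 - 0.3*2.0 + 0.5 = 0.7
```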

SLIDE 8

More General Linear Models

    g_i(x \mid w_i, w_{i0}) = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}

We can replace the x_j on the right by any linearly independent set of basis functions (see the sketch below).

For two classes, a single discriminant suffices:

    g(x) = g_1(x) - g_2(x) = w^T x + w_0

Choose C_1 if g(x) > 0, C_2 otherwise.
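To make the basis-function idea concrete, here is a sketch of my own, assuming a quadratic basis; none of these weights come from the slides:

```python
import numpy as np

def quadratic_basis(x):
    """Map (x1, x2) to a linearly independent set of basis functions."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

# Hypothetical weights in the expanded 5-dimensional space.
w = np.array([0.5, -1.0, 0.2, 0.1, -0.4])
w0 = 0.3

x = np.array([1.0, 0.5])
g = np.dot(w, quadratic_basis(x)) + w0  # g(x) = g1(x) - g2(x)
print("C1" if g > 0 else "C2")
```

The discriminant is still linear in the weights, so everything that follows applies unchanged in the expanded space.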

SLIDE 9

Geometric Interpretation

Rewrite x as

    x = x_p + r \frac{w}{\|w\|}

where x_p is the projection of x onto the hyperplane g(x) = 0 and w is normal to the hyperplane. Then

    r = \frac{g(x)}{\|w\|}

is the (signed) distance of x from the hyperplane.
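A quick numerical check of this geometry, with an invented w and w_0:

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal to the hyperplane (hypothetical)
w0 = -5.0

def signed_distance(x):
    """r = g(x) / ||w||: positive on the C1 side, negative on the C2 side."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

x = np.array([2.0, 1.0])
r = signed_distance(x)
xp = x - r * w / np.linalg.norm(w)   # projection of x onto g(x) = 0
print(r, np.dot(w, xp) + w0)         # g(xp) is 0 up to rounding
```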

SLIDE 10

Linearly Separable Systems

For multiple classes with

    g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0}

and the w_i normalized, choose C_i if

    g_i(x) = \max_{j=1}^{k} g_j(x)
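In code the decision is just an argmax over the discriminants; a minimal sketch with an invented weight matrix:

```python
import numpy as np

W = np.array([[1.0, 0.0],     # w_1 (one row per class; hypothetical values)
              [0.0, 1.0],     # w_2
              [-1.0, -1.0]])  # w_3
w0 = np.array([0.0, 0.1, 0.2])

def choose_class(x):
    """Choose C_i with g_i(x) = max_j g_j(x)."""
    g = W @ x + w0
    return int(np.argmax(g))

print(choose_class(np.array([2.0, 0.5])))  # class index 0 here
```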

SLIDE 11

Pairwise Separation

If the classes are not linearly separable, compute a discriminant between each pair of classes:

    g_{ij}(x \mid w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}

Choose C_i if \forall j \neq i,\; g_{ij}(x) > 0
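A sketch of the pairwise decision rule; the three pairwise discriminants below are invented for illustration:

```python
import numpy as np

def choose_pairwise(x, g_pair):
    """g_pair[i][j] is the discriminant between C_i and C_j (g_ji = -g_ij).
    Choose C_i if g_ij(x) > 0 for all j != i."""
    K = len(g_pair)
    for i in range(K):
        if all(g_pair[i][j](x) > 0 for j in range(K) if j != i):
            return i
    return None  # x lies in a region claimed by no class

# Hypothetical pairwise linear discriminants for K = 3 classes.
g01 = lambda x: x[0] - x[1]
g02 = lambda x: x[0] - 1.0
g12 = lambda x: x[1] - 1.0
g_pair = [[None, g01, g02],
          [lambda x: -g01(x), None, g12],
          [lambda x: -g02(x), lambda x: -g12(x), None]]

print(choose_pairwise(np.array([2.0, 0.5]), g_pair))  # -> 0
```

Note the `None` return: with pairwise separation, some regions of the input space may satisfy no class's conditions, which is why a reject option can be natural here.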

SLIDE 12

Revisiting Parametric Methods

When p(x|C_i) \sim N(\mu_i, \Sigma),

    g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0}

with

    w_i = \Sigma^{-1} \mu_i
    w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \log P(C_i)

Let y \equiv P(C_1|x); then P(C_2|x) = 1 - y. We choose C_1 if y > 0.5, or alternatively if y/(1-y) > 1, or equivalently if

    \log \frac{y}{1-y} > 0

The latter quantity is called the log odds of y, or the logit.
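A sketch of turning estimated Gaussian parameters into these weights; the mean, covariance, and prior below are invented values:

```python
import numpy as np

def gaussian_to_linear(mu_i, Sigma, prior_i):
    """w_i = Sigma^{-1} mu_i,  w_i0 = -0.5 mu_i^T Sigma^{-1} mu_i + log P(C_i)."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ mu_i
    w0 = -0.5 * mu_i @ Sigma_inv @ mu_i + np.log(prior_i)
    return w, w0

mu1 = np.array([1.0, 2.0])       # hypothetical class mean
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])   # shared covariance (hypothetical)
w1, w10 = gaussian_to_linear(mu1, Sigma, prior_i=0.6)
print(w1, w10)
```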

SLIDE 13

log odds

For two normal classes with a shared covariance matrix, the log odds is linear in x:

    logit(P(C_1|x)) = \log \frac{P(C_1|x)}{P(C_2|x)}
                    = \log \frac{p(x|C_1)}{p(x|C_2)} + \log \frac{P(C_1)}{P(C_2)}
                    = \log p(x|C_1) - \log p(x|C_2) + \log \frac{P(C_1)}{P(C_2)}

The p(x|C_i) terms are exponentials in x (Gaussian pdfs), so their logs are linear:

    logit(P(C_1|x)) = w^T x + w_0

with

    w = \Sigma^{-1}(\mu_1 - \mu_2)
    w_0 = -\frac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_1 - \mu_2) + \log \frac{P(C_1)}{P(C_2)}

SLIDE 14

logistic

The inverse of the logit function

    logit(P(C_1|x)) = w^T x + w_0

is called the logistic, a.k.a. the sigmoid:

    P(C_1|x) = \mathrm{sigmoid}(w^T x + w_0) = \frac{1}{1 + \exp[-(w^T x + w_0)]}
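One practical note not on the slide: computed naively, the exponential can overflow for large |w^T x + w_0|. A common numerically stable formulation:

```python
import numpy as np

def sigmoid(a):
    """1 / (1 + exp(-a)), computed stably for both signs of a."""
    out = np.empty_like(a, dtype=float)
    pos = a >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-a[pos]))
    ea = np.exp(a[~pos])       # safe: a < 0, so exp(a) < 1
    out[~pos] = ea / (1.0 + ea)
    return out

print(sigmoid(np.array([-800.0, 0.0, 800.0])))  # [0.0, 0.5, 1.0], no overflow
```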

SLIDE 15

Using the Sigmoid

During training, estimate m_1, m_2, and S, then compute w and w_0.

During testing, either

- calculate g(x \mid w, w_0) = w^T x + w_0 and choose C_1 if g(x) > 0, or
- calculate y = \mathrm{sigmoid}(w^T x + w_0) and choose C_1 if y > 0.5.
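Putting the two phases together, a minimal sketch, assuming synthetic Gaussian data: it estimates m_1, m_2, and the pooled S, computes w and w_0 as on the log-odds slide, and classifies either way:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data with a shared covariance (illustrative only).
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))

# Training: estimate m1, m2 and the pooled covariance S, then compute w, w0.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S = (np.cov(X1.T) * (len(X1) - 1) + np.cov(X2.T) * (len(X2) - 1)) \
    / (len(X1) + len(X2) - 2)
S_inv = np.linalg.inv(S)
P1, P2 = 0.5, 0.5                  # equal priors assumed here
w = S_inv @ (m1 - m2)
w0 = -0.5 * (m1 + m2) @ S_inv @ (m1 - m2) + np.log(P1 / P2)

# Testing: threshold g(x) at 0, or equivalently sigmoid(g(x)) at 0.5.
x = np.array([1.5, 1.0])
g = w @ x + w0
y = 1.0 / (1.0 + np.exp(-g))
print("C1" if y > 0.5 else "C2")   # same decision as g > 0
```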

SLIDE 16

Logistic Discrimination

For two classes, assume the log likelihood ratio is linear:

    \log \frac{p(x|C_1)}{p(x|C_2)} = w^T x + w_0

so that (absorbing the log prior ratio into w_0)

    logit(P(C_1|x)) = w^T x + w_0

    y = \hat{P}(C_1|x) = \frac{1}{1 + \exp[-(w^T x + w_0)]}

Likelihood:

    l(w, w_0 \mid X) = \prod_t (y^t)^{r^t} (1 - y^t)^{1 - r^t}

Error ("cross-entropy"):

    E(w, w_0 \mid X) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log (1 - y^t) \right]

Train by numerical optimization to minimize E.

SLIDE 17

Estimating w

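The figure from this slide is not reproduced here. As a rough sketch of what estimating w by numerical optimization can look like, batch gradient descent on the cross-entropy E defined above; the learning rate, epoch count, and data are arbitrary choices of mine, not the slide's:

```python
import numpy as np

def train_logistic(X, r, eta=0.1, epochs=1000):
    """Minimize E(w, w0 | X) = -sum_t [r^t log y^t + (1-r^t) log(1-y^t)]
    by batch gradient descent; dE/dw = sum_t (y^t - r^t) x^t."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))  # y^t for all t at once
        w -= eta / n * (X.T @ (y - r))
        w0 -= eta / n * np.sum(y - r)
    return w, w0

# Tiny synthetic sanity check.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 1.0, (50, 2)), rng.normal(-1.5, 1.0, (50, 2))])
r = np.concatenate([np.ones(50), np.zeros(50)])
w, w0 = train_logistic(X, r)
pred = (X @ w + w0 > 0).astype(float)
print((pred == r).mean())  # training accuracy
```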

SLIDE 18

Multiple classes

For K classes, take C_K as a reference class:

    \log \frac{p(x|C_i)}{p(x|C_K)} = w_i^T x + w_{i0}

    \frac{P(C_i|x)}{P(C_K|x)} = \exp(w_i^T x + w_{i0})

    y_i = \hat{P}(C_i|x) = \frac{\exp(w_i^T x + w_{i0})}{1 + \sum_{j=1}^{K-1} \exp(w_j^T x + w_{j0})}

This is called the softmax function, because exponentiation combined with normalization tends to exaggerate the weight of the maximum term.

Likelihood:

    l(\{w_i, w_{i0}\} \mid X) = \prod_t \prod_i (y_i^t)^{r_i^t}
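A small sketch of the softmax posterior above, with the reference-class convention made explicit (C_K contributes exp(0) = 1 to the denominator) and a max-subtraction for numerical stability; all weights are invented:

```python
import numpy as np

def softmax_posteriors(x, W, w0):
    """y_i = exp(w_i^T x + w_i0) / (1 + sum_{j<K} exp(w_j^T x + w_j0)).
    Rows of W are w_1 .. w_{K-1}; the reference class C_K gets a = 0."""
    a = np.append(W @ x + w0, 0.0)   # reference class contributes exp(0) = 1
    a -= a.max()                     # stabilize against overflow; result unchanged
    e = np.exp(a)
    return e / e.sum()

W = np.array([[1.0, -0.5],           # hypothetical w_1, w_2 for K = 3 classes
              [-0.2, 0.8]])
w0 = np.array([0.1, -0.1])
print(softmax_posteriors(np.array([0.5, 1.0]), W, w0))  # sums to 1
```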

SLIDE 19

Multiple classes (cont.)

Error ("cross-entropy"):

    E(\{w_i, w_{i0}\} \mid X) = -\sum_t \sum_i r_i^t \log y_i^t

Train by numerical optimization to minimize E.
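Mirroring the two-class case, a hedged sketch of minimizing this E by batch gradient descent. It uses the full-softmax parameterization (one weight vector per class, no reference class), which yields the same decision rule; the gradient with respect to w_i is \sum_t (y_i^t - r_i^t) x^t. The toy data and step size are invented:

```python
import numpy as np

def train_softmax(X, R, eta=0.1, epochs=500):
    """Minimize E = -sum_t sum_i r_i^t log y_i^t by batch gradient descent.
    X: (n, d) inputs; R: (n, K) one-hot labels."""
    n, d = X.shape
    K = R.shape[1]
    W, w0 = np.zeros((K, d)), np.zeros(K)
    for _ in range(epochs):
        A = X @ W.T + w0
        A -= A.max(axis=1, keepdims=True)   # numerical stability
        Y = np.exp(A)
        Y /= Y.sum(axis=1, keepdims=True)   # y_i^t
        W -= eta / n * ((Y - R).T @ X)      # dE/dW_i = sum_t (y_i^t - r_i^t) x^t
        w0 -= eta / n * (Y - R).sum(axis=0)
    return W, w0

# Toy 3-class problem with one-hot labels.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in ([0, 2], [2, 0], [-2, -2])])
R = np.repeat(np.eye(3), 30, axis=0)
W, w0 = train_softmax(X, R)
print((np.argmax(X @ W.T + w0, axis=1) == np.argmax(R, axis=1)).mean())
```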

SLIDE 20

Softmax Classification

[Figure: softmax classification example]

SLIDE 21

Softmax Discriminants

[Figure: softmax discriminants example]