

SLIDE 1

Pattern Recognition 2019 Linear Models for Classification (2)

Ad Feelders

Universiteit Utrecht

December 11, 2019

SLIDE 2

Two types of approaches to classification

• Discriminative models (“regression”; section 4.3): only model the conditional distribution of t given x.
• Generative models (“density estimation”; section 4.2): model the joint distribution of t and x.

SLIDE 3

Generative Models

In classification we want to estimate $p(C_k \mid x)$. In generative models, we use Bayes' rule

$$p(C_k \mid x) = \frac{p(C_k)\,p(x \mid C_k)}{\sum_{j=1}^{K} p(C_j)\,p(x \mid C_j)},$$

where the $p(x \mid C_j)$ are the class-conditional probability distributions and the $p(C_j)$ are the unconditional ("prior") probabilities of each class.

SLIDE 4

Generative Models

The training data are partitioned into subsets $D = \{D_1, \ldots, D_K\}$, one per class label. The data in $D_j$ are used to estimate $p(x \mid C_j)$. Prior probabilities $p(C_j)$ are estimated from the observed class values. These estimates are plugged into Bayes' formula to obtain probability estimates $\hat{p}(C_k \mid x)$.

SLIDE 5

Generative Models: example (discrete features)

Test mailing data:

                respondents            non-respondents
age        male  female  total     male  female  total
18-25        15      10     25        7       3     10
26-35        15      20     35       10      10     20
36-50        10      10     20       10      20     30
51-64        10       5     15       40      40     80
65+           5       0      5       40      20     60
total        55      45    100      107      93    200

SLIDE 6

Generative Models: example

$\hat{p}(\text{respondent}) = 100/300 = 1/3$ and $\hat{p}(\text{non-respondent}) = 2/3$. Class-conditional relative frequencies:

                respondents            non-respondents
age        male  female  total     male  female  total
18-25      0.15    0.10   0.25    0.035   0.015   0.05
26-35      0.15    0.20   0.35    0.05    0.05    0.10
36-50      0.10    0.10   0.20    0.05    0.10    0.15
51-64      0.10    0.05   0.15    0.20    0.20    0.40
65+        0.05    0.00   0.05    0.20    0.10    0.30
total      0.55    0.45      1    0.535   0.465      1

SLIDE 7

Using Bayes’ Rule

Estimated probability of response for an 18–25 year old male (R = Respondent, M = Male):

$$\hat{p}(R \mid 18\text{--}25, M) = \frac{\hat{p}(18\text{--}25, M \mid R)\,\hat{p}(R)}{\hat{p}(18\text{--}25, M)} = \frac{0.15 \times 1/3}{0.15 \times 1/3 + 0.035 \times 2/3} \approx 0.68$$

Assign the person to the respondents, because this is the class with the highest estimated probability for 18–25 year old males.

SLIDE 8

Curse of Dimensionality

With $D$ input variables of $m$ possible values each, we have to estimate $m^D - 1$ probabilities per group. For $D = 10$ and $m = 5$: $5^{10} - 1 = 9{,}765{,}624$ probabilities. If $N = 1000$, almost all cells are empty; on average we have $1000/9{,}765{,}624 \approx 0.0001$ observations per cell.

Curse of dimensionality: in high dimensions almost all of the input space is empty.

SLIDE 9

Naive Bayes Assumption

Assume the input variables are independent within each group, i.e.

$$p(x \mid C_k) = p(x_1 \mid C_k)\,p(x_2 \mid C_k) \cdots p(x_D \mid C_k)$$

Instead of $m^D - 1$ parameters, we only have to estimate $D(m - 1)$ parameters per group. So with $D = 10$ and $m = 5$, we only have to estimate 40 probabilities per group.
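These counts are easy to verify; a trivial R check, included only to make the numbers concrete:

D <- 10; m <- 5
m^D - 1      # full joint distribution: 9765624 probabilities per group
D * (m - 1)  # naive Bayes: 40 probabilities per group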

SLIDE 10

Using Naive Bayes

Estimated probability of response for an 18–25 year old male with naive Bayes:

$$\hat{p}(R \mid 18\text{--}25, M) = \frac{\hat{p}(18\text{--}25 \mid R)\,\hat{p}(M \mid R)\,\hat{p}(R)}{\hat{p}(18\text{--}25, M)} = \frac{0.25 \times 0.55 \times 1/3}{0.25 \times 0.55 \times 1/3 + 0.05 \times 0.535 \times 2/3} \approx 0.72$$

The probability estimate is higher, but both models lead to the same allocation.
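To make both calculations concrete, a small R check, with the probabilities read off the tables on the earlier slides:

# Priors estimated from the test mailing data.
p.R <- 1/3; p.NR <- 2/3
# Full model: joint class-conditional probabilities p(18-25, M | class).
joint.R <- 0.15; joint.NR <- 0.035
joint.R * p.R / (joint.R * p.R + joint.NR * p.NR)   # 0.682
# Naive Bayes: product of marginals p(18-25 | class) * p(M | class).
nb.R <- 0.25 * 0.55; nb.NR <- 0.05 * 0.535
nb.R * p.R / (nb.R * p.R + nb.NR * p.NR)            # 0.720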

SLIDE 11

Continuous features: normal distribution

Suppose $x \sim N(\mu, \Sigma)$, with

$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$$

Correlation coefficient:

$$\rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$$

SLIDE 12

Contour Plot 1: independent, same variance

[Contour plot in the $(x_1, x_2)$ plane: $\mu_1 = 0$, $\mu_2 = 0$, $\rho_{12} = 0$, $\sigma_1^2 = \sigma_2^2 = 1$.]

SLIDE 13

Contour Plot 2: positive correlation

[Contour plot in the $(x_1, x_2)$ plane: $\mu_1 = 10$, $\mu_2 = 25$, $\rho_{12} = 0.7$, $\sigma_1^2 = \sigma_2^2 = 1$.]

SLIDE 14

Contour Plot 3: negative correlation

[Contour plot in the $(x_1, x_2)$ plane: $\mu_1 = 15$, $\mu_2 = 5$, $\rho_{12} = -0.6$, $\sigma_1^2 = \sigma_2^2 = 1$.]
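Plots like these three can be reproduced in a few lines of R; a minimal sketch (assuming the mvtnorm package is installed), shown here for the positively correlated case of slide 13:

library(mvtnorm)
x1 <- seq(6, 14, len = 100)                  # grid around mu1 = 10
x2 <- seq(21, 29, len = 100)                 # grid around mu2 = 25
Sigma <- matrix(c(1, 0.7, 0.7, 1), 2, 2)     # unit variances, rho12 = 0.7
z <- outer(x1, x2, function(a, b)
  dmvnorm(cbind(a, b), mean = c(10, 25), sigma = Sigma))
contour(x1, x2, z, xlab = "x1", ylab = "x2")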

SLIDE 15

Multivariate Normal Distribution

$D$ variables, i.e. $x = [x_1, \ldots, x_D]^\top$,

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_D \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1D} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} & \cdots & \sigma_{2D} \\ \vdots & & & \ddots & \vdots \\ \sigma_{D1} & \sigma_{D2} & \sigma_{D3} & \cdots & \sigma_D^2 \end{pmatrix}$$

Formula for the normal probability density:

$$p(x) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}$$
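As a sketch, this density can be written out directly in R and checked against mvtnorm::dmvnorm (the package is an assumption; base R suffices for the manual version):

dmvn.manual <- function(x, mu, Sigma) {
  D <- length(mu)
  q <- t(x - mu) %*% solve(Sigma) %*% (x - mu)   # (x - mu)' Sigma^{-1} (x - mu)
  as.numeric(exp(-q / 2) / ((2 * pi)^(D / 2) * sqrt(det(Sigma))))
}
mu <- c(10, 25); Sigma <- matrix(c(1, 0.7, 0.7, 1), 2, 2)
dmvn.manual(c(11, 25), mu, Sigma)
mvtnorm::dmvnorm(c(11, 25), mean = mu, sigma = Sigma)   # same value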

SLIDE 16

Normality Assumption in Classification

If in class $k$ we have $x \sim N(\mu_k, \Sigma_k)$, then the form of $p(x \mid C_k)$ is

$$\frac{1}{(2\pi)^{D/2}|\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right\}$$

SLIDE 17

Normality Assumption in Classification

Estimating $p(x \mid C_k)$ comes down to estimating the mean vector $\mu_k$ and the covariance matrix $\Sigma_k$ for each class. If there are $D$ variables in $x$, then there are $D$ means in the mean vector and $D(D + 1)/2$ distinct elements in the covariance matrix, making a total of $(D^2 + 3D)/2$ parameters to be estimated for each class.

SLIDE 18

Optimal Allocation Rule

Assign $x$ to group $k$ if $p(C_k \mid x)$ is larger than $p(C_j \mid x)$ for all $j \neq k$. Via Bayes' formula this leads to the rule: assign to group $k$ if

$$p(x \mid C_k)\,p(C_k) > p(x \mid C_j)\,p(C_j) \quad \forall j \neq k$$

(since the denominator cancels).

SLIDE 19

Optimal Allocation Rule for Normal Densities

Fill in the formula for the normal density for $p(x \mid C_k)$. Then we get the following optimal allocation rule: assign to group $k$ if

$$\frac{p(C_k)}{(2\pi)^{D/2}|\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right\} > \frac{p(C_j)}{(2\pi)^{D/2}|\Sigma_j|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) \right\}$$

for all $j \neq k$.

SLIDE 20

Optimal Allocation Rule for Normal Densities

Take the natural logarithm:

$$\ln\left[ \frac{1}{(2\pi)^{D/2}|\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right\} p(C_k) \right] = -\frac{D}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_k| - \frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \ln p(C_k)$$

Cancel the terms that are common to all groups and multiply by $-2$:

$$\ln|\Sigma_k| + (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) - 2\ln p(C_k)$$

SLIDE 21

Optimal Allocation Rule for Normal Densities

Discriminant function for class $k$:

$$d_k(x) = \ln|\Sigma_k| - 2\ln p(C_k) + (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)$$

$$= \underbrace{\ln|\Sigma_k| - 2\ln p(C_k) + \mu_k^\top \Sigma_k^{-1} \mu_k}_{\text{constant}} \; \underbrace{-\, 2\mu_k^\top \Sigma_k^{-1} x}_{\text{linear}} \; + \underbrace{x^\top \Sigma_k^{-1} x}_{\text{quadratic}}$$

Assign to class $k$ if $d_k(x) < d_j(x)$ for all $j \neq k$.
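A direct R transcription of this discriminant function (a sketch; in practice MASS::qda does this work):

# Quadratic discriminant score for one class; smaller is better.
d_k <- function(x, mu, Sigma, prior) {
  maha <- t(x - mu) %*% solve(Sigma) %*% (x - mu)   # squared Mahalanobis distance
  as.numeric(log(det(Sigma)) - 2 * log(prior) + maha)
}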

SLIDE 22

Estimation

Estimate $p(C_k)$, $\mu_k$, $\Sigma_k$ from the training data:

$$\hat{p}(C_k) = \hat{p}(t = k) = \frac{N_k}{N}$$

where $N_k$ is the number of observations from group $k$. The mean of $x_i$ in group $k$ is estimated by:

$$\hat{\mu}_{i,k} = \bar{x}_{i,k} = \frac{1}{N_k} \sum_{n:\, t_n = k} x_{n,i} \qquad \text{for } k = 1, \ldots, K \text{ and } i = 1, \ldots, D.$$

SLIDE 23

Estimation

Unbiased estimate of the covariance between $x_i$ and $x_j$ in group $k$:

$$\hat{\Sigma}^{(k)}_{ij} = \frac{1}{N_k - 1} \sum_{n:\, t_n = k} (x_{n,i} - \bar{x}_{i,k})(x_{n,j} - \bar{x}_{j,k}) \qquad \text{for } k = 1, \ldots, K \text{ and } i, j = 1, \ldots, D.$$

If $j = i$, this is the variance of $x_i$ in group $k$.

SLIDE 24

Numeric Example

Training data:

x1  x2  t
 2   4  1
 3   6  1
 4  14  1
 4  18  2
 5  10  2
 6   8  2

SLIDE 25

Estimates

Group 1:

$$\hat{p}(C_1) = \frac{3}{6} = \frac{1}{2}, \qquad \bar{x}_1 = \begin{pmatrix} 3 \\ 8 \end{pmatrix}, \qquad \hat{\Sigma}_1 = \begin{pmatrix} 1 & 5 \\ 5 & 28 \end{pmatrix}$$

Group 2:

$$\hat{p}(C_2) = \frac{3}{6} = \frac{1}{2}, \qquad \bar{x}_2 = \begin{pmatrix} 5 \\ 12 \end{pmatrix}, \qquad \hat{\Sigma}_2 = \begin{pmatrix} 1 & -5 \\ -5 & 28 \end{pmatrix}$$
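These estimates can be reproduced in R from the training data of slide 24 (cov uses the unbiased $1/(N_k - 1)$ estimator of slide 23):

X <- matrix(c(2, 4,  3, 6,  4, 14,  4, 18,  5, 10,  6, 8),
            ncol = 2, byrow = TRUE)
t <- c(1, 1, 1, 2, 2, 2)
table(t) / length(t)    # priors: 0.5, 0.5
colMeans(X[t == 1, ])   # (3, 8)
colMeans(X[t == 2, ])   # (5, 12)
cov(X[t == 1, ])        # [1 5; 5 28]
cov(X[t == 2, ])        # [1 -5; -5 28]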

SLIDE 26

Plug-in estimates in the Optimal Allocation Rule

Estimated discriminant function for class $k$:

$$d_k(x) = \ln|\hat{\Sigma}_k| - 2\ln\hat{p}(C_k) + \bar{x}_k^\top \hat{\Sigma}_k^{-1} \bar{x}_k - 2\bar{x}_k^\top \hat{\Sigma}_k^{-1} x + x^\top \hat{\Sigma}_k^{-1} x$$

Assign to class $k$ if $d_k(x) < d_j(x)$ for all $j \neq k$.

SLIDE 27

Estimates

Score for group 1:

$$d_1(x_1, x_2) = -29.33x_1 + 4.67x_2 + 9.33x_1^2 + 0.33x_2^2 - 3.33x_1x_2 + 27.81$$

Score for group 2:

$$d_2(x_1, x_2) = -133.33x_1 - 24.67x_2 + 9.33x_1^2 + 0.33x_2^2 + 3.33x_1x_2 + 483.81$$

SLIDE 28

Quadratic Discriminant: decision boundary

[Figure: decision boundary of the quadratic discriminant in the $(x_1, x_2)$ plane.]

SLIDE 29

Classify a Point

Allocate $x_0 = \begin{pmatrix} 4 \\ 14 \end{pmatrix}$ to class 1 or 2. Scores:

$$d_1(4, 14) = 3.82, \qquad d_2(4, 14) = 6.48$$

Allocate $x_0$ to group 1.
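These scores can be checked with the d_k sketch from slide 21 and the estimates from slide 25:

d_k <- function(x, mu, Sigma, prior) {
  maha <- t(x - mu) %*% solve(Sigma) %*% (x - mu)
  as.numeric(log(det(Sigma)) - 2 * log(prior) + maha)
}
x0 <- c(4, 14)
d_k(x0, c(3, 8),  matrix(c(1,  5,  5, 28), 2, 2), 0.5)   # 3.82
d_k(x0, c(5, 12), matrix(c(1, -5, -5, 28), 2, 2), 0.5)   # 6.48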

SLIDE 30

Classification of Training Sample

[Figure: the six training points in the $(x_1, x_2)$ plane, labelled with their classes (1 1 1 2 2 2), together with the quadratic decision boundary.]

SLIDE 31

Equal Covariances

Further reduction of the number of parameters. Assumption: $\Sigma_k = \Sigma$ for all $k$:

$$d_k(x) = \ln|\Sigma| - 2\ln p(C_k) + (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)$$

Drop $\ln|\Sigma|$:

$$d_k(x) = -2\ln p(C_k) + x^\top \Sigma^{-1} x - 2\mu_k^\top \Sigma^{-1} x + \mu_k^\top \Sigma^{-1} \mu_k$$

The quadratic term $x^\top \Sigma^{-1} x$ drops out because it is the same for all groups.

SLIDE 32

Linear Discriminant Function

Divide by $-2$:

$$a_k(x) = \underbrace{\mu_k^\top \Sigma^{-1} x}_{\text{linear}} \; \underbrace{-\, \frac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \ln p(C_k)}_{\text{constant}}$$

Assign to group $k$ if $a_k(x) > a_j(x)$ $\forall j \neq k$.
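A sketch of this score in R (note that larger is now better, in contrast to d_k):

# Linear discriminant score for one class.
a_k <- function(x, mu, Sigma, prior) {
  Sinv <- solve(Sigma)
  as.numeric(t(mu) %*% Sinv %*% x - 0.5 * t(mu) %*% Sinv %*% mu + log(prior))
}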

SLIDE 33

Estimation

Estimate the common group covariance matrix $\Sigma$ by the pooled (within-group) sample covariance matrix

$$\hat{\Sigma}_{\text{pooled}} = \sum_{k=1}^{K} \frac{N_k}{N}\,\hat{\Sigma}_k$$

SLIDE 34

Numeric Example

Group 1: $\hat{\Sigma}_1 = \begin{pmatrix} 1 & 5 \\ 5 & 28 \end{pmatrix}$. Group 2: $\hat{\Sigma}_2 = \begin{pmatrix} 1 & -5 \\ -5 & 28 \end{pmatrix}$.

Pooled:

$$\hat{\Sigma}_{\text{pooled}} = \frac{3}{6}\hat{\Sigma}_1 + \frac{3}{6}\hat{\Sigma}_2 = \begin{pmatrix} 1 & 0 \\ 0 & 28 \end{pmatrix}$$

SLIDE 35

Numeric Example

$$a_1(x) = \begin{bmatrix} 3 & 8 \end{bmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{28} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \frac{1}{2}\begin{bmatrix} 3 & 8 \end{bmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{28} \end{pmatrix} \begin{pmatrix} 3 \\ 8 \end{pmatrix} + \ln 0.5 = 3x_1 + \tfrac{2}{7}x_2 - 5\tfrac{9}{14} + \ln 0.5$$

$$a_2(x) = \begin{bmatrix} 5 & 12 \end{bmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{28} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \frac{1}{2}\begin{bmatrix} 5 & 12 \end{bmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{28} \end{pmatrix} \begin{pmatrix} 5 \\ 12 \end{pmatrix} + \ln 0.5 = 5x_1 + \tfrac{3}{7}x_2 - 15\tfrac{1}{14} + \ln 0.5$$

SLIDE 36

Numeric Example

Define

$$a(x) = a_1(x) - a_2(x) = -2x_1 - \tfrac{1}{7}x_2 + 9\tfrac{6}{14}$$

Assign to group 1 if $a(x) > 0$ and to group 2 otherwise.

SLIDE 37

Linear Discriminant: decision boundary

[Figure: decision boundary of the linear discriminant in the $(x_1, x_2)$ plane.]

SLIDE 38

Classify a Point

Allocate $x_0 = \begin{pmatrix} 4 \\ 14 \end{pmatrix}$ to class 1 or 2.

$$a(4, 14) = -2(4) - \tfrac{1}{7}(14) + 9\tfrac{6}{14} = -8 - 2 + 9\tfrac{6}{14} = -\tfrac{8}{14}$$

Allocate $x_0$ to class 2.
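The same allocation can be reproduced in R, both with the a_k sketch from slide 32 and with MASS::lda (which agrees here because both groups have the same size, so its pooled covariance matrix is also diag(1, 28)):

a_k <- function(x, mu, Sigma, prior) {
  Sinv <- solve(Sigma)
  as.numeric(t(mu) %*% Sinv %*% x - 0.5 * t(mu) %*% Sinv %*% mu + log(prior))
}
Sp <- matrix(c(1, 0, 0, 28), 2, 2)   # pooled covariance matrix
x0 <- c(4, 14)
a_k(x0, c(3, 8), Sp, 0.5) - a_k(x0, c(5, 12), Sp, 0.5)   # -8/14, so class 2

library(MASS)
d <- data.frame(x1 = c(2, 3, 4, 4, 5, 6), x2 = c(4, 6, 14, 18, 10, 8),
                t  = factor(c(1, 1, 1, 2, 2, 2)))
predict(lda(t ~ x1 + x2, data = d), data.frame(x1 = 4, x2 = 14))$class   # 2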

SLIDE 39

Classification of Training Sample

[Figure: the six training points in the $(x_1, x_2)$ plane, labelled with their classes (1 1 1 2 2 2), together with the linear decision boundary.]

SLIDE 40

Conn’s Syndrome: Linear Discriminant

[Figure: Conn's syndrome data, sodium vs. co2, with the linear discriminant decision boundary.]

SLIDE 41

Conn’s Syndrome: Quadratic Discriminant

[Figure: Conn's syndrome data, sodium vs. co2, with the quadratic discriminant decision boundary.]

SLIDE 42

Conn’s Syndrome: Linear Discriminant (red) and Logistic Regression (blue)

[Figure: Conn's syndrome data, sodium vs. co2, with observations labelled a and b; linear discriminant boundary in red, logistic regression boundary in blue.]

SLIDE 43

How to in R

> library(MASS)
# fit quadratic discriminant
> conn.qda <- qda(cause ~ sodium + co2, data=conn.dat)
# fit quadratic logistic regression
> conn.logreg.quad <- glm(cause ~ sodium * co2 + I(sodium^2) + I(co2^2),
    data=conn.dat, family=binomial)
# plot points using class values as labels
> plot(conn.dat$sodium, conn.dat$co2, type = "n")
> text(conn.dat$sodium, conn.dat$co2, conn.dat$cause)
# create grid of points for prediction
> sod.p <- seq(135, 150, len = 100)
> co2.p <- seq(20, 35, len = 100)
> conn.points <- expand.grid(sodium = sod.p, co2 = co2.p)

SLIDE 44

How to in R

# create predictions at grid points
> conn.qda.z <- predict(conn.qda, conn.points)
> logreg.quad.z <- predict(conn.logreg.quad, conn.points, type="response")
# compute difference between predicted class probabilities
> zp.qda <- conn.qda.z$post[, 1] - conn.qda.z$post[, 2]
> zp.logreg.quad <- logreg.quad.z - (1 - logreg.quad.z)
# plot contours for level=0
> contour(sod.p, co2.p, matrix(zp.qda, 100), add = TRUE, levels = 0, labcex=0.01)
> contour(sod.p, co2.p, matrix(zp.logreg.quad, 100), add = TRUE, levels = 0, labcex=0.01)
# add legend
> legend(137.5, 33.5, c("QDA","QLR"), col=c(4,2), lwd=c(2,2), lty=c(1,1))

SLIDE 45

Decision boundaries of QDA and QLR

[Figure: decision boundaries of QDA and QLR on the Conn's syndrome data, plotted over conn.dat$sodium and conn.dat$co2.]

SLIDE 46

Application to Digit Recognition

> library(MASS)
> optdigits.lda <- lda(V65 ~ ., data=optdigits.train)
Error in lda.default(x, grouping, ...) :
  variables 1 40 appear to be constant within groups
> summary(optdigits.train[,1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> summary(optdigits.train[,40])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

SLIDE 47

Application to Digit Recognition

> optdigits.lda <- lda(V65 ~ ., data=optdigits.train[,-c(1,40)])
> optdigits.predict.lda <- predict(optdigits.lda, optdigits.train[,-c(1,40)])
> table(optdigits.train[,65], optdigits.predict.lda$class)

[Printed 10 × 10 confusion matrix: rows are true digits 0–9, columns predicted digits; diagonal counts 374, 364, 364, 376, 373, 356, 374, 379, 359, 361, with only small off-diagonal counts.]

SLIDE 48

Application to Digit Recognition

# predictions on test sample
> optdigits.predict.test.lda <- predict(optdigits.lda, optdigits.test[,-c(1,40)])
# accuracy on test sample
> sum(diag(table(optdigits.test[,65],
    optdigits.predict.test.lda$class))) / nrow(optdigits.test)
[1] 0.9387869

SLIDE 49

Logistic Regression

The basic assumption of the logistic regression model is:

$$\ln \frac{p(C_1 \mid x)}{p(C_2 \mid x)} = w^\top x$$

The optimal allocation rule assigns to group 1 if $w^\top x > 0$, and to group 2 otherwise.

SLIDE 50

Linear Discriminant Analysis

Under the equal covariance matrix assumption:

$$\ln \frac{p(C_1 \mid x)}{p(C_2 \mid x)} = a(x) = v^\top x$$

Assign to group 1 if $a(x) > 0$, and to group 2 otherwise.

Conclusion: the assumption of logistic regression is correct if the attributes are normally distributed in each group, and each group has the same covariance matrix.

SLIDE 51

Logistic Regression Assumption

In logistic regression we assume that

$$\ln \frac{p(C_1 \mid x)}{p(C_2 \mid x)} = w^\top x$$

which is exactly true when

(a) $x$ is normally distributed in all groups, and the groups have the same covariance matrix;
(b) $x$ consists of binary variables that are independent within each group (naive Bayes with binary variables);
(c) in some other cases as well.

SLIDE 52

LR and NB

If the NB assumption is correct, we have

$$\frac{p(C_1 \mid x)}{p(C_2 \mid x)} = \frac{\prod_{i=1}^{D} p(x_i \mid C_1)\, p(C_1)}{\prod_{i=1}^{D} p(x_i \mid C_2)\, p(C_2)} = \prod_{i=1}^{D} \left( \frac{p(x_i = 1 \mid C_1)}{p(x_i = 1 \mid C_2)} \right)^{x_i} \left( \frac{p(x_i = 0 \mid C_1)}{p(x_i = 0 \mid C_2)} \right)^{1 - x_i} \frac{p(C_1)}{p(C_2)}$$

Take the natural log:

$$\ln \frac{p(C_1 \mid x)}{p(C_2 \mid x)} = \sum_{i=1}^{D} \left[ x_i \ln \frac{p(x_i = 1 \mid C_1)}{p(x_i = 1 \mid C_2)} + (1 - x_i) \ln \frac{p(x_i = 0 \mid C_1)}{p(x_i = 0 \mid C_2)} \right] + \ln \frac{p(C_1)}{p(C_2)}$$

$$= \sum_{i=1}^{D} x_i \ln \frac{p(x_i = 1 \mid C_1)\, p(x_i = 0 \mid C_2)}{p(x_i = 1 \mid C_2)\, p(x_i = 0 \mid C_1)} + \sum_{i=1}^{D} \ln \frac{p(x_i = 0 \mid C_1)}{p(x_i = 0 \mid C_2)} + \ln \frac{p(C_1)}{p(C_2)}$$

which is linear in $x$.

SLIDE 53

LR and NB: Example

Take $p(C_1) = 0.6$, $p(x_1 = 1 \mid C_1) = 0.8$, $p(x_2 = 1 \mid C_1) = 0.6$, $p(x_1 = 1 \mid C_2) = 0.5$, and $p(x_2 = 1 \mid C_2) = 0.3$. Give the naive Bayes linear decision boundary.

$$\ln \frac{p(C_1 \mid x)}{p(C_2 \mid x)} = 1.386x_1 + 1.253x_2 - 0.916 - 0.56 + 0.405 = -1.071 + 1.386x_1 + 1.253x_2$$
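The coefficients can be verified in R directly from the formula on the previous slide:

pC1 <- 0.6
p1 <- c(0.8, 0.6)   # p(x_i = 1 | C1) for i = 1, 2
p2 <- c(0.5, 0.3)   # p(x_i = 1 | C2) for i = 1, 2
log(p1 * (1 - p2) / (p2 * (1 - p1)))                   # 1.386, 1.253
sum(log((1 - p1) / (1 - p2))) + log(pC1 / (1 - pC1))   # -1.071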

SLIDE 54

Linear Decision Boundary of naive Bayes

[Figure: the decision boundary $1.386x_1 + 1.253x_2 - 1.071 = 0$ in the unit square of $(x_1, x_2)$.]

SLIDE 55

LDA Assumption

Linear discriminant analysis (LDA) assumes that $x \sim N(\mu_j, \Sigma)$ for $j = 1, 2$. It then follows that the optimal allocation rule is linear in $x$. This assumption is more specific than the assumption of logistic regression.

SLIDE 56

Theoretical Comparison

If the LDA assumption is exactly true, then LDA is more efficient than LR. If the LR assumption is true but the LDA assumption is not (e.g. case (b) on slide 51), then LR is consistent but LDA is not. This is only a theoretical comparison!

SLIDE 57

Practical Comparison

LDA is quite robust against violations of the normality assumption, especially if we are only interested in the allocation of $x$ to a group.
