PLUGIN CLASSIFIERS: NAIVE BAYES, LDA, LOGISTIC REGRESSION


SLIDE 1

PLUGIN CLASSIFIERS: NAIVE BAYES, LDA, LOGISTIC REGRESSION

Matthieu R Bloch
Tuesday, January 28, 2020

SLIDE 2

LOGISTICS

TAs and Office hours
  • Monday: Mehrdad (TSRB 523a), 2pm-3:15pm
  • Tuesday: TJ (VL C449 Cubicle D), 1:30pm-2:45pm
  • Wednesday: Matthieu (TSRB 423), 12:00pm-1:15pm
  • Thursday: Hossein (VL C449 Cubicle B), 10:45am-12:00pm
  • Friday: Brighton (TSRB 523a), 12pm-1:15pm

Homework 1 posted on Canvas
  • Due Wednesday January 29, 2020 (11:59PM EST) (Wednesday February 5, 2020 for DL)

Lecture notes updated
  • Versions 1.1 posted for lectures 1, 3, 4, 5 (small typos)

Logistics for homework submission
  • Upload a separate PDF file
  • Put problems in order
  • Show your work ("Similar to above, etc." does not show work)
  • Include a listing of code (example on Overleaf)

SLIDE 3

RECAP: NAIVE BAYES

Consider a (random) feature vector $x = [x_1, \cdots, x_d]^\intercal \in \mathbb{R}^d$ and the label $y$.

Naive assumption: given $y$, the features $\{x_i\}_{i=1}^d$ are independent, i.e., $P_{x|y} = \prod_{i=1}^d P_{x_i|y}$.

Main benefit: we only need the univariate densities $P_{x_i|y}$, and we can combine discrete and continuous features.

Procedure:
  • Estimate the a priori class probabilities $\pi_k$ for $0 \leq k \leq K-1$
  • Estimate the class conditional densities $P_{x_i|y}(\cdot|k)$ for $1 \leq i \leq d$ and $0 \leq k \leq K-1$

Lemma. The maximum likelihood estimate of $\pi_k$ is $\hat{\pi}_k = \frac{N_k}{N}$, where $N_k \triangleq |\{y_i : y_i = k\}|$.

What about $P_{x|y}$?
  • Continuous features: often Gaussian, use the ML estimate
  • Discrete (categorical) features: often multinomial, use the ML estimate
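The estimation procedure above takes only a few lines of NumPy. The sketch below (the function names are ours, not from the slides) assumes Gaussian class-conditional densities for every feature and works in log space to avoid underflow:

```python
import numpy as np

def fit_naive_bayes_gaussian(X, y, K):
    """ML estimates for a Gaussian naive Bayes model:
    priors pi_k = N_k / N and per-feature means/variances."""
    N, d = X.shape
    priors = np.array([np.sum(y == k) / N for k in range(K)])
    means = np.array([X[y == k].mean(axis=0) for k in range(K)])
    variances = np.array([X[y == k].var(axis=0) for k in range(K)])
    return priors, means, variances

def predict_naive_bayes_gaussian(X, priors, means, variances):
    # log p(x|k) = sum_j log N(x_j; mu_jk, var_jk), by the independence assumption
    log_post = np.log(priors)[None, :] + np.stack([
        -0.5 * np.sum(np.log(2 * np.pi * variances[k])
                      + (X - means[k]) ** 2 / variances[k], axis=1)
        for k in range(priors.size)], axis=1)
    return np.argmax(log_post, axis=1)
```

The same skeleton carries over to discrete features by swapping the per-feature Gaussian log-densities for count-based estimates.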


SLIDE 4

NAIVE BAYES (CT’D)

Assume the $j$th feature $x_j$ takes $J$ distinct values in $\{0, \ldots, J-1\}$.

Lemma. The maximum likelihood estimate of $P_{x_j|y}(\ell|k)$ is
$\hat{P}_{x_j|y}(\ell|k) = \frac{N^{(j)}_{\ell,k}}{N_k}$, where $N^{(j)}_{\ell,k} \triangleq |\{x : y = k \text{ and } x_j = \ell\}|$.

The naive Bayes estimator is
$h_{NB}(x) = \operatorname{argmax}_k \hat{\pi}_k \prod_{j=1}^d \hat{P}_{x_j|y}(x_j|k)$.

Naive Bayes can be completely wrong! Example: the bivariate Gaussian case.
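The count-based estimates from the lemma can be sketched as follows; this is an illustrative implementation (helper names are ours), assuming every feature takes values in $\{0, \ldots, J-1\}$:

```python
import numpy as np

def fit_categorical_nb(X, y, K, J):
    """ML estimate P_hat(l|k) = N^{(j)}_{l,k} / N_k for categorical
    features; X[i, j] is in {0, ..., J-1}."""
    priors = np.array([np.mean(y == k) for k in range(K)])
    d = X.shape[1]
    P = np.zeros((d, J, K))
    for k in range(K):
        Xk = X[y == k]
        for j in range(d):
            # N^{(j)}_{l,k}: number of class-k samples with x_j = l
            P[j, :, k] = np.bincount(Xk[:, j], minlength=J) / len(Xk)
    return priors, P

def predict_categorical_nb(X, priors, P):
    # h_NB(x) = argmax_k pi_k * prod_j P_hat(x_j | k), computed in log space
    N = X.shape[0]
    logp = np.log(priors)[None, :].repeat(N, axis=0)
    for j in range(P.shape[0]):
        logp += np.log(P[j, X[:, j], :])
    return np.argmax(logp, axis=1)
```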


SLIDE 5

NAIVE BAYES AND BAG OF WORDS

Classification of documents into categories (politics, sports, etc.).

Represent a document as a vector $x = [x_1, \cdots, x_d]$ with $x_j$ the number of occurrences of word $j$ in the document.

Model documents of length $n$ and assume the $n$ words are distributed among the $d$ words independently at random (multinomial distribution).

Estimate parameters:
  • Compute the ML estimate of the document classes: $\hat{\pi}_k = \frac{N_k}{N}$
  • Compute the ML estimate of the probability that word $j$ occurs in class $k$ across all documents: $\hat{\mu}_{j,k} = \frac{\sum_\ell \ell N^{(j)}_{\ell,k}}{\sum_{j'=1}^d \sum_\ell \ell N^{(j')}_{\ell,k}}$

Run the classifier: $\hat{h}_{NB} = \operatorname{argmax}_k \hat{\pi}_k \prod_{j=1}^d (\hat{\mu}_{j,k})^{x_j}$.

Weakness of the approach: some words may not show up at training but show up at testing. Use Laplace smoothing:
$\hat{\mu}_{j,k} = \frac{1 + \sum_\ell \ell N^{(j)}_{\ell,k}}{d + \sum_{j'=1}^d \sum_\ell \ell N^{(j')}_{\ell,k}}$
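A minimal multinomial bag-of-words classifier with Laplace smoothing, under the assumptions above (illustrative code, not from the slides):

```python
import numpy as np

def fit_multinomial_nb(X, y, K, laplace=True):
    """X[i, j] = count of word j in document i. Returns class priors
    and (optionally Laplace-smoothed) word probabilities mu[j, k]."""
    N, d = X.shape
    priors = np.array([np.mean(y == k) for k in range(K)])
    # counts[j, k]: total count of word j across all class-k documents
    counts = np.stack([X[y == k].sum(axis=0) for k in range(K)], axis=1)
    if laplace:
        # Laplace smoothing: add 1 to every count, d to every class total
        mu = (counts + 1) / (counts.sum(axis=0) + d)
    else:
        mu = counts / counts.sum(axis=0)
    return priors, mu

def predict_multinomial_nb(X, priors, mu):
    # argmax_k log pi_k + sum_j x_j log mu_{j,k}
    return np.argmax(np.log(priors) + X @ np.log(mu), axis=1)
```

With `laplace=True` no word probability is ever zero, so a word unseen at training no longer annihilates the whole product.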

SLIDE 6

LINEAR DISCRIMINANT ANALYSIS (LDA)

Consider a (random) feature vector $x = [x_1, \cdots, x_d]^\intercal \in \mathbb{R}^d$ and the label $y$.

Assumption: given $y$, the feature vector has a Gaussian distribution $P_{x|y} \sim \mathcal{N}(\mu_k, \Sigma)$ with density
$\phi(x; \mu, \Sigma) \triangleq \frac{1}{(2\pi)^{\frac{d}{2}} |\Sigma|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(x - \mu)^\intercal \Sigma^{-1} (x - \mu)\right)$

The mean is class dependent but the covariance matrix is not.

Estimate the parameters from data (recall the assumption about the covariance matrix):
$\hat{\pi}_k = \frac{N_k}{N}$, $\quad \hat{\mu}_k = \frac{1}{N_k} \sum_{i: y_i = k} x_i$, $\quad \hat{\Sigma} = \frac{1}{N} \sum_{k=0}^{K-1} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\intercal$

Lemma. The LDA classifier is
$h_{LDA}(x) = \operatorname{argmin}_k \left(\frac{1}{2}(x - \hat{\mu}_k)^\intercal \hat{\Sigma}^{-1} (x - \hat{\mu}_k) - \log \hat{\pi}_k\right)$

For $K = 2$, the LDA is a linear classifier.
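The parameter estimates and the classifier from the lemma can be sketched as follows (illustrative helper names; note the covariance estimate is pooled across classes, as the assumption requires):

```python
import numpy as np

def fit_lda(X, y, K):
    """ML estimates for LDA: class priors, per-class means,
    and a single covariance matrix shared by all classes."""
    N, d = X.shape
    priors = np.array([np.mean(y == k) for k in range(K)])
    means = np.array([X[y == k].mean(axis=0) for k in range(K)])
    Sigma = np.zeros((d, d))
    for k in range(K):
        diff = X[y == k] - means[k]
        Sigma += diff.T @ diff   # sum of (x_i - mu_k)(x_i - mu_k)^T over class k
    Sigma /= N
    return priors, means, Sigma

def predict_lda(X, priors, means, Sigma):
    # argmin_k 1/2 (x - mu_k)^T Sigma^{-1} (x - mu_k) - log pi_k
    Sinv = np.linalg.inv(Sigma)
    scores = np.stack([
        0.5 * np.sum((X - means[k]) @ Sinv * (X - means[k]), axis=1)
        - np.log(priors[k]) for k in range(len(priors))], axis=1)
    return np.argmin(scores, axis=1)
```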

SLIDE 7

(figure slide)

SLIDE 8

(figure slide)

SLIDE 9

LINEAR DISCRIMINANT ANALYSIS (CT’D)

The generative model is rarely accurate.

Number of parameters to estimate: $K-1$ class priors, $Kd$ means, $\frac{1}{2}d(d+1)$ elements of the covariance matrix.
  • Works well if $N \gg d$
  • Works poorly if $N \ll d$ without other tricks (dimensionality reduction, structured covariance)

Biggest concern: “one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling p(x|y)].” (Vapnik, 1998)

Revisit the binary classifier with LDA:
$\eta_1(x) = \frac{\pi_1 \phi(x; \mu_1, \Sigma)}{\pi_1 \phi(x; \mu_1, \Sigma) + \pi_0 \phi(x; \mu_0, \Sigma)} = \frac{1}{1 + \exp(-(w^\intercal x + b))}$

We do not need to estimate the full joint distribution!
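The identity between the class-1 posterior and the logistic form can be checked numerically; the parameter values below are arbitrary illustrations, not values from the slides:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # phi(x; mu, Sigma) for a d-dimensional Gaussian
    d = mu.size
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

# Arbitrary illustrative parameters
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
pi0, pi1 = 0.4, 0.6

# w = Sigma^{-1}(mu1 - mu0)
# b = -1/2 mu1^T Sigma^{-1} mu1 + 1/2 mu0^T Sigma^{-1} mu0 + log(pi1/pi0)
w = np.linalg.solve(Sigma, mu1 - mu0)
b = (-0.5 * mu1 @ np.linalg.solve(Sigma, mu1)
     + 0.5 * mu0 @ np.linalg.solve(Sigma, mu0)
     + np.log(pi1 / pi0))

def eta1(x):
    # posterior probability of class 1 under the generative model
    p1 = pi1 * gaussian_pdf(x, mu1, Sigma)
    p0 = pi0 * gaussian_pdf(x, mu0, Sigma)
    return p1 / (p1 + p0)

def sigmoid_form(x):
    # the same posterior written as a logistic function of w^T x + b
    return 1 / (1 + np.exp(-(w @ x + b)))
```

Both expressions agree to numerical precision at any test point, since the log-likelihood ratio of two Gaussians with a shared covariance is exactly linear in $x$.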


SLIDE 10

LOGISTIC REGRESSION

Assume that $\eta(x)$ is of the form $\frac{1}{1 + \exp(-(w^\intercal x + b))}$.

Estimate $\hat{w}$ and $\hat{b}$ from the data directly, and plug in the result to obtain
$\hat{\eta}(x) = \frac{1}{1 + \exp(-(\hat{w}^\intercal x + \hat{b}))}$

The function $x \mapsto \frac{1}{1 + e^{-x}}$ is called the logistic function.

The binary logistic classifier is $h_{LC}(x) = \mathbf{1}\{\hat{\eta}(x) \geq \frac{1}{2}\} = \mathbf{1}\{\hat{w}^\intercal x + \hat{b} \geq 0\}$ (linear).

How do we estimate $\hat{w}$ and $\hat{b}$?
  • From the LDA analysis: $\hat{w} = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0)$ and $\hat{b} = -\frac{1}{2}\hat{\mu}_1^\intercal \hat{\Sigma}^{-1} \hat{\mu}_1 + \frac{1}{2}\hat{\mu}_0^\intercal \hat{\Sigma}^{-1} \hat{\mu}_0 + \log\frac{\hat{\pi}_1}{\hat{\pi}_0}$
  • Direct estimation of $(\hat{w}, \hat{b})$ from maximum likelihood


SLIDE 11

MLE FOR LOGISTIC REGRESSION

We have a parametric density model for the label: $p_\theta(y|x)$ with $p_\theta(1|x) = \hat{\eta}(x)$.

Standard trick: set $\tilde{x} = [1\; x^\intercal]^\intercal$ and $\theta = [b\; w^\intercal]^\intercal$. This allows us to lump in the offset and write
$\eta(x) = \frac{1}{1 + \exp(-\theta^\intercal \tilde{x})}$

Given our dataset $\{(\tilde{x}_i, y_i)\}_{i=1}^N$, the likelihood is $L(\theta) \triangleq \prod_{i=1}^N P_\theta(y_i | \tilde{x}_i)$.

For $K = 2$ with $\mathcal{Y} = \{0, 1\}$ we obtain
$L(\theta) = \prod_{i=1}^N \eta(\tilde{x}_i)^{y_i} (1 - \eta(\tilde{x}_i))^{1 - y_i}$
$\ell(\theta) = \sum_{i=1}^N \left(y_i \log \eta(\tilde{x}_i) + (1 - y_i) \log(1 - \eta(\tilde{x}_i))\right)$
$\ell(\theta) = \sum_{i=1}^N \left(y_i \theta^\intercal \tilde{x}_i - \log(1 + e^{\theta^\intercal \tilde{x}_i})\right)$
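The log-likelihood $\ell(\theta)$ has no closed-form maximizer, but its gradient is $\sum_i (y_i - \eta(\tilde{x}_i))\tilde{x}_i$, so simple gradient ascent works. The sketch below is ours, not from the slides (in practice a Newton/IRLS step is the standard choice):

```python
import numpy as np

def logistic_mle(X, y, steps=5000, lr=0.1):
    """Maximize l(theta) = sum_i (y_i theta^T x_i - log(1 + exp(theta^T x_i)))
    by batch gradient ascent; a minimal sketch with no line search."""
    Xt = np.hstack([np.ones((X.shape[0], 1)), X])  # x_tilde = [1, x^T]^T
    theta = np.zeros(Xt.shape[1])                  # theta = [b, w^T]^T
    for _ in range(steps):
        eta = 1 / (1 + np.exp(-Xt @ theta))        # eta(x_tilde_i) for all i
        theta += lr * Xt.T @ (y - eta) / len(y)    # averaged gradient of l(theta)
    return theta
```

On linearly separable data $\theta$ diverges slowly (the likelihood keeps improving as the sigmoid sharpens), which is why regularization is usually added in practice.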
