

SLIDE 1

Logistic Regression

CSCI 4520 - Introduction to Machine Learning, Spring 2020
Mehdi Allahyari, Georgia Southern University
(slides borrowed from Tom Mitchell, Barnabás Póczos & Aarti Singh)

SLIDE 2

Linear Regression & Linear Classification

[Figure: Weight vs. Height data shown twice, once with a linear fit (regression) and once with a linear decision boundary (classification)]

SLIDE 3

Naïve Bayes Recap…

  • NB Assumption: P(X1 … Xn | Y) = Πi P(Xi | Y)
  • NB Classifier: Y ← arg max_yk P(Y = yk) Πi P(Xi | Y = yk)
  • Assume parametric form for P(Xi|Y) and P(Y)
    – Estimate parameters using MLE/MAP and plug in

SLIDE 4

Generative vs. Discriminative Classifiers

Generative classifiers (e.g., Naïve Bayes)

  • Assume some functional form for P(X,Y) (or P(X|Y) and P(Y))
  • Estimate parameters of P(X|Y), P(Y) directly from training data

But arg max_Y P(X|Y) P(Y) = arg max_Y P(Y|X)

Why not learn P(Y|X) directly? Or better yet, why not learn the decision boundary directly?

Discriminative classifiers (e.g., Logistic Regression)

  • Assume some functional form for P(Y|X) or for the decision boundary
  • Estimate parameters of P(Y|X) directly from training data
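The equality above follows directly from Bayes rule: P(X) does not depend on Y, so dropping it does not change which value of Y maximizes the expression.

$$
\arg\max_Y P(Y \mid X) \;=\; \arg\max_Y \frac{P(X \mid Y)\,P(Y)}{P(X)} \;=\; \arg\max_Y P(X \mid Y)\,P(Y)
$$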

SLIDE 5

Logistic Regression

Idea:

  • Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
  • Why not learn P(Y|X) directly?
SLIDE 6

GNB with equal variance is a linear classifier

  • Consider learning f: X → Y, where
    – X is a vector of real-valued features, <X1 … Xn>
    – Y is boolean
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(µik, σi)
  • model P(Y) as Bernoulli(π)
  • What does that imply about the form of P(Y|X)?
SLIDE 7

Derive form for P(Y|X) for Gaussian P(Xi|Y=yk), assuming σik = σi

SLIDE 8

Starting from Bayes rule and dividing numerator and denominator by P(Y=1) P(X|Y=1):

P(Y=1|X) = P(Y=1) P(X|Y=1) / [ P(Y=1) P(X|Y=1) + P(Y=0) P(X|Y=0) ]
         = 1 / ( 1 + exp( ln[ P(Y=0)/P(Y=1) ] + Σi ln[ P(Xi|Y=0)/P(Xi|Y=1) ] ) )
         = 1 / ( 1 + exp( ln[ (1−π)/π ] + Σi ln[ P(Xi|Y=0)/P(Xi|Y=1) ] ) )

(the conditional independence assumption turns the likelihood ratio into a sum of per-feature log ratios)

SLIDE 9

Substituting the Gaussian densities N(µi0, σi) and N(µi1, σi) (equal variance across classes), the quadratic Xi² terms cancel, so each per-feature log ratio is linear in Xi:

ln[ P(Xi|Y=0)/P(Xi|Y=1) ] = ( (µi0 − µi1)/σi² ) Xi + ( µi1² − µi0² )/( 2σi² )

Collecting terms, the exponent is linear in X; writing it as −(w0 + Σi wi Xi) with wi = (µi1 − µi0)/σi² and w0 = ln( π/(1−π) ) + Σi ( µi0² − µi1² )/( 2σi² ) gives

P(Y=1|X) = 1 / ( 1 + exp( −( w0 + Σi wi Xi ) ) )

linear classification rule!

SLIDE 10

Logistic Function
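The logistic (sigmoid) function σ(z) = 1/(1 + e^(−z)) squashes any real-valued score into (0, 1), which is what makes the linear form above a valid probability. A minimal NumPy sketch of the function and of the resulting model P(Y=1|X) = σ(w0 + Σi wi Xi); the weights and feature values below are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    # logistic function: sigma(z) = 1 / (1 + e^(-z)), S-shaped, range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w0, w):
    # logistic regression model: P(Y=1 | X=x) = sigmoid(w0 + w . x)
    return sigmoid(w0 + np.dot(w, x))

# tiny illustration with made-up weights and a made-up feature vector
x = np.array([1.5, -0.3])
print(predict_proba(x, w0=0.2, w=np.array([0.8, -1.1])))  # prints a probability in (0, 1)
```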

SLIDE 11

Logistic regression more generally

  • Logistic regression when Y is not boolean (but still discrete-valued)
  • Now y ∈ {y1 … yR}: learn R−1 sets of weights, with one expression for P(Y = yk | X) for k < R and another for k = R (see the form below)
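The standard way to write these two cases, with yR acting as the reference class (this is the usual multiclass extension; the notation here is reconstructed rather than copied from the slide):

$$
P(Y = y_k \mid X) = \frac{\exp\big(w_{k0} + \sum_{i=1}^{n} w_{ki} X_i\big)}{1 + \sum_{j=1}^{R-1} \exp\big(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\big)} \quad \text{for } k < R
$$

$$
P(Y = y_R \mid X) = \frac{1}{1 + \sum_{j=1}^{R-1} \exp\big(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\big)} \quad \text{for } k = R
$$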

SLIDE 12

Training Logistic Regression: MCLE

We’ll focus on binary classification:

  • we have L training examples: {<X^1,Y^1>, … <X^L,Y^L>}
  • maximum likelihood estimate for parameters W:
    W_MLE = arg max_W P(<X^1,Y^1>, … <X^L,Y^L> | W)

But there is a problem … we don’t have a model for P(X) or P(X|Y), only for P(Y|X)
SLIDE 13

Training Logistic Regression: MCLE

We’ll focus on binary classification:

  • we have L training examples: {<X^1,Y^1>, … <X^L,Y^L>}
  • maximum conditional likelihood estimate:
    W_MCLE = arg max_W Πl P(Y^l | X^l, W)

Conditioning on X^l sidesteps the need for a model of P(X).

SLIDE 14

Training Logistic Regression: MCLE

  • Choose parameters W = <w0, … wn> to maximize the conditional likelihood of the training data
  • Training data D = {<X^1,Y^1>, … <X^L,Y^L>}
  • Data likelihood = Πl P(X^l, Y^l | W)
  • Data conditional likelihood = Πl P(Y^l | X^l, W)

where P(Y=1 | X, W) = 1 / ( 1 + exp(−(w0 + Σi wi Xi)) ) and P(Y=0 | X, W) = 1 − P(Y=1 | X, W)

SLIDE 15

Expressing Conditional Log Likelihood
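For the binary case, with Y^l ∈ {0, 1} and the logistic form of P(Y|X,W) given on the previous slide, the conditional log likelihood can be written out as:

$$
\ell(W) = \ln \prod_{l} P(Y^l \mid X^l, W)
= \sum_{l} \Big[ Y^l \ln P(Y^l{=}1 \mid X^l, W) + (1 - Y^l)\ln P(Y^l{=}0 \mid X^l, W) \Big]
$$

$$
= \sum_{l} \Big[ Y^l \big(w_0 + \sum_i w_i X_i^l\big) - \ln\big(1 + \exp(w_0 + \sum_i w_i X_i^l)\big) \Big]
$$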

SLIDE 16

Maximizing Conditional Log Likelihood

Good news: l(W) is a concave function of W → no locally optimal solutions!
Bad news: no closed-form solution to maximize l(W)
Good news: concave functions are “easy” to optimize

SLIDE 17

Optimizing concave/convex functions

  • Conditional likelihood for Logistic Regression is concave
  • Maximum of a concave function = minimum of a convex function

Gradient Ascent (concave) / Gradient Descent (convex)

Gradient: ∇_W l(W) = [ ∂l(W)/∂w0, …, ∂l(W)/∂wn ]
Learning rate: η > 0
Update rule: W ← W + η ∇_W l(W)

SLIDE 18

Batch gradient: use error over the entire training set D

Do until satisfied:
  1. Compute the gradient over all of D
  2. Update the vector of parameters: W ← W + η ∇_W l(W)

Stochastic gradient: use error over single examples

Do until satisfied:
  1. Choose (with replacement) a random training example <X^l, Y^l>
  2. Compute the gradient just for that example
  3. Update the vector of parameters

Stochastic gradient approximates batch gradient arbitrarily closely as η → 0.
Stochastic can be much faster when D is very large.
Intermediate approach: use error over subsets (mini-batches) of D.
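A minimal NumPy sketch of the two variants, using the gradient of the conditional log likelihood given on a later slide; the function names and the fixed step size are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_cond_log_likelihood(X, y, w):
    # gradient of l(w): sum over examples of X^l * (Y^l - P(Y=1 | X^l, w));
    # X has a leading column of 1s so w[0] plays the role of w0
    return X.T @ (y - sigmoid(X @ w))

def batch_gradient_ascent(X, y, eta=0.01, n_iters=1000):
    # batch: each update uses the error over the entire training set D
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w += eta * grad_cond_log_likelihood(X, y, w)
    return w

def stochastic_gradient_ascent(X, y, eta=0.01, n_iters=10000, seed=0):
    # stochastic: each update uses a single randomly chosen example (with replacement)
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        l = rng.integers(len(y))
        w += eta * grad_cond_log_likelihood(X[l:l+1], y[l:l+1], w)
    return w
```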

SLIDE 19

Maximize Conditional Log Likelihood: Gradient Ascent

SLIDE 20

Maximize Conditional Log Likelihood: Gradient Ascent

Gradient ascent algorithm: iterate until the change in l(W) < ε.
On each iteration, for all i, repeat the update below.
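The per-weight update uses the standard gradient of the conditional log likelihood (taking X_0^l = 1 so that the same rule covers w_0):

$$
\frac{\partial \ell(W)}{\partial w_i} = \sum_{l} X_i^l \Big( Y^l - \hat{P}(Y^l{=}1 \mid X^l, W) \Big)
\qquad\Rightarrow\qquad
w_i \leftarrow w_i + \eta \sum_{l} X_i^l \Big( Y^l - \hat{P}(Y^l{=}1 \mid X^l, W) \Big)
$$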

SLIDE 21

Effect of step size η

Large η => fast convergence, but larger residual error; also possible oscillations
Small η => slow convergence, but small residual error

SLIDE 22

That’s all for M(C)LE. How about MAP?

  • One common approach is to define priors on W
    – Normal distribution, zero mean, identity covariance
  • Helps avoid very large weights and overfitting
  • MAP estimate: choose W to maximize P(W | X, Y), i.e. P(W) Πl P(Y^l | X^l, W) up to a constant
  • let’s assume a Gaussian prior: W ~ N(0, σ)
SLIDE 23

MLE vs. MAP

  • Maximum conditional likelihood estimate
  • Maximum a posteriori estimate with prior W ~ N(0, σI)
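Written out side by side (treating the N(0, σI) prior as an independent zero-mean Gaussian on each weight, with σ² denoting its variance, and dropping constants that do not depend on W):

$$
W_{MCLE} = \arg\max_W \sum_{l} \ln P(Y^l \mid X^l, W)
$$

$$
W_{MAP} = \arg\max_W \Big[ \ln P(W) + \sum_{l} \ln P(Y^l \mid X^l, W) \Big]
= \arg\max_W \Big[ \sum_{l} \ln P(Y^l \mid X^l, W) - \frac{1}{2\sigma^2} \sum_{i} w_i^2 \Big]
$$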
SLIDE 24

MAP estimates and Regularization

  • Maximum a posteriori estimate with prior W ~ N(0, σI)

The quadratic log-prior term is called a “regularization” term

  • helps reduce overfitting
  • keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
  • used very frequently in Logistic Regression
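A sketch of how the regularization term changes the gradient-ascent update from the earlier snippet. Here lam plays the role of 1/σ² and, like the choice to leave the bias w[0] unpenalized, is an illustrative convention rather than something specified on the slides:

```python
import numpy as np

def map_gradient_ascent(X, y, eta=0.01, lam=0.1, n_iters=1000):
    # maximize  sum_l ln P(Y^l | X^l, w)  -  (lam / 2) * ||w||^2
    # X has a leading column of 1s; w[0] is the bias and is left unpenalized
    w = np.zeros(X.shape[1])
    penalty_mask = np.ones(X.shape[1])
    penalty_mask[0] = 0.0
    for _ in range(n_iters):
        grad = X.T @ (y - 1.0 / (1.0 + np.exp(-(X @ w))))  # log-likelihood gradient
        w += eta * (grad - lam * penalty_mask * w)          # prior pulls weights toward 0
    return w
```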
SLIDE 25

The Bottom Line

  • Consider learning f: X → Y, where
    – X is a vector of real-valued features, <X1 … Xn>
    – Y is boolean
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(µik, σi)
  • model P(Y) as Bernoulli(π)
  • Then P(Y|X) has the logistic form derived above, and we can directly estimate W
  • Furthermore, the same holds if the Xi are boolean
    – try proving that to yourself
SLIDE 26

Generative vs. Discriminative Classifiers

Training classifiers involves estimating f: X → Y, or P(Y|X)

Generative classifiers (e.g., Naïve Bayes)

  • Assume some functional form for P(X|Y), P(Y)
  • Estimate parameters of P(X|Y), P(Y) directly from training data
  • Use Bayes rule to calculate P(Y|X = xi)

Discriminative classifiers (e.g., Logistic Regression)

  • Assume some functional form for P(Y|X)
  • Estimate parameters of P(Y|X) directly from training data
SLIDE 27

Use Naïve Bayes or Logistic Regression?

Consider

  • Restrictiveness of modeling assumptions
  • Rate of convergence (in amount of training data) toward the asymptotic hypothesis

SLIDE 28

Use Naïve Bayes or Logistic Regression?

Consider Y boolean, Xi continuous, X = <X1 … Xn>

Number of parameters to estimate:

  • NB: ?
  • LR: ?
SLIDE 29

Use Naïve Bayes or Logistic Regression?

Consider Y boolean, Xi continuous, X = <X1 … Xn>

Number of parameters:

  • NB: 4n + 1 (for each Xi, two class-conditional means and two variances, plus the prior π)
  • LR: n + 1 (the weights w0, …, wn)

Estimation method:

  • NB parameter estimates are uncoupled
  • LR parameter estimates are coupled
SLIDE 30

Gaussian Naïve Bayes vs. Logistic Regression

Recall the two assumptions used to derive the form of LR from GNB:

  1. Xi conditionally independent of Xk given Y
  2. P(Xi | Y = yk) = N(µik, σi), not N(µik, σik)

Consider three learning methods:

  • GNB (assumption 1 only)
  • GNB2 (assumptions 1 and 2)
  • LR

Which method works better if we have infinite training data, and…

  • Both (1) and (2) are satisfied
  • Neither (1) nor (2) is satisfied
  • (1) is satisfied, but not (2)
SLIDE 31

Gaussian Naïve Bayes vs. Logistic Regression

Recall the two assumptions used to derive the form of LR from GNB:

  1. Xi conditionally independent of Xk given Y
  2. P(Xi | Y = yk) = N(µik, σi), not N(µik, σik)

Consider three learning methods:

  • GNB (assumption 1 only) – decision surface can be non-linear
  • GNB2 (assumptions 1 and 2) – decision surface linear
  • LR – decision surface linear, trained without assumption 1

Which method works better if we have infinite training data, and…

  • Both (1) and (2) are satisfied: LR = GNB2 = GNB
  • (1) is satisfied, but not (2): GNB > GNB2, GNB > LR, LR > GNB2
  • Neither (1) nor (2) is satisfied: GNB > GNB2, LR > GNB2, LR >< GNB (either may win)

[Ng & Jordan, 2002]

SLIDE 32

Gaussian Naïve Bayes vs. Logistic Regression

What if we have only finite training data?

The two converge at different rates to their asymptotic (∞-data) error.
Let ε_{A,n} denote the expected error of learning algorithm A after n training examples, and let d be the number of features: <X1 … Xd>.

GNB requires n = O(log d) examples to approach its asymptotic error, but LR requires n = O(d).

[Ng & Jordan, 2002]

SLIDE 33

Some experiments from UCI data sets [Ng & Jordan, 2002]

[Figure: learning curves comparing Naïve Bayes and Logistic Regression on several UCI data sets]

SLIDE 34

Naïve Bayes vs. Logistic Regression

The bottom line:

  • GNB2 and LR both use linear decision surfaces; GNB need not.
  • Given infinite data, LR is better than or equal to GNB2, because its training procedure does not make assumptions 1 or 2 (though our derivation of the form of P(Y|X) did).
  • But GNB2 converges more quickly to its (perhaps less accurate) asymptotic error.
  • And GNB is both more biased (assumption 1) and less biased (no assumption 2) than LR, so either might outperform the other.

SLIDE 35

What you should know:

  • Logistic regression
    – Functional form follows from Naïve Bayes assumptions
      • For Gaussian Naïve Bayes assuming variance σi,k = σi
      • For discrete-valued Naïve Bayes too
    – But the training procedure picks parameters without making the conditional independence assumption
    – MLE training: pick W to maximize P(Y | X, W)
    – MAP training: pick W to maximize P(W | X, Y)
      • ‘regularization’
      • helps reduce overfitting
  • Gradient ascent/descent
    – General approach when closed-form solutions are unavailable
  • Generative vs. Discriminative classifiers
    – Bias vs. variance tradeoff

SLIDE 36

What you should know:

  • Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR
    – Solutions differ because of the objective (loss) function
  • In general, NB and LR make different assumptions
    – NB: features independent given class → assumption on P(X|Y)
    – LR: functional form of P(Y|X), no assumption on P(X|Y)
  • LR is a linear classifier
    – decision rule is a hyperplane
  • LR is optimized by conditional likelihood
    – no closed-form solution
    – concave → global optimum with gradient ascent
    – maximum conditional a posteriori corresponds to regularization
  • Convergence rates
    – GNB (usually) needs less data
    – LR (usually) gets to better solutions in the limit