SLIDE 1

Logistic Regression, Generative and Discriminative Classifiers

Recommended reading:

  • Ng and Jordan paper: “On Discriminative vs. Generative classifiers: A comparison
    of logistic regression and naïve Bayes,” A. Ng and M. Jordan, NIPS 2002.

Machine Learning 10-701
Tom M. Mitchell
Carnegie Mellon University

Thanks to Ziv Bar-Joseph and Andrew Moore for some slides.

SLIDE 2

Overview

Last lecture:

  • Naïve Bayes classifier
  • Number of parameters to estimate
  • Conditional independence

This lecture:

  • Logistic regression
  • Generative and discriminative classifiers
  • (if time) Bias and variance in learning
SLIDE 3
SLIDE 4

Generative vs. Discriminative Classifiers

Training classifiers involves estimating f: X → Y, or P(Y | X)

Generative classifiers:

  • Assume some functional form for P(X | Y), P(Y)
  • Estimate parameters of P(X | Y), P(Y) directly from training data
  • Use Bayes rule to calculate P(Y | X = xi)

Discriminative classifiers:

  • Assume some functional form for P(Y | X)
  • Estimate parameters of P(Y | X) directly from training data

SLIDE 5
  • Consider learning f: X → Y, where
  • X is a vector of real-valued features, < X1 … Xn >
  • Y is boolean
  • So we use a Gaussian Naïve Bayes classifier
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(µik,σ)
  • model P(Y) as binomial (p)
  • What does that imply about the form of P(Y|X)?
SLIDE 6
  • Consider learning f: X → Y, where
  • X is a vector of real-valued features, < X1 … Xn >
  • Y is boolean
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(µik,σ)
  • model P(Y) as binomial (p)
  • What does that imply about the form of P(Y|X)?
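
The answer, sketched here as a worked derivation (a standard result; the per-feature variance is written σi and assumed shared across classes, matching the model above): applying Bayes rule and the conditional-independence assumption,

    P(Y = 1 \mid X)
      = \frac{P(Y=1)\,P(X \mid Y=1)}{P(Y=1)\,P(X \mid Y=1) + P(Y=0)\,P(X \mid Y=0)}
      = \frac{1}{1 + \exp\!\Big( \ln\frac{P(Y=0)}{P(Y=1)} + \sum_i \ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)} \Big)}

and each Gaussian log ratio is linear in Xi:

    \ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
      = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}

So P(Y = 1 | X) has exactly the logistic (sigmoid) form 1 / (1 + exp(w0 + Σi wi Xi)), which is the functional form logistic regression assumes directly, as the following slides show.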
SLIDE 7

Logistic regression

  • Logistic regression represents the probability of category i using a linear
    function of the input variables:

    P(Y = i \mid X = x) = g_i(w_{i0} + w_{i1} x_1 + \dots + w_{id} x_d)

    where, for i < K,

    g_i(z) = \frac{e^{z_i}}{1 + \sum_{j=1}^{K-1} e^{z_j}}

    and, for category K,

    g_K(z) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{z_j}}
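
A minimal numerical sketch of these formulas (the array layout and names are illustrative assumptions, not from the slides):

    import numpy as np

    def class_probabilities(W, x):
        """P(Y = i | X = x) for K-class logistic regression.

        W: (K-1, d+1) array; row i holds (w_i0, w_i1, ..., w_id) for class i.
        x: (d,) feature vector.
        Returns a length-K array; the last entry is the reference class K.
        """
        x1 = np.concatenate(([1.0], x))            # prepend 1 to absorb the intercept w_i0
        z = W @ x1                                 # linear scores z_i for i = 1..K-1
        denom = 1.0 + np.sum(np.exp(z))            # 1 + sum_j exp(z_j)
        return np.append(np.exp(z), 1.0) / denom   # g_i(z) for i < K, then g_K(z)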

SLIDE 8

Logistic regression

  • The name comes from the logit transformation:

    \log \frac{p(Y = i \mid X = x)}{p(Y = K \mid X = x)}
      = \log \frac{g_i(z)}{g_K(z)}
      = w_{i0} + w_{i1} x_1 + \dots + w_{id} x_d

SLIDE 9

Binary logistic regression

  • We only need one set of parameters
  • This results in a “squashing function” which turns linear predictions into
    probabilities:

    p(Y = 1 \mid X = x)
      = \frac{e^{w_0 + w_1 x_1 + \dots + w_d x_d}}{1 + e^{w_0 + w_1 x_1 + \dots + w_d x_d}}
      = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_d x_d)}}
      = \frac{1}{1 + e^{-z}}
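
A small illustrative sketch of the squashing function (the weights here are made up for the example):

    import numpy as np

    def p_y1_given_x(w, x):
        """Binary logistic regression: p(Y=1 | X=x) = 1 / (1 + exp(-z))."""
        z = w[0] + np.dot(w[1:], x)        # z = w0 + w1*x1 + ... + wd*xd
        return 1.0 / (1.0 + np.exp(-z))    # squash the linear prediction into (0, 1)

    w = np.array([0.5, 2.0, -1.0])         # hypothetical w0, w1, w2
    print(p_y1_given_x(w, np.array([3.0, 0.5])))   # z = 6.0, probability ≈ 0.998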

SLIDE 10

Logistic regression vs. Linear regression

    P(Y = 1 \mid X = x) = \frac{1}{1 + e^{-z}}

SLIDE 11

Example

SLIDE 12

Log likelihood

    l(w) = \sum_{i=1}^{N} \Big[ y_i \log p(x_i; w) + (1 - y_i) \log\big(1 - p(x_i; w)\big) \Big]
         = \sum_{i=1}^{N} \Big[ y_i \log \frac{p(x_i; w)}{1 - p(x_i; w)} + \log\big(1 - p(x_i; w)\big) \Big]
         = \sum_{i=1}^{N} \Big[ y_i \, x_i \cdot w - \log\big(1 + e^{x_i \cdot w}\big) \Big]

SLIDE 13

Log likelihood

    l(w) = \sum_{i=1}^{N} \Big[ y_i \log p(x_i; w) + (1 - y_i) \log\big(1 - p(x_i; w)\big) \Big]
         = \sum_{i=1}^{N} \Big[ y_i \log \frac{p(x_i; w)}{1 - p(x_i; w)} + \log\big(1 - p(x_i; w)\big) \Big]
         = \sum_{i=1}^{N} \Big[ y_i \, x_i \cdot w - \log\big(1 + e^{x_i \cdot w}\big) \Big]

  • Note: this log likelihood is concave in w
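
A sketch of this log likelihood in code (np.logaddexp(0, z) computes log(1 + e^z) stably; the array shapes are assumptions for the example):

    import numpy as np

    def log_likelihood(w, X, y):
        """l(w) = sum_i [ y_i * (x_i . w) - log(1 + exp(x_i . w)) ].

        X: (N, d+1) matrix with a leading column of 1s for the intercept.
        y: (N,) array of 0/1 labels.
        """
        z = X @ w                                     # x_i . w for every example
        return np.sum(y * z - np.logaddexp(0.0, z))   # concave in w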
SLIDE 14

Maximum likelihood estimation

    \frac{\partial l(w)}{\partial w_j}
      = \frac{\partial}{\partial w_j} \sum_{i=1}^{N} \Big[ y_i \, x_i \cdot w - \log\big(1 + e^{x_i \cdot w}\big) \Big]
      = \sum_{i=1}^{N} x_{ij} \big( y_i - p(x_i, w) \big)

where (y_i − p(x_i, w)) is the prediction error.

No closed-form solution! The maximum must be found numerically. Common (but not the only) approaches:

  • Line Search
  • Simulated Annealing
  • Gradient Descent
  • Newton’s Method
  • Matlab glmfit function
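
A sketch of the gradient above, computed for all weights at once (vectorized; variable names are illustrative):

    import numpy as np

    def gradient(w, X, y):
        """dl(w)/dw_j = sum_i x_ij * (y_i - p(x_i, w)), returned for every j."""
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # p(x_i, w) for every example
        return X.T @ (y - p)                 # features weighted by the prediction error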

SLIDE 15

Gradient descent

SLIDE 16

Gradient ascent

    w_j^{t+1} \leftarrow w_j^{t} + \varepsilon \sum_{i} x_{ij} \big( y_i - p(x_i, w) \big)

  • Iteratively updating the weights in this fashion increases the likelihood each round.
  • We eventually reach the maximum.
  • We are near the maximum when changes in the weights are small.
  • Thus, we can stop when the sum of the absolute values of the weight differences is less than some small number.
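
A compact sketch of this training loop, including the stopping rule just described (the step size, tolerance, and iteration cap are illustrative choices):

    import numpy as np

    def fit_logistic(X, y, eps=0.01, tol=1e-6, max_iter=10000):
        """Gradient ascent: w_j <- w_j + eps * sum_i x_ij * (y_i - p(x_i, w))."""
        w = np.zeros(X.shape[1])
        for _ in range(max_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ w)))   # current predictions p(x_i, w)
            w_new = w + eps * (X.T @ (y - p))    # move up the gradient
            if np.sum(np.abs(w_new - w)) < tol:  # weight changes small -> near the maximum
                return w_new
            w = w_new
        return w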
SLIDE 17

Example

  • We get a monotonically increasing log likelihood of the training labels as a
    function of the iterations.

SLIDE 18

Convergence

  • The gradient ascent learning method converges when there is no incentive to
    move the parameters in any particular direction:

    \sum_{i} x_{ik} \big( y_i - p(x_i, w) \big) = 0 \qquad \forall k

  • This condition means that the prediction error is uncorrelated with the
    components of the input vector.
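
At the fitted weights this condition can be checked directly; a brief sketch (reusing X, y, np, and fit_logistic from the earlier sketches):

    w_hat = fit_logistic(X, y)
    residual = y - 1.0 / (1.0 + np.exp(-(X @ w_hat)))   # prediction errors y_i - p(x_i, w)
    print(X.T @ residual)    # each component should be close to 0 at convergence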

SLIDE 19

Naïve Bayes vs. Logistic Regression

  • Generative and Discriminative classifiers
  • Asymptotic comparison (# training examples → infinity)
  • when model correct
  • when model incorrect
  • Non-asymptotic analysis
  • convergence rate of parameter estimates
  • convergence rate of expected error
  • Experimental results

[Ng & Jordan, 2002]

SLIDE 20

Generative-Discriminative Pairs

Example: assume Y is boolean and X = <X1, X2, …, Xn>, where the Xi are boolean, perhaps dependent on Y, but conditionally independent given Y.

Generative model: naïve Bayes. Classify a new example x based on the ratio P(Y = 1 | x) / P(Y = 0 | x); equivalently, based on the sign of the log of this ratio. (In the parameter estimates, s indicates the size of the set and l is a smoothing parameter.)
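
A sketch of this generative classifier for boolean features, with a smoothing parameter l added to the counts (a minimal illustration; the exact estimator shown on the slide may differ):

    import numpy as np

    def train_naive_bayes(X, y, l=1.0):
        """Estimate P(Y=1) and P(X_j = 1 | Y = k) from 0/1 data, with smoothing l."""
        prior = y.mean()
        theta = {}
        for k in (0, 1):
            Xk = X[y == k]
            theta[k] = (Xk.sum(axis=0) + l) / (len(Xk) + 2 * l)   # smoothed P(X_j = 1 | Y = k)
        return prior, theta

    def log_odds(x, prior, theta):
        """log [ P(Y=1 | x) / P(Y=0 | x) ]; classify as 1 when positive."""
        s = np.log(prior) - np.log(1 - prior)
        for k, sign in ((1, +1), (0, -1)):
            s += sign * np.sum(x * np.log(theta[k]) + (1 - x) * np.log(1 - theta[k]))
        return s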

SLIDE 21

Generative-Discriminative Pairs

Example: assume Y is boolean and X = <X1, X2, …, Xn>, where the Xi are boolean, perhaps dependent on Y, but conditionally independent given Y.

Generative model: naïve Bayes. Classify a new example x based on the ratio P(Y = 1 | x) / P(Y = 0 | x).

Discriminative model: logistic regression.

Note: both learn a linear decision surface over X in this case.

SLIDE 22

What is the difference asymptotically?

Notation: let ε(hA,m) denote the error of the hypothesis learned via algorithm A from m examples.

  • If the assumed model is correct (e.g., the naïve Bayes model), and there is a
    finite number of parameters, then …
  • If the assumed model is incorrect …

Note: the assumed discriminative model can be correct even when the generative
model is incorrect, but not vice versa.

SLIDE 23

Rate of convergence: logistic regression

Let hDis,m be logistic regression trained on m examples in n dimensions. Then, with high probability, its error exceeds its asymptotic error by an amount that shrinks as m grows.

Implication: if we want the error to be within some fixed constant of the asymptotic error, it suffices to pick m on the order of n examples. That is, logistic regression converges to its asymptotic classifier in order-n examples (the result follows from Vapnik’s structural risk bound, plus the fact that the VC dimension of n-dimensional linear separators is n).
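
A hedged reconstruction of the bound summarized above, following the statement in Ng & Jordan (2002) up to constants (the slide's own formula was lost in transcription):

    \varepsilon(h_{Dis,m}) \;\le\; \varepsilon(h_{Dis,\infty}) + O\!\left( \sqrt{ \frac{n}{m} \, \log \frac{m}{n} } \right)

so taking m = O(n) examples brings the error within any fixed ε0 of the asymptotic error.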

SLIDE 24

Rate of convergence: naïve Bayes

Consider first how quickly parameter estimates converge toward their asymptotic values. Then we’ll ask how this influences rate of convergence toward asymptotic classification error.

SLIDE 25

Rate of convergence: naïve Bayes parameters

SLIDE 26

Some experiments from UCI data sets

SLIDE 27

What you should know:

  • Logistic regression

– What it is
– How to solve it
– Log linear models

  • Generative and Discriminative classifiers

– Relation between Naïve Bayes and logistic regression
– Which do we prefer, when?

  • Bias and variance in learning algorithms
SLIDE 28

Acknowledgment

Some of these slides are based in part on slides from previous machine learning classes taught by Ziv Bar-Joseph, Andrew Moore at CMU, and by Tommi Jaakkola at MIT. I thank them for providing use of their slides.