SLIDE 1

CSE 802 Spring 2017 Logistic Regression

Inci M. Baytas
Computer Science, Michigan State University
March 29, 2017


SLIDE 2

Introduction

◮ Consider a two-class classification problem; the posterior probability of class C1 can be written as:

$$p(C_1 \mid \Phi) = y(\Phi) = \sigma(w^T \Phi) \qquad (1)$$

◮ σ(·) is the logistic sigmoid function.
◮ p(C2|Φ) = 1 − p(C1|Φ)
◮ Φ is a feature vector, a non-linear transformation of the original observation space x.

◮ The model in Eq. 1 is called Logistic Regression in the terminology of statistics.
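A minimal NumPy sketch of evaluating the posterior in Eq. 1; the feature vector Φ and the weights w below are made-up values for illustration:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Made-up 3-dimensional feature vector Phi = Phi(x) and weight vector w.
Phi = np.array([1.0, 0.5, -1.2])
w = np.array([0.3, -0.8, 1.1])

p_C1 = sigmoid(w @ Phi)   # p(C1 | Phi) = sigma(w^T Phi), Eq. (1)
p_C2 = 1.0 - p_C1         # p(C2 | Phi) = 1 - p(C1 | Phi)
print(p_C1, p_C2)         # the two posteriors sum to one
```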


SLIDE 3

Logistic Regression I

◮ A classification model rather than a regression model.
◮ A probabilistic discriminative model.
◮ We estimate the parameters w directly.
◮ Comparison of logistic regression and a generative model in an M-dimensional feature space Φ:
  ◮ Logistic regression: M adjustable parameters.
  ◮ Generative model: assume we fit Gaussian class-conditional densities using maximum likelihood; M(M + 5)/2 + 1 parameters in total = means: 2M + shared covariance: M(M + 1)/2 + prior p(C1): 1 (a numeric check follows after this list).
◮ Maximum likelihood is used to determine the parameters of the logistic regression model.
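A quick numeric check of the parameter counts above, computed directly from the formulas on this slide:

```python
def logistic_param_count(M):
    """Logistic regression: M adjustable parameters."""
    return M

def generative_param_count(M):
    """Gaussian class-conditionals with a shared covariance:
    means (2M) + shared covariance (M(M+1)/2) + prior p(C1) (1)."""
    return 2 * M + M * (M + 1) // 2 + 1   # equals M(M + 5)/2 + 1

for M in (10, 100, 1000):
    print(M, logistic_param_count(M), generative_param_count(M))
# M = 100: 100 vs. 5251 -- the generative count grows quadratically in M.
```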


SLIDE 4

Logistic Regression II

◮ Definition and properties of the logistic sigmoid function:

$$\sigma(a) = \frac{1}{1 + \exp(-a)}, \qquad \sigma(-a) = 1 - \sigma(a), \qquad \frac{d\sigma}{da} = \sigma(1 - \sigma) \qquad (2)$$
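A short NumPy check of the identities in Eq. 2; the finite-difference step eps is an arbitrary choice for the derivative test:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 11)
s = sigmoid(a)

# Symmetry: sigma(-a) = 1 - sigma(a)
assert np.allclose(sigmoid(-a), 1.0 - s)

# Derivative: d sigma / da = sigma (1 - sigma), checked by central differences
eps = 1e-6   # arbitrary finite-difference step
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2.0 * eps)
assert np.allclose(numeric, s * (1.0 - s))
```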


SLIDE 5

Logistic Regression III - How to Estimate w

◮ For a training data set {Φn, tn}, where tn ∈ {0, 1} and Φn = Φ(xn), with n = 1, ..., N, the likelihood can be written as:

$$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n} \qquad (3)$$

where t = (t1, ..., tN)^T and yn = p(C1|Φn).

◮ The error function is the negative logarithm of the likelihood, known as the cross-entropy error function:

$$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\} \qquad (4)$$

where yn = σ(an) and an = w^T Φn.
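A sketch of Eq. 4 in NumPy; the small eps added inside the logarithms is an implementation guard against log(0), not part of the slide's formula:

```python
import numpy as np

def cross_entropy_error(w, Phi, t):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)], Eq. (4).

    Phi: N x M design matrix (row n is Phi_n); t: N-vector of 0/1 targets.
    """
    a = Phi @ w                        # activations a_n = w^T Phi_n
    y = 1.0 / (1.0 + np.exp(-a))       # y_n = sigma(a_n)
    eps = 1e-12                        # numerical guard against log(0)
    return -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
```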


SLIDE 6

Logistic Regression IV - How to Estimate w

◮ There is no analytical (closed-form) solution.
◮ The cross-entropy loss is a convex function.
◮ Hence there is a global minimum.
◮ We can use an iterative approach.

◮ Calculate the gradient with respect to w:

$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \Phi_n \qquad (5)$$

◮ Use gradient descent (batch or online):

$$w^{\tau+1} = w^{\tau} - \eta \nabla E(w^{\tau}) \qquad (6)$$
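A minimal batch gradient-descent loop implementing Eqs. 5 and 6; the learning rate eta and iteration count are hypothetical defaults that would need tuning in practice:

```python
import numpy as np

def fit_logistic_gd(Phi, t, eta=0.01, n_iters=1000):
    """Batch gradient descent for logistic regression, Eqs. (5)-(6).

    Phi: N x M design matrix (row n is Phi_n); t: N-vector of 0/1 targets.
    eta and n_iters are illustrative defaults, not tuned values.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))   # y_n = sigma(w^T Phi_n)
        grad = Phi.T @ (y - t)                 # Eq. (5): sum_n (y_n - t_n) Phi_n
        w = w - eta * grad                     # Eq. (6): w_{tau+1} = w_tau - eta * grad
    return w
```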


SLIDE 7

Logistic Regression V - How to Estimate w

◮ Newton-Raphson algorithm:

$$w^{(\mathrm{new})} = w^{(\mathrm{old})} - H^{-1} \nabla E(w) \qquad (7)$$

◮ It uses a local quadratic approximation to the cross-entropy error function to update w iteratively.
◮ The Newton-Raphson algorithm is also known as iteratively reweighted least squares (IRLS).
◮ Convexity: H is positive definite (the eigenvalues of H are strictly positive).
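A sketch of the Newton-Raphson (IRLS) update of Eq. 7. The Hessian form H = Φ^T R Φ with R = diag(y_n(1 − y_n)) is the standard result from Bishop (Sec. 4.3.3) and is assumed here, since the slide does not derive it:

```python
import numpy as np

def fit_logistic_newton(Phi, t, n_iters=10):
    """Newton-Raphson (IRLS) updates for logistic regression, Eq. (7)."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
        grad = Phi.T @ (y - t)              # gradient, Eq. (5)
        r = y * (1.0 - y)                   # diagonal of R = diag(y_n (1 - y_n))
        H = Phi.T @ (r[:, None] * Phi)      # Hessian H = Phi^T R Phi (Bishop 4.3.3)
        w = w - np.linalg.solve(H, grad)    # w_new = w_old - H^{-1} grad E, Eq. (7)
    return w
```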


SLIDE 8

Multi-class Logistic Regression

◮ Cross-entropy for the multi-class classification problem:

$$E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk} \qquad (8)$$

where

$$y_k(\Phi) = p(C_k \mid \Phi) = \frac{\exp(w_k^T \Phi)}{\sum_j \exp(w_j^T \Phi)}$$

is called the softmax function.

◮ Use maximum likelihood to estimate the parameters.
◮ Use an iterative approach such as Newton-Raphson.
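A NumPy sketch of the softmax posterior and the multi-class cross-entropy of Eq. 8; the max-subtraction inside the softmax and the 1e-12 guard are standard numerical-stability tricks, not part of the slide:

```python
import numpy as np

def softmax(A):
    """Row-wise softmax; subtracting the row maximum is a stability trick."""
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(W, Phi, T):
    """E(w_1, ..., w_K) = -sum_n sum_k t_nk ln y_nk, Eq. (8).

    W: M x K matrix whose column k is w_k; Phi: N x M; T: N x K one-hot targets.
    """
    Y = softmax(Phi @ W)                     # y_nk = p(C_k | Phi_n)
    return -np.sum(T * np.log(Y + 1e-12))    # 1e-12 guards against log(0)
```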


SLIDE 9

Over-fitting in Logistic Regression

◮ Maximum likelihood can suffer from severe over-fitting.
◮ This can be overcome by finding a MAP solution for w (Bayesian treatment).
◮ Another alternative is to use regularization: add a regularizer to the loss function, giving a regularized log-likelihood (a sketch follows after this list).
  ◮ ℓ2 norm
  ◮ ℓ1 norm (Lasso)
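A sketch of a regularized cross-entropy loss and its (sub)gradient; the penalty strength lam is a hypothetical value, and the ℓ1 branch uses a subgradient since |w| is not differentiable at zero:

```python
import numpy as np

def regularized_loss_grad(w, Phi, t, lam=0.1, norm="l2"):
    """Cross-entropy error, Eq. (4), plus an l2 or l1 penalty on w.

    lam is a hypothetical regularization strength chosen for illustration.
    """
    y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
    loss = -np.sum(t * np.log(y + 1e-12) + (1 - t) * np.log(1 - y + 1e-12))
    grad = Phi.T @ (y - t)
    if norm == "l2":
        loss += lam * np.sum(w ** 2)     # l2 penalty: lam * ||w||_2^2
        grad += 2.0 * lam * w
    else:
        loss += lam * np.sum(np.abs(w))  # l1 (Lasso) penalty: lam * ||w||_1
        grad += lam * np.sign(w)         # subgradient; |w_i| not differentiable at 0
    return loss, grad
```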

SLIDE 10

References

◮ Classification lecture of Dr. Jiayu Zhou.
◮ Christopher Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer-Verlag New York, 2006.
