
Classification

Machine Learning and Pattern Recognition

Chris Williams

School of Informatics, University of Edinburgh

October 2014

(All of the slides in this course have been adapted from previous versions by Charles Sutton, Amos Storkey, David Barber.)


This Lecture

◮ Now we focus on classification. We've already seen the naive Bayes classifier. This time:
  ◮ An alternative classification family: discriminative methods
  ◮ Logistic regression (this time)
  ◮ Neural networks (coming soon)
◮ Pros and cons of generative and discriminative methods
◮ Reading: Murphy ch 8 up to 8.3.1, §8.4 (not all sections), §8.6.1; Barber 17.4 up to 17.4.1, 17.4.4, 13.2.3


Discriminative Classification

◮ So far, generative methods for classification. These models look like p(y, x) = p(x|y)p(y). To classify, use Bayes' rule to get p(y|x).
◮ Generative assumption: classes exist because the data for each class are drawn from different distributions.
◮ Next we will use a discriminative approach: model p(y|x) directly. This is a conditional approach; don't bother modelling p(x).
◮ Probabilistically, each class label is drawn dependent on the value of x.
◮ Generative: Class → Data. p(x|y)
◮ Discriminative: Data → Class. p(y|x)
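To make the generative route concrete, here is a minimal sketch of Bayes'-rule classification, assuming hypothetical spherical-Gaussian class-conditional densities and an equal class prior (none of these numbers come from the lecture):

```python
import numpy as np

priors = np.array([0.5, 0.5])                        # p(y = 0), p(y = 1)
means = np.array([[-1.0, 0.0], [1.0, 0.0]])          # mean of p(x|y) for each class

def gaussian_density(x, mean):
    """Density of a 2D unit-variance spherical Gaussian at x."""
    diff = x - mean
    return np.exp(-0.5 * diff @ diff) / (2 * np.pi)

def posterior(x):
    """p(y|x) via Bayes' rule: p(x|y) p(y) / sum over classes of p(x|y') p(y')."""
    joint = np.array([gaussian_density(x, m) for m in means]) * priors
    return joint / joint.sum()

print(posterior(np.array([0.5, 0.2])))               # the two posteriors sum to 1
```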


Logistic Regression

◮ Conditional Model
◮ Linear Model
◮ For Classification



Two Class Discrimination

◮ Consider a two-class case: y ∈ {0, 1}.
◮ Use a model of the form

p(y = 1|x) = g(x; w)

◮ g must be between 0 and 1. Furthermore, the fact that probabilities sum to one means p(y = 0|x) = 1 − g(x; w).

◮ What should we propose for g?


The logistic trick

◮ We need two things:
  ◮ A function that returns probabilities (i.e. stays between 0 and 1).
  ◮ But in regression (any form of regression) we used a function that returned values in (−∞, ∞).
◮ Use a simple trick to convert any regression model to a model of class probabilities. Squash it!
◮ The logistic (or sigmoid) function provides a means for this.
◮ g(x) = σ(x) ≡ 1/(1 + exp(−x)).
◮ As x goes from −∞ to ∞, σ(x) goes from 0 to 1.
◮ Other choices are available, but the logistic is the canonical link function.


The Logistic Function

[Figure: the logistic function σ(x) = 1/(1 + exp(−x)), rising from near 0 towards 1 as x goes from −6 to 6.]
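A minimal numerical sketch of the squashing function (assuming NumPy; the evaluation range simply mirrors the figure above):

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-6, 6, 7)
print(sigmoid(xs))       # values rise smoothly from near 0 to near 1
print(sigmoid(0.0))      # exactly 0.5 at x = 0
```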


That is it! Almost

◮ That is all we need. We can now convert any regression model to a classification model.
◮ Consider linear regression f(x) = b + w⊤x. For linear regression p(y|x) = N(y; f(x), σ²) is Gaussian with mean f.
◮ Change the prediction by adding in the logistic function. This changes the likelihood...
◮ p(y = 1|x) = σ(f(x)) = σ(b + w⊤x).
◮ Decision boundary: p(y = 1|x) = p(y = 0|x) = 0.5, i.e. b + w⊤x = 0.
◮ Linear regression / linear parameter models + logistic trick (use of sigmoid squashing) = logistic regression.

◮ Probability of 1 changes with distance from some hyperplane.
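A small sketch of this prediction rule, assuming NumPy; the weights and input below are made up purely for illustration, not taken from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(x, w, b):
    """p(y = 1 | x) = sigma(b + w^T x) for logistic regression."""
    return sigmoid(b + w @ x)

w = np.array([2.0, -1.0])     # hypothetical weight vector
b = 0.5                       # hypothetical bias
x = np.array([0.3, 0.8])
p1 = predict_proba(x, w, b)
print(p1, "class 1" if p1 > 0.5 else "class 0")   # threshold at the decision boundary
```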


slide-3
SLIDE 3

The Linear Decision Boundary

[Figure: labelled data points in the plane, the weight vector w, and the linear decision boundary.]

For two-dimensional data the decision boundary is a line.


Logistic regression

◮ The bias parameter b shifts (for constant w) the position of the hyperplane, but does not alter the angle.
◮ The direction of the vector w affects the angle of the hyperplane. The hyperplane is perpendicular to w.
◮ The magnitude of the vector w affects how certain the classifications are.
◮ For small w, most of the probabilities within a region of the decision boundary will be near 0.5.
◮ For large w, probabilities in the same region will be close to 1 or 0.
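A quick numerical illustration of the last two points, with hypothetical parameters: rescaling (w, b) leaves the decision boundary unchanged but pushes the probabilities away from 0.5.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w, b = np.array([1.0, 1.0]), 0.0      # hypothetical parameters
x = np.array([0.2, 0.1])              # a point slightly off the boundary b + w.x = 0

for scale in (0.5, 1.0, 5.0, 20.0):   # scaling w and b together keeps the same boundary
    print(scale, sigmoid(scale * (b + w @ x)))   # probabilities move from ~0.5 towards 1
```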


Likelihood

◮ Assume data is independent and identically distributed.
◮ For parameters θ, the likelihood is

p(D|θ) = ∏_{n=1}^{N} p(y_n|x_n) = ∏_{n=1}^{N} p(y = 1|x_n)^{y_n} (1 − p(y = 1|x_n))^{1−y_n}

◮ Hence the log likelihood is

log p(D|θ) = ∑_{n=1}^{N} y_n log p(y = 1|x_n) + (1 − y_n) log(1 − p(y = 1|x_n))

◮ For maximum likelihood we wish to maximise this value w.r.t. the parameters w and b.
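A sketch of this log likelihood in code, assuming NumPy, a design matrix X of shape (N, D), 0/1 labels y of shape (N,), and parameters w, b:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, b, X, y):
    """sum_n [ y_n log p(y=1|x_n) + (1 - y_n) log(1 - p(y=1|x_n)) ]."""
    p = sigmoid(b + X @ w)        # p(y = 1 | x_n) for every n
    eps = 1e-12                   # guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```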



Gradients

◮ As before we can calculate the gradients of the log likelihood.
◮ Gradient of the logistic function is σ′(x) = σ(x)(1 − σ(x)).

∇_w L = ∑_{n=1}^{N} (y_n − σ(b + w⊤x_n)) x_n    (1)

∂L/∂b = ∑_{n=1}^{N} (y_n − σ(b + w⊤x_n))    (2)

◮ This cannot be solved directly to find the maximum.
◮ Have to revert to an iterative procedure that searches for a point where the gradient is 0.
◮ This optimization problem is in fact convex.
◮ See later lecture.
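One possible iterative procedure is plain batch gradient ascent on the log likelihood using gradients (1) and (2); the step size and iteration count below are arbitrary illustrative choices (the course covers optimization properly in a later lecture):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient ascent on the logistic regression log likelihood."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(n_iters):
        err = y - sigmoid(b + X @ w)   # (y_n - sigma(b + w^T x_n)), used in (1) and (2)
        w += lr * (X.T @ err) / N      # averaged version of gradient (1)
        b += lr * err.sum() / N        # averaged version of gradient (2)
    return w, b
```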


2D Example

Prior w ∼ N(0, 100I)

[Figure: panels showing the data, the log-likelihood with the MLE decision boundary, and the log-unnormalised posterior with the MAP decision boundary.]

Figure credit: Murphy Fig 8.5

Bayesian Logistic Regression

◮ Add a prior, e.g. w ∼ N(0, V₀)
◮ For linear regression the integrals could be done analytically:

p(y∗|x∗, D) = ∫ p(y∗|x∗, w) p(w|D) dw

◮ For logistic regression the integrals are analytically intractable
◮ Use approximations, e.g. Gaussian approximation, Markov chain Monte Carlo (see later)
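A sketch of the Monte Carlo route, assuming posterior samples of (w, b) are already available from some approximate inference method (the sampling step itself is not shown; the samples below are synthetic placeholders):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mc_predictive(x, w_samples, b_samples):
    """Monte Carlo estimate of p(y* = 1 | x*, D): average sigma(b + w^T x) over samples."""
    return sigmoid(b_samples + w_samples @ x).mean()

rng = np.random.default_rng(0)
w_samples = rng.normal(size=(500, 2))   # placeholder samples standing in for p(w|D)
b_samples = rng.normal(size=500)
print(mc_predictive(np.array([0.5, -0.2]), w_samples, b_samples))
```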


2D Example – Bayesian

Prior w ∼ N(0, 100I)

[Figure: panels showing p(y=1|x, wMAP), decision boundaries for w sampled from p(w|D), and the Monte Carlo approximation of p(y=1|x) obtained by averaging over samples.]

Figure credit: Murphy Fig 8.6


Multi-class targets

◮ We can have categorical targets by using the softmax function on C target classes
◮ Rather than having one set of weights as in binary logistic regression, we have one set of weights for each class c
◮ Have a separate set of weights w_c, b_c for each class c. Define f_c(x) = w_c⊤ x + b_c
◮ Then

p(y = c|x) = softmax(f(x))_c = exp(f_c(x)) / ∑_{c′} exp(f_{c′}(x))

◮ If C = 2, this actually reduces to the logistic regression model we've already seen.
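A small sketch of the softmax model, assuming NumPy; the per-class weight matrix W (C × D) and biases b (C,) below are hypothetical:

```python
import numpy as np

def softmax_probs(x, W, b):
    """p(y = c | x) = exp(f_c(x)) / sum_c' exp(f_c'(x)) with f_c(x) = w_c^T x + b_c."""
    f = W @ x + b            # one score f_c(x) per class
    f = f - f.max()          # subtract the max for numerical stability
    e = np.exp(f)
    return e / e.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # C = 3 classes, D = 2 inputs
b = np.zeros(3)
print(softmax_probs(np.array([0.5, 0.2]), W, b))        # probabilities sum to 1
```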


Generalizing to features

◮ Just as with regression, we can replace x with features φ(x) in logistic regression too.
◮ Just compute the features ahead of time.
◮ Just as in regression, there is still the curse of dimensionality.
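For instance, a sketch with a hypothetical quadratic feature map φ(x) standing in for the raw input (any choice of basis functions would work the same way):

```python
import numpy as np

def phi(x):
    """Hypothetical quadratic feature map for 2D inputs: (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba_features(x, w, b):
    """p(y = 1 | x) = sigma(b + w^T phi(x)); w now has one weight per feature."""
    return sigmoid(b + w @ phi(x))

w = np.zeros(5)    # placeholder weights, one per feature
print(predict_proba_features(np.array([0.3, -0.4]), w, b=0.0))   # 0.5 with zero weights
```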


Generative and Discriminative Methods

◮ Easier to fit? Naive Bayes is easy to fit, cf. the convex optimization problem for logistic regression
◮ Fit classes separately? In a discriminative model, all parameters interact
◮ Handle missing features easily? For generative models this is easily handled (if features are missing at random)
◮ Can handle unlabelled data? For generative models just model p(x) = ∑_y p(x|y)p(y)
◮ Symmetric in inputs and outputs? Discriminative methods model p(y|x) directly
◮ Can handle feature preprocessing? x → φ(x). Big advantage of discriminative methods
◮ Well-calibrated probabilities? Some generative models (e.g. Naive Bayes) make strong assumptions, which can lead to extreme probabilities


Summary

◮ The logistic function.
◮ Logistic regression.
◮ Hyperplane decision boundaries.
◮ The likelihood for logistic regression.
◮ Softmax for multi-class problems.
◮ Pros and cons of generative and discriminative methods.
