SLIDE 1

Linear Models for Classification

Oliver Schulte - CMPT 726
Bishop PRML Ch. 4

SLIDE 2

Classification: Hand-written Digit Recognition

xi = (pixel vector of a handwritten digit image)   ti = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)

  • Each input vector is classified into one of K discrete classes
  • Denote the classes by Ck
  • Represent the input image as a vector xi ∈ R^784
  • We have a target vector ti ∈ {0, 1}^10
  • Given a training set {(x1, t1), . . . , (xN, tN)}, the learning problem is to construct a “good” function y(x) from these
  • y : R^784 → R^10

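A minimal sketch (Python/NumPy; function and variable names are ours) of how integer digit labels become the one-hot target vectors ti described above:

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Map integer class labels to one-hot rows, e.g. 3 -> (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1
    return T

labels = np.array([3, 0, 9])
print(one_hot(labels))  # row 0 has its 1 in column 3, matching ti above
```
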
SLIDE 3

Generalized Linear Models

  • Similar to the previous chapter on linear models for regression, we will use a “linear” model for classification: y(x) = f(wTx + w0)
  • This is called a generalized linear model
  • f(·) is a fixed non-linear function, e.g. f(u) = 1 if u ≥ 0, 0 otherwise
  • The decision boundary between classes will be a linear function of x
  • Can also apply a non-linearity to x, as in φi(x) for regression

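A sketch of this generalized linear model with the hard-threshold f; the weights below are illustrative, not learned:

```python
import numpy as np

def f(u):
    return (u >= 0).astype(int)  # f(u) = 1 if u >= 0, 0 otherwise

def y(x, w, w0):
    return f(w @ x + w0)  # y(x) = f(w^T x + w0)

w, w0 = np.array([1.0, -2.0]), 0.5  # assumed example weights
print(y(np.array([3.0, 1.0]), w, w0))  # 1: positive side of the boundary
print(y(np.array([0.0, 1.0]), w, w0))  # 0: negative side
```
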
SLIDE 6

Overview

  • Linear Regression for Classification
  • The Fisher Linear Discriminant, or How to Draw a Line Between Classes
  • The Perceptron, or The Smallest Neural Net
  • Logistic Regression—The Statistician’s Classifier

SLIDE 7

Outline

Discriminant Functions Generative Models Discriminative Models

SLIDE 8

Discriminant Functions with Two Classes

[Figure: geometry of the two-class linear discriminant. The decision surface y = 0 is orthogonal to w and separates region R1 (y > 0) from R2 (y < 0); its displacement from the origin is −w0/||w||.]

  • Start with the 2-class problem, ti ∈ {0, 1}
  • Simple linear discriminant: y(x) = wTx + w0; apply a threshold function to get the classification
  • The decision surface is a line (a hyperplane in general), orthogonal to w
  • The projection of x in the w direction is wTx/||w||

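A small sketch of this geometry: y(x)/||w|| is the signed distance of x from the decision surface, and its sign picks the region (the weights here are made up for illustration):

```python
import numpy as np

w, w0 = np.array([2.0, 1.0]), -4.0  # assumed example discriminant

def signed_distance(x):
    return (w @ x + w0) / np.linalg.norm(w)  # y(x) / ||w||

print(signed_distance(np.array([3.0, 2.0])))  # > 0: region R1
print(signed_distance(np.array([0.0, 0.0])))  # < 0: region R2
```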

SLIDE 9

Multiple Classes

  • A linear discriminant between two classes separates with a hyperplane
  • How to use this for multiple classes?
  • One-versus-the-rest method: build K − 1 classifiers, between Ck and all others
  • One-versus-one method: build K(K − 1)/2 classifiers, between all pairs

[Figure: both schemes leave ambiguous regions, marked “?”, that no single classifier claims.]

SLIDE 15

Multiple Classes

[Figure: convex region Rk containing two points xA and xB, and a point x̂ on the segment between them.]

  • A solution is to build K linear functions:
      yk(x) = wk^T x + wk0
    and assign x to the class with the largest yk(x)
  • This gives connected, convex decision regions: for x̂ = λxA + (1 − λ)xB with 0 ≤ λ ≤ 1,
      yk(x̂) = λyk(xA) + (1 − λ)yk(xB) ⇒ yk(x̂) > yj(x̂), ∀j ≠ k

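A sketch of this K-linear-functions scheme with assumed toy weights; classification is just the argmax over the yk(x):

```python
import numpy as np

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # rows are w_k (illustrative)
w0 = np.array([0.0, 0.0, 0.5])                        # biases w_k0

def classify(x):
    return int(np.argmax(W @ x + w0))  # class k with the largest y_k(x)

print(classify(np.array([2.0, 0.5])))  # 0 for this point
```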

SLIDE 17

Least Squares for Classification

  • How do we learn the decision boundaries (wk, wk0)?
  • One approach is to use least squares, similar to regression
  • Find W to minimize the squared error over all examples and all components of the label vector:
      E(W) = (1/2) Σ_{n=1}^N Σ_{k=1}^K (yk(xn) − tnk)²
  • After some algebra, we get a solution using the pseudo-inverse, as in regression

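A sketch of the pseudo-inverse solution on assumed toy data; the bias wk0 is absorbed by appending a constant feature:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # one-hot targets, K = 2

Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend constant 1 feature
W = np.linalg.pinv(Xa) @ T                      # least-squares solution

def classify(x):
    return int(np.argmax(W.T @ np.concatenate(([1.0], x))))

print(classify(np.array([1.5, 1.5])))  # class 0
```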

SLIDE 20

Problems with Least Squares

[Figure: two-class data with the least squares decision boundary; a second panel adds extra easy points far from the boundary.]

  • Looks okay... least squares decision boundary
  • Similar to the logistic regression decision boundary (more later)
  • Gets worse by adding easy points?!
  • Why?
  • If the target value is 1, points far from the boundary will have a high value of y(x), say 10; this is a large squared error, so the boundary is moved

SLIDE 24

More Least Squares Problems

[Figure: three linearly separable classes; least squares fails to separate the middle class.]

  • Easily separated by hyperplanes, but not found using least squares!
  • Remember that least squares is MLE under a Gaussian noise model for a continuous target - we’ve got discrete targets
  • We’ll address these problems later with better models
  • First, a look at a different criterion for a linear discriminant

SLIDE 25

Fisher’s Linear Discriminant

  • The two-class linear discriminant acts as a projection
      y = wTx ≥ −w0
    followed by a threshold
  • In which direction w should we project?
  • One which separates the classes “well”
  • Intuition: we want the (projected) centres of the classes to be far apart, and each class (projection) to be clustered around its centre

SLIDE 29

Fisher’s Linear Discriminant

[Figure: the same two classes projected onto the mean-difference direction vs. the Fisher direction.]

  • A natural idea would be to project in the direction of the line connecting the class means
  • However, this is problematic if the classes have variance in this direction (i.e. are not clustered around the mean)
  • Fisher criterion: maximize the ratio of inter-class separation (between classes) to intra-class variance (within each class)

SLIDE 33

Math time - FLD

  • Projection yn = wTxn
  • Inter-class separation is the distance between the projected class means (good):
      mk = (1/Nk) Σ_{n∈Ck} wTxn
  • Intra-class variance (bad):
      sk² = Σ_{n∈Ck} (yn − mk)²
  • Fisher criterion:
      J(w) = (m2 − m1)² / (s1² + s2²)
    maximize wrt w

SLIDE 37

Math time - FLD

  J(w) = (m2 − m1)² / (s1² + s2²) = (wTSBw) / (wTSWw)

  Between-class covariance: SB = (m2 − m1)(m2 − m1)T
  Within-class covariance: SW = Σ_{n∈C1} (xn − m1)(xn − m1)T + Σ_{n∈C2} (xn − m2)(xn − m2)T
  (here m1, m2 denote the class means in input space)

  Lots of math: w ∝ SW⁻¹(m2 − m1)

If the covariance SW is isotropic (proportional to the unit matrix, so little variance within each class), this reduces to the class-mean difference vector

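A sketch of the Fisher direction w ∝ SW⁻¹(m2 − m1) on assumed toy data (the data generation is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [2.0, 0.5], size=(50, 2))  # class 1 samples
X2 = rng.normal([3.0, 1.0], [2.0, 0.5], size=(50, 2))  # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class covariance
w = np.linalg.solve(SW, m2 - m1)                        # w ∝ SW^{-1}(m2 − m1)
print(w / np.linalg.norm(w))
```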

SLIDE 39

FLD Summary

  • FLD is a dimensionality reduction technique (more later in the course)
  • The criterion for choosing the projection is based on class labels
  • Still suffers from outliers (e.g. the earlier least squares example)

SLIDE 40

Perceptrons

  • “Perceptron” is used to refer to many neural network structures (more next week)
  • The classic type is a fixed non-linear transformation of the input, one layer of adaptive weights, and a threshold: y(x) = f(wTφ(x))
  • Developed by Rosenblatt in the 50s
  • The main difference compared to the methods we’ve seen so far is the learning algorithm

SLIDE 41

Perceptron Learning

  • Two-class problem
  • For ease of notation, we will use t = 1 for class C1 and t = −1 for class C2
  • We saw that squared error was problematic
  • Instead, we’d like to minimize the number of misclassified examples
  • An example is misclassified if wTφ(xn)tn < 0
  • Perceptron criterion:
      EP(w) = − Σ_{n∈M} wTφ(xn)tn
    where M is the set of misclassified examples (sum over misclassified examples only)

SLIDE 45

Perceptron Learning Algorithm

  • Minimize the error function using stochastic gradient descent (gradient descent per example):
      w(τ+1) = w(τ) − η∇EP(w) = w(τ) + ηφ(xn)tn   (update only if xn is misclassified)
  • Iterate over all training examples; only change w if the example is misclassified
  • Guaranteed to converge if the data are linearly separable
  • Will not converge if they are not
  • May take many iterations
  • Sensitive to initialization

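A minimal perceptron sketch, taking φ(x) = x and assumed separable toy data:

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])           # labels in {+1, -1}
Xa = np.hstack([np.ones((4, 1)), X])   # absorb the bias into w

w, eta = np.zeros(3), 1.0
for _ in range(100):                   # epochs; stops early once separated
    mistakes = 0
    for xn, tn in zip(Xa, t):
        if (w @ xn) * tn <= 0:         # misclassified (or on the boundary)
            w += eta * xn * tn         # perceptron update
            mistakes += 1
    if mistakes == 0:
        break
print(w)
```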

SLIDE 48

Perceptron Learning Illustration

[Figure: a sequence of panels showing the decision boundary being updated after each misclassified point until the two classes are separated.]

  • Note there are many hyperplanes with 0 error
  • Support vector machines (in a few weeks) have a nice way of choosing one

SLIDE 53

Limitations of Perceptrons

  • Perceptrons can only solve linearly separable problems in feature space
  • Same as the other models in this chapter
  • The canonical example of a non-separable problem is X-OR
  • Real datasets can look like this too

[Figure: the X-OR problem on inputs I1 and I2 - no single line separates the two classes.]

SLIDE 54

Outline

Discriminant Functions Generative Models Discriminative Models

SLIDE 55

Intuition for Logistic Regression in Generative Models

  • Classification with joint probabilities, with 2 classes C1 and C2:
  • Choose C1 if
      1 < p(C1|x)/p(C2|x) = p(C1, x)/p(C2, x)
      ⇔ 0 < ln [p(C1|x)/p(C2|x)] = ln p(C1|x) − ln p(C2|x)
  • The quantity a = ln [p(C1|x)/p(C2|x)] is called the log-odds
  • Logistic regression assumes that the log-odds are a linear function of the feature vector: a = wTx + w0
  • This is often true given assumptions about the class-conditional densities p(x|Ck)

SLIDE 59

Probabilistic Generative Models

  • With 2 classes, C1 and C2:
      p(C1|x) = p(x|C1)p(C1) / p(x)                                  (Bayes’ rule)
              = p(x|C1)p(C1) / [p(x, C1) + p(x, C2)]                 (sum rule)
              = p(x|C1)p(C1) / [p(x|C1)p(C1) + p(x|C2)p(C2)]        (product rule)
  • In generative models we specify the class-conditional distribution p(x|Ck) which generates the data for each class

SLIDE 63

Probabilistic Generative Models - Example

  • Let’s say we observe x, the current temperature
  • Determine whether we are in Vancouver (C1) or Honolulu (C2)
  • Generative model:
      p(C1|x) = p(x|C1)p(C1) / [p(x|C1)p(C1) + p(x|C2)p(C2)]
  • p(x|C1) is the distribution over typical temperatures in Vancouver, e.g. p(x|C1) = N(x; 10, 5)
  • p(x|C2) is the distribution over typical temperatures in Honolulu, e.g. p(x|C2) = N(x; 25, 5)
  • Class priors p(C1) = 0.1, p(C2) = 0.9
  • p(C1|x = 15) = (0.0484 · 0.1) / (0.0484 · 0.1 + 0.0108 · 0.9) ≈ 0.33

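A sketch reproducing the calculation; the slide’s numbers match Gaussians with standard deviation 5, which we assume here:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prior1, prior2 = 0.1, 0.9
lik1 = normal_pdf(15.0, 10.0, 5.0)  # ~0.0484, Vancouver
lik2 = normal_pdf(15.0, 25.0, 5.0)  # ~0.0108, Honolulu
posterior1 = lik1 * prior1 / (lik1 * prior1 + lik2 * prior2)
print(posterior1)  # ~0.33
```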

SLIDE 65

Generalized Linear Models

  • Suppose we have built a model for predicting the log-odds. We can use it to compute the class probability as follows:
      p(C1|x) = p(x|C1)p(C1) / [p(x|C1)p(C1) + p(x|C2)p(C2)] = 1 / (1 + exp(−a)) ≡ σ(a)
    where a = ln [p(x|C1)p(C1) / (p(x|C2)p(C2))] = ln [p(x, C1) / p(x, C2)]

SLIDE 67

Logistic Sigmoid

[Figure: plot of σ(a), rising from 0 toward 1 with σ(0) = 0.5.]

  • The function σ(a) = 1/(1 + exp(−a)) is known as the logistic sigmoid
  • It squashes the real axis down to [0, 1]
  • It is continuous and differentiable

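A direct sketch of the sigmoid:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))  # squashes R into (0, 1)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```
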
SLIDE 68

Multi-class Extension

  • There is a generalization of the logistic sigmoid to K > 2 classes:
      p(Ck|x) = p(x|Ck)p(Ck) / Σ_j p(x|Cj)p(Cj) = exp(ak) / Σ_j exp(aj)
    where ak = ln [p(x|Ck)p(Ck)]
  • a.k.a. the softmax function
  • If some ak ≫ aj (for all j ≠ k), p(Ck|x) goes to 1

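A sketch of the softmax; subtracting max(a) before exponentiating is a standard numerical-stability choice of ours, not from the slide:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))  # shift for stability; the result is unchanged
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1; the largest a_k dominates
```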
slide-69
SLIDE 69

Discriminant Functions Generative Models Discriminative Models

Multi-class Extension

  • There is a generalization of the logistic sigmoid to K > 2

classes: p(Ck|x) = p(x|Ck)p(Ck)

  • j p(x|Cj)p(Cj)

= exp(ak)

  • j exp(aj)

where ak = ln p(x|Ck)p(Ck)

  • a. k. a. softmax function
  • If some ak ≫ aj, p(Ck|x) goes to 1
SLIDE 70

Example Logistic Regression

[Figure: two-class data with the logistic regression decision boundary.]

SLIDE 71

Gaussian Class-Conditional Densities

  • Back to the log-odds a in the logistic sigmoid for 2 classes
  • Let’s assume the class-conditional densities p(x|Ck) are Gaussians with the same covariance matrix Σ:
      p(x|Ck) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −(1/2)(x − µk)TΣ⁻¹(x − µk) )
  • Then a takes a simple form:
      a = ln [p(x|C1)p(C1) / (p(x|C2)p(C2))] = wTx + w0
  • Note that the quadratic terms xTΣ⁻¹x cancel
slide-72
SLIDE 72

Discriminant Functions Generative Models Discriminative Models

Gaussian Class-Conditional Densities

  • Back to the log-odds a in the logistic sigmoid for 2 classes
  • Let’s assume the class-conditional densities p(x|Ck) are

Gaussians, and have the same covariance matrix Σ: p(x|Ck) = 1 (2π)D/2|Σ|1/2 exp

  • −1

2(x − µk)TΣ−1(x − µk)

  • a takes a simple form:

a = ln p(x|C1)p(C1) p(x|C2)p(C2) = wTx + w0

  • Note that quadratic terms xTΣ−1x cancel
slide-73
SLIDE 73

Discriminant Functions Generative Models Discriminative Models

Gaussian Class-Conditional Densities

  • Back to the log-odds a in the logistic sigmoid for 2 classes
  • Let’s assume the class-conditional densities p(x|Ck) are

Gaussians, and have the same covariance matrix Σ: p(x|Ck) = 1 (2π)D/2|Σ|1/2 exp

  • −1

2(x − µk)TΣ−1(x − µk)

  • a takes a simple form:

a = ln p(x|C1)p(C1) p(x|C2)p(C2) = wTx + w0

  • Note that quadratic terms xTΣ−1x cancel
slide-74
SLIDE 74

Discriminant Functions Generative Models Discriminative Models

Gaussian Class-Conditional Densities

  • Back to the log-odds a in the logistic sigmoid for 2 classes
  • Let’s assume the class-conditional densities p(x|Ck) are

Gaussians, and have the same covariance matrix Σ: p(x|Ck) = 1 (2π)D/2|Σ|1/2 exp

  • −1

2(x − µk)TΣ−1(x − µk)

  • a takes a simple form:

a = ln p(x|C1)p(C1) p(x|C2)p(C2) = wTx + w0

  • Note that quadratic terms xTΣ−1x cancel
slide-75
SLIDE 75

Discriminant Functions Generative Models Discriminative Models

Maximum Likelihood Learning

  • We can fit the parameters to this model using maximum

likelihood

  • Parameters are µ1, µ2, Σ−1, p(C1) ≡ π, p(C2) ≡ 1 − π
  • Refer to as θ
  • For a datapoint xn from class C1 (tn = 1):

p(xn, C1) = p(C1)p(xn|C1) = πN(xn|µ1, Σ)

  • For a datapoint xn from class C2 (tn = 0):

p(xn, C2) = p(C2)p(xn|C2) = (1 − π)N(xn|µ2, Σ)

SLIDE 78

Maximum Likelihood Learning

  • The likelihood of the training data is:
      p(t|π, µ1, µ2, Σ) = Π_{n=1}^N [πN(xn|µ1, Σ)]^{tn} [(1 − π)N(xn|µ2, Σ)]^{1−tn}
  • As usual, ln is our friend:
      ℓ(t; θ) = Σ_{n=1}^N [ tn ln π + (1 − tn) ln(1 − π) ] + Σ_{n=1}^N [ tn ln N(xn|µ1, Σ) + (1 − tn) ln N(xn|µ2, Σ) ]
    The first sum involves only π; the second involves only µ1, µ2, Σ
  • Maximize for each separately

SLIDE 80

Discriminant Functions Generative Models Discriminative Models

Maximum Likelihood Learning - Class Priors

  • Maximization with respect to the class priors parameter π

is straightforward: ∂ ∂πℓ(t; θ) =

N

  • n=1

tn π − 1 − tn 1 − π ⇒ π = N1 N1 + N2

  • N1 and N2 are the number of training points in each class
  • Prior is simply the fraction of points in each class
slide-81
SLIDE 81

Discriminant Functions Generative Models Discriminative Models

Maximum Likelihood Learning - Class Priors

  • Maximization with respect to the class priors parameter π

is straightforward: ∂ ∂πℓ(t; θ) =

N

  • n=1

tn π − 1 − tn 1 − π ⇒ π = N1 N1 + N2

  • N1 and N2 are the number of training points in each class
  • Prior is simply the fraction of points in each class
slide-82
SLIDE 82

Discriminant Functions Generative Models Discriminative Models

Maximum Likelihood Learning - Class Priors

  • Maximization with respect to the class priors parameter π

is straightforward: ∂ ∂πℓ(t; θ) =

N

  • n=1

tn π − 1 − tn 1 − π ⇒ π = N1 N1 + N2

  • N1 and N2 are the number of training points in each class
  • Prior is simply the fraction of points in each class
slide-83
SLIDE 83

Discriminant Functions Generative Models Discriminative Models

Maximum Likelihood Learning - Gaussian Parameters

  • The other parameters can also be found in the same

fashion

  • Class means:

µ1 = 1 N1

N

  • n=1

tnxn µ2 = 1 N2

N

  • n=1

(1 − tn)xn

  • Means of training examples from each class
  • Shared covariance matrix:

Σ = N1 N 1 N1

  • n∈C1

(xn−µ1)(xn−µ1)T+N2 N 1 N2

  • n∈C2

(xn−µ2)(xn−µ2)T

  • Weighted average of class covariances
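A sketch of these ML estimates on assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(3.0, 1.0, (70, 2))])
t = np.concatenate([np.ones(30), np.zeros(70)])  # tn = 1 for C1, 0 for C2

N1, N2 = t.sum(), (1 - t).sum()
pi = N1 / (N1 + N2)                        # class prior p(C1)
mu1 = (t[:, None] * X).sum(axis=0) / N1    # mean of the C1 examples
mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2

S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)    # weighted average of class covariances
print(pi, mu1, mu2, Sigma, sep="\n")
```
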
SLIDE 85

Discriminant Functions Generative Models Discriminative Models

Probabilistic Generative Models Summary

  • Fitting Gaussian using ML criterion is sensitive to outliers
  • Simple linear form for a in logistic sigmoid occurs for more

than just Gaussian distributions

  • Arises for any distribution in the exponential family, a large

class of distributions

SLIDE 86

Outline

Discriminant Functions Generative Models Discriminative Models

SLIDE 87

Probabilistic Discriminative Models

  • The generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian)
  • This resulted in a logistic sigmoid of a linear function of x
  • A discriminative model explicitly uses the functional form
      p(C1|x) = 1 / (1 + exp(−(wTx + w0)))
    and finds w directly
  • For the generative model we had 2M + M(M + 1)/2 + 1 parameters, where M is the dimensionality of x
  • The discriminative model will have M + 1 parameters
slide-88
SLIDE 88

Discriminant Functions Generative Models Discriminative Models

Probabilistic Discriminative Models

  • Generative model made assumptions about form of

class-conditional distributions (e.g. Gaussian)

  • Resulted in logistic sigmoid of linear function of x
  • Discriminative model - explicitly use functional form

p(C1|x) = 1 1 + exp(−wTx + w0) and find w directly

  • For the generative model we had 2M + M(M + 1)/2 + 1

parameters

  • M is dimensionality of x
  • Discriminative model will have M + 1 parameters
slide-89
SLIDE 89

Discriminant Functions Generative Models Discriminative Models

Probabilistic Discriminative Models

  • Generative model made assumptions about form of

class-conditional distributions (e.g. Gaussian)

  • Resulted in logistic sigmoid of linear function of x
  • Discriminative model - explicitly use functional form

p(C1|x) = 1 1 + exp(−wTx + w0) and find w directly

  • For the generative model we had 2M + M(M + 1)/2 + 1

parameters

  • M is dimensionality of x
  • Discriminative model will have M + 1 parameters
slide-90
SLIDE 90

Discriminant Functions Generative Models Discriminative Models

Maximum Likelihood Learning - Discriminative Model

  • As usual we can use the maximum likelihood criterion for

learning p(t|w) =

N

  • n=1

ytn

n {1 − yn}1−tn ; where yn = p(C1|xn)

  • Taking ln and derivative gives:

∇ℓ(w) =

N

  • n=1

(yn − tn)xn

  • This time no closed-form solution since yn = σ(wTx)
  • Could use (stochastic) gradient descent
  • But Iterative Reweighted Least Squares (IRLS) is a better

technique.
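A gradient-descent sketch for logistic regression on assumed toy data (IRLS, mentioned above, would converge faster):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
t = np.concatenate([np.zeros(50), np.ones(50)])
Xa = np.hstack([X, np.ones((100, 1))])  # append a constant feature for w0

w, eta = np.zeros(3), 0.1
for _ in range(500):
    y = 1 / (1 + np.exp(-(Xa @ w)))     # yn = sigma(w^T xn)
    w -= eta * Xa.T @ (y - t) / 100     # step along -grad E(w)
print(w)
```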

SLIDE 94

Generative vs. Discriminative

  • Generative models
      • Can generate synthetic example data - perhaps accurate classification is equivalent to accurate synthesis
      • Support learning with missing data
      • Tend to have more parameters
      • Require a good model of the class distributions
  • Discriminative models
      • Only usable for classification
      • Don’t solve a harder problem than you need to
      • Tend to have fewer parameters
      • Require a good model of the decision boundary

SLIDE 95

Conclusion

  • Readings: Ch. 4.1.1-4.1.4, 4.1.7, 4.2.1-4.2.2, 4.3.1-4.3.3
  • Generalized linear models y(x) = f(wTx + w0)
  • Threshold/max function for f(·)
  • Minimize with least squares
  • Fisher criterion - class separation
  • Perceptron criterion - misclassified examples
  • Probabilistic models: logistic sigmoid / softmax for f(·)
  • Generative model - assume class-conditional densities in the exponential family; obtain the sigmoid
  • Discriminative model - directly model the posterior using the sigmoid (a.k.a. logistic regression, though it performs classification)
  • Can learn either using maximum likelihood
  • All of these models are limited to linear decision boundaries in feature space