COMP24111: Machine Learning and Optimisation, Chapter 3: Logistic Regression



SLIDE 1

COMP24111: Machine Learning and Optimisation
Chapter 3: Logistic Regression

Dr. Tingting Mu
Email: tingting.mu@manchester.ac.uk

SLIDE 2

Outline

  • Understand the concept of likelihood.
  • Know some simple ways to build a likelihood function for classification and regression.
  • Understand the logistic regression model.
  • Understand the Newton-Raphson update and iterative reweighted least squares.
  • Understand the linear basis function model (a nonlinear model).

SLIDE 3

Linear Regression: Least Squares (Chapter 2)

  • The model assumes a linear relationship between the input variables and the estimated output variable: $\hat{y} = \mathbf{w}^T \tilde{\mathbf{x}}$.
  • Model parameters are fitted by minimising the sum-of-squares error.

[Figure: training samples (x, y) with the fitted regression line.]

A different way to interpret this?

SLIDE 4

Probabilistic View

  • Assume the output variable is a random number.
  • It is generated by adding noise to a linear function:

$$y = f(\mathbf{x}) + \text{noise} = \mathbf{w}^T \tilde{\mathbf{x}} + \text{noise}$$

  • Optimise w by maximising the chance of observing the training samples.
  • What is the chance we observe this sample?

[Figure: training samples (x, y) with the fitted line.]

SLIDE 5

Likelihood

  • In informal contexts, likelihood means probability.
  • It is a function of the parameters of a statistical model, computed with the given data.
  • A more formal definition: the likelihood of a set of parameter values (w) given the observed data (x) is the probability assumed for the observed data given those parameter values:

$$\text{Likelihood}(\mathbf{w} \mid \mathbf{x}) = p(\mathbf{x} \mid \mathbf{w}),$$ written $L(\mathbf{w})$ for simplicity.

  • Maximum likelihood estimator: the model parameters are optimised so that the probability of observing the training data is maximised.

SLIDE 6

Maximum Likelihood for Linear Regression

  • The output variable is a random number: $y = \mathbf{w}^T \tilde{\mathbf{x}} + \text{noise}$.
  • Noise is a random number. It follows a Gaussian distribution with zero mean (μ = 0).

Gaussian distribution with mean μ, variance σ² (standard deviation σ):

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Standard deviation quantifies the amount of variation of a set of data values.

[Figure: Gaussian densities p(x) for (μ=0, σ=1), (μ=0, σ=2), (μ=1, σ=1), from https://kanbanize.com/blog/normal-gaussian-distribution-over-cycle-time/]
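To make the density concrete, here is a minimal NumPy sketch (illustrative, not part of the slides) that evaluates $N(x \mid \mu, \sigma^2)$ for the three parameter settings shown in the figure; the helper name gaussian_pdf is my own.

```python
# A minimal sketch (illustrative): the univariate Gaussian density on this slide.
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """N(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

x = np.linspace(-5, 5, 201)
for mu, sigma in [(0, 1), (0, 2), (1, 1)]:   # the three curves shown in the figure
    p = gaussian_pdf(x, mu, sigma)
    print(f"mu={mu}, sigma={sigma}: peak density = {p.max():.3f}")
```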

SLIDE 7

Maximum Likelihood for Linear Regression

  • Because $y = \mathbf{w}^T \tilde{\mathbf{x}} + \text{noise}$, the output variable also follows a Gaussian distribution and its mean is $\mu = \mathbf{w}^T \tilde{\mathbf{x}}$:

$$p(y \mid \mathbf{x}, \mathbf{w}, \beta) = N(y \mid \mathbf{w}^T \tilde{\mathbf{x}}, \beta^{-1}),$$ where β is the noise precision (inverse variance), β⁻¹ = σ².

  • Probability of observing the i-th training sample:

$$p(y_i \mid \mathbf{x}_i, \mathbf{w}, \beta) = N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1})$$

  • Probability of observing all the N training samples:

$$p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i, \mathbf{w}, \beta) = \prod_{i=1}^{N} N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1})$$

SLIDE 8

Maximum Likelihood for Linear Regression

  • Likelihood function:

$$L(\mathbf{w}, \beta) = \prod_{i=1}^{N} N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1}), \qquad \text{where } N(x \mid \mu, \beta^{-1}) = \frac{1}{\sqrt{2\pi\beta^{-1}}} \exp\left(-\frac{\beta (x-\mu)^2}{2}\right)$$

  • Log-likelihood function: taking the logarithm of the likelihood function,

$$O(\mathbf{w}, \beta) = \ln L(\mathbf{w}, \beta) = \sum_{i=1}^{N} \ln N(y_i \mid \mathbf{w}^T \tilde{\mathbf{x}}_i, \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta \cdot \frac{1}{2} \sum_{i=1}^{N} \left(y_i - \mathbf{w}^T \tilde{\mathbf{x}}_i\right)^2$$

The last term contains the sum-of-squares error function from Chapter 2.

  • Optimising w by maximising the log-likelihood function is therefore equivalent to minimising the sum-of-squares error function, under the assumption of additive zero-mean Gaussian noise.
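As a quick numerical check of this equivalence, the sketch below fits w by the normal equations and compares the negative log-likelihood at that solution with the one at the true generating weights. It is illustrative only: the synthetic data, the noise level, and names such as X_tilde and w_ml are my own assumptions.

```python
# A minimal sketch (illustrative): for zero-mean Gaussian noise, the w that
# maximises the log-likelihood is the least-squares solution.
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(1, 3, size=N)
X_tilde = np.column_stack([np.ones(N), x])          # augmented inputs [1, x]
w_true = np.array([1.0, 3.5])
y = X_tilde @ w_true + rng.normal(0, 0.5, size=N)   # y = w^T x~ + Gaussian noise

# Least-squares / maximum-likelihood solution via the normal equations.
w_ml = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

def neg_log_likelihood(w, beta=1.0 / 0.5 ** 2):
    r = y - X_tilde @ w
    return -(N / 2) * np.log(beta) + (N / 2) * np.log(2 * np.pi) + 0.5 * beta * (r @ r)

print("w_ml:", w_ml)
print("NLL at w_ml   :", neg_log_likelihood(w_ml))
print("NLL at w_true :", neg_log_likelihood(w_true))  # never lower than at w_ml
```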

SLIDE 9

Multivariate Gaussian Distribution

  • Multivariate Gaussian distribution with mean vector μ and covariance matrix Σ:

$$N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}{2}\right)$$

  • A bivariate example:

$$N\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}\right) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\right]\right)$$

Here ρ is the correlation between x1 and x2. Covariance measures the joint variability of two random variables: cov(x, y) = E[(x − E[x])(y − E[y])].

[Figure: contour and surface plots of p(x1, x2) for three cases. Case 1: μ1 = μ2 = 0, σ1 = σ2 = 1, ρ = 0. Case 2: μ1 = μ2 = 0, σ1 = σ2 = 1, ρ = 0.5. Case 3: μ1 = μ2 = 1, σ1 = 0.2, σ2 = 1, ρ = 0.]

SLIDE 10

Maximum Likelihood for Binary Classification (Gaussian Distribution)

  • The probability of observing a sample belonging to one of the two possible classes follows the Bernoulli distribution (a simple probabilistic model for flipping coins):

$$p(\mathbf{x}, y) = \theta(\mathbf{x}, 1)^y \, \theta(\mathbf{x}, 0)^{1-y} = \begin{cases} \theta(\mathbf{x}, 1), & \text{if } y = 1, \\ \theta(\mathbf{x}, 0), & \text{if } y = 0. \end{cases}$$

Coin flip analogy: $p = \theta_1^y \theta_2^{1-y} = \theta_1$ if y = 1 (head), $\theta_2$ if y = 0 (tail).

  • Samples from each class are random variables following a Gaussian distribution:

$$\theta(\mathbf{x}, 1) = p(C_1)\, p(\mathbf{x} \mid C_1) = \alpha\, N(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}), \qquad \theta(\mathbf{x}, 0) = p(C_2)\, p(\mathbf{x} \mid C_2) = (1-\alpha)\, N(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$

Prior class probabilities: $p(C_1) = \alpha$, $p(C_2) = 1 - \alpha$. Assume the two classes have different mean vectors (μ1 and μ2) but share the same covariance matrix Σ.

SLIDE 11

Maximum Likelihood for Binary Classification (Gaussian Distribution)

  • Likelihood function:

$$L(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{i=1}^{N} p(\mathbf{x}_i, y_i) = \prod_{i=1}^{N} \left[\alpha N(\mathbf{x}_i \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})\right]^{y_i} \left[(1-\alpha) N(\mathbf{x}_i \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\right]^{1-y_i}$$

  • Log-likelihood function:

$$O(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \ln L(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \sum_{i=1}^{N} \left[y_i \ln\alpha + (1-y_i)\ln(1-\alpha)\right] + \sum_{i=1}^{N} y_i \ln N(\mathbf{x}_i \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) + \sum_{i=1}^{N} (1-y_i) \ln N(\mathbf{x}_i \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$

SLIDE 12

Maximum Likelihood for Binary Classification (Gaussian Distribution)

  • We need to decide the optimal setting of the following model parameters: α (class prior), μ1 (mean vector of class 1), μ2 (mean vector of class 2), and Σ (shared covariance matrix for both classes).
  • Optimal parameters are obtained by setting the gradients to zero:

$$\frac{\partial O}{\partial \alpha} = 0 \;\Rightarrow\; \alpha^* = \frac{N_1}{N}$$

$$\frac{\partial O}{\partial \boldsymbol{\mu}_1} = 0 \;\Rightarrow\; \boldsymbol{\mu}_1^* = \frac{1}{N_1} \sum_{i=1}^{N} y_i \mathbf{x}_i, \qquad \frac{\partial O}{\partial \boldsymbol{\mu}_2} = 0 \;\Rightarrow\; \boldsymbol{\mu}_2^* = \frac{1}{N_2} \sum_{i=1}^{N} (1-y_i) \mathbf{x}_i$$

$$\frac{\partial O}{\partial \boldsymbol{\Sigma}} = 0 \;\Rightarrow\; \boldsymbol{\Sigma}^* = \frac{N_1}{N} \boldsymbol{\Sigma}_1 + \frac{N_2}{N} \boldsymbol{\Sigma}_2, \quad \text{where } \boldsymbol{\Sigma}_C = \frac{1}{N_C} \sum_{i \in \text{Class } C} (\mathbf{x}_i - \boldsymbol{\mu}_C)(\mathbf{x}_i - \boldsymbol{\mu}_C)^T, \; C = 1, 2$$

  • The prior probability of a class is simply the fraction of the training samples in that class.
  • The mean vector of each class is simply the average of the training samples in that class.
  • The covariance matrix is a weighted average of the covariance matrices associated with each of the two classes.
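These closed-form estimates are easy to compute directly. The sketch below is a minimal illustration; the synthetic data and the helper name fit_gaussian_classifier are assumptions, not part of the slides.

```python
# A minimal sketch (illustrative): closed-form maximum-likelihood estimates for
# the shared-covariance Gaussian class-conditional model on this slide.
import numpy as np

def fit_gaussian_classifier(X, y):
    """X: (N, d) feature matrix, y: (N,) labels in {0, 1} (1 = class 1, 0 = class 2)."""
    N = len(y)
    N1, N2 = y.sum(), N - y.sum()
    alpha = N1 / N                                   # alpha* = N1 / N
    mu1 = (y[:, None] * X).sum(axis=0) / N1          # mu1* = (1/N1) sum_i y_i x_i
    mu2 = ((1 - y)[:, None] * X).sum(axis=0) / N2    # mu2* = (1/N2) sum_i (1 - y_i) x_i
    S1 = np.cov(X[y == 1].T, bias=True)              # per-class covariance Sigma_1
    S2 = np.cov(X[y == 0].T, bias=True)              # per-class covariance Sigma_2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2            # shared covariance Sigma*
    return alpha, mu1, mu2, Sigma

# Tiny synthetic example with two 2-D Gaussian classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (20, 2)), rng.normal([3, 3], 1.0, (20, 2))])
y = np.array([1] * 20 + [0] * 20)
alpha, mu1, mu2, Sigma = fit_gaussian_classifier(X, y)
print("alpha* =", alpha, "\nmu1* =", mu1, "\nmu2* =", mu2, "\nSigma* =\n", Sigma)
```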

SLIDE 13

Example: Binary Classification

20 training samples from class A, each characterised by 2 features. 20 training samples from class B, each characterised by 2 features.

[Figure: training samples and separation boundary.]

Red region: $p(y = \text{class A}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*) < p(y = \text{class B}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*)$

Blue region: $p(y = \text{class A}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*) \geq p(y = \text{class B}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*)$

SLIDE 14

  • We just used the following model to formulate a likelihood function for binary classification:

$$p(\mathbf{x}, y) = \theta(\mathbf{x}, 1)^y \, \theta(\mathbf{x}, 0)^{1-y}, \qquad \theta(\mathbf{x}, 1) = \alpha N(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}), \quad \theta(\mathbf{x}, 0) = (1-\alpha) N(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}),$$

$$L(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{i=1}^{N} p(\mathbf{x}_i, y_i)$$

Here θ(x, 1) is the probability of observing (x, class 1), and θ(x, 0) is the probability of observing (x, class 2).

  • Is there another way to formulate the likelihood function for classification?

SLIDE 15

Logistic Regression: Binary Classification

  • Another way to construct the likelihood function is, given class label y ∈ {0, 1}:

$$p(y \mid \mathbf{x}) = \theta(y=1 \mid \mathbf{x})^y \left[\theta(y=0 \mid \mathbf{x})\right]^{1-y} = \theta(y=1 \mid \mathbf{x})^y \left[1 - \theta(y=1 \mid \mathbf{x})\right]^{1-y}$$

$$\text{Likelihood} = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i)$$

Here θ(y = 1 | x) is the probability that an observed sample x is from class 1, and θ(y = 0 | x) is the probability that it is from class 0, with θ(y = 0 | x) + θ(y = 1 | x) = 1.

  • We directly model

$$\theta(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \tilde{\mathbf{x}}) = \frac{1}{1 + \exp(-\mathbf{w}^T \tilde{\mathbf{x}})}$$

where $\mathbf{w}^T \tilde{\mathbf{x}}$ is the linear model learned in Chapter 2, and $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is called the logistic sigmoid function.

SLIDE 16

Logistic Regression: Binary Classification

  • Logistic regression likelihood function:

$$p(y \mid \mathbf{x}) = \theta(y=1 \mid \mathbf{x})^y \left[1 - \theta(y=1 \mid \mathbf{x})\right]^{1-y}, \qquad L(\mathbf{w}) = \prod_{i=1}^{N} \sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i)^{y_i} \left[1 - \sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i)\right]^{1-y_i}$$

with the logistic sigmoid function $\sigma(\mathbf{w}^T \tilde{\mathbf{x}}) = \frac{1}{1 + \exp(-\mathbf{w}^T \tilde{\mathbf{x}})}$.

  • The cross-entropy error function is the negative logarithm of the likelihood:

$$O(\mathbf{w}) = -\ln L(\mathbf{w}) = -\sum_{i=1}^{N} y_i \ln \sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i) - \sum_{i=1}^{N} (1 - y_i) \ln\left[1 - \sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i)\right]$$

  • The optimal setting of w is found by maximising the likelihood (equivalent to minimising the cross-entropy error). The gradient of the error function with respect to w is given by

$$\nabla O(\mathbf{w}) = \sum_{i=1}^{N} \left(\sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i) - y_i\right) \tilde{\mathbf{x}}_i$$

Remember the linear least squares model? Its gradient has the same form: $\nabla O(\mathbf{w}) = \sum_{i=1}^{N} \left(\tilde{\mathbf{x}}_i^T \mathbf{w} - y_i\right) \tilde{\mathbf{x}}_i$.
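A minimal sketch of this optimisation using plain gradient descent on the cross-entropy error; the learning rate, iteration count, synthetic data, and function names are illustrative assumptions (the slides themselves move on to the Newton-Raphson update next).

```python
# A minimal sketch (illustrative): binary logistic regression trained by
# gradient descent using the gradient sum_i (sigma(w^T x_i) - y_i) x_i.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(X, y, lr=0.1, n_iters=1000):
    """X: (N, d) inputs, y: (N,) labels in {0, 1}. Returns the weight vector w."""
    X_tilde = np.column_stack([np.ones(len(X)), X])   # prepend a bias term
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iters):
        pi = sigmoid(X_tilde @ w)                      # pi_i = sigma(w^T x_i)
        grad = X_tilde.T @ (pi - y)                    # sum_i (sigma(w^T x_i) - y_i) x_i
        w -= lr * grad / len(y)                        # averaged gradient step
    return w

# Tiny synthetic check: two overlapping 1-D classes.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])[:, None]
y = np.array([0] * 50 + [1] * 50)
w = fit_logistic_gd(X, y)
acc = ((sigmoid(np.column_stack([np.ones(100), X]) @ w) >= 0.5) == y).mean()
print("w =", w, "training accuracy:", acc)
```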

SLIDE 17

Newton-Raphson Update

  • Gradient descent update: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \nabla O(\mathbf{w}^{(t)})$, where η > 0 is the learning rate.
  • Newton-Raphson update: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1} \nabla O(\mathbf{w}^{(t)})$, where H is the Hessian matrix.
  • The Hessian matrix $\mathbf{H} = [H_{ij}]$ consists of the second-order derivatives of the objective function: $H_{ij} = \frac{\partial^2 O}{\partial w_i \partial w_j}$.

Example: applying the Newton-Raphson update to the linear least squares model we learned in Chapter 2,

$$O(\mathbf{w}) = \frac{1}{2}\mathbf{Y}^T\mathbf{Y} - \mathbf{w}^T \tilde{\mathbf{X}}^T \mathbf{Y} + \frac{1}{2}\mathbf{w}^T \tilde{\mathbf{X}}^T \tilde{\mathbf{X}} \mathbf{w}, \qquad \nabla O(\mathbf{w}) = -\tilde{\mathbf{X}}^T \mathbf{Y} + \tilde{\mathbf{X}}^T \tilde{\mathbf{X}} \mathbf{w}, \qquad \mathbf{H} = \nabla\left(\nabla O(\mathbf{w})\right) = \tilde{\mathbf{X}}^T \tilde{\mathbf{X}}$$

The update becomes

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1}\left(-\tilde{\mathbf{X}}^T \mathbf{Y} + \tilde{\mathbf{X}}^T \tilde{\mathbf{X}} \mathbf{w}^{(t)}\right) = (\tilde{\mathbf{X}}^T \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{Y}$$

It finds the optimal solution in one iteration!
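The one-step behaviour is easy to verify numerically; the sketch below (synthetic data and variable names are my own assumptions) takes a single Newton-Raphson step from w = 0 and compares it with the normal-equations solution.

```python
# A minimal sketch (illustrative): one Newton-Raphson step on the linear least
# squares objective lands exactly on the normal-equations solution.
import numpy as np

rng = np.random.default_rng(0)
X_tilde = np.column_stack([np.ones(30), rng.uniform(0, 5, 30)])  # augmented inputs
Y = X_tilde @ np.array([2.0, -1.0]) + rng.normal(0, 0.3, 30)

w = np.zeros(2)                                   # arbitrary starting point
grad = -X_tilde.T @ Y + X_tilde.T @ X_tilde @ w   # gradient of O(w)
H = X_tilde.T @ X_tilde                           # Hessian of O(w)
w_next = w - np.linalg.solve(H, grad)             # one Newton-Raphson step

w_normal_eq = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)
print(np.allclose(w_next, w_normal_eq))           # True: optimal in one iteration
```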

SLIDE 18

Iterative Reweighted Least Squares

  • Optimise the logistic regression model using the Newton-Raphson update.

Notation: $\pi_i = \sigma(\mathbf{w}^T \tilde{\mathbf{x}}_i)$ (scalar), $\boldsymbol{\pi} = [\pi_1, \pi_2, \ldots, \pi_N]^T$ (column vector), and $\mathbf{S} = \mathrm{diag}\left(\pi_1(1-\pi_1), \pi_2(1-\pi_2), \ldots, \pi_N(1-\pi_N)\right)$ (diagonal matrix).

– Gradient vector and Hessian matrix:

$$\nabla O(\mathbf{w}) = \sum_{i=1}^{N} (\pi_i - y_i)\tilde{\mathbf{x}}_i = \tilde{\mathbf{X}}^T (\boldsymbol{\pi} - \mathbf{Y}), \qquad \mathbf{H} = \sum_{i=1}^{N} \pi_i (1-\pi_i) \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^T = \tilde{\mathbf{X}}^T \mathbf{S} \tilde{\mathbf{X}}$$

– Newton-Raphson update:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1}\nabla O(\mathbf{w}^{(t)}) = \mathbf{w}^{(t)} - (\tilde{\mathbf{X}}^T \mathbf{S} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T (\boldsymbol{\pi} - \mathbf{Y}) = (\tilde{\mathbf{X}}^T \mathbf{S} \tilde{\mathbf{X}})^{-1}\left[\tilde{\mathbf{X}}^T \mathbf{S} \tilde{\mathbf{X}} \mathbf{w}^{(t)} + \tilde{\mathbf{X}}^T (\mathbf{Y} - \boldsymbol{\pi})\right] = (\tilde{\mathbf{X}}^T \mathbf{S} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{z}, \quad \text{where } \mathbf{z} = \mathbf{S}\tilde{\mathbf{X}}\mathbf{w}^{(t)} + (\mathbf{Y} - \boldsymbol{\pi})$$

The logistic regression model optimised through the Newton-Raphson update is known as iterative reweighted least squares (IRLS). With iteration-indexed quantities:

$$\mathbf{w}^{(t+1)} = (\tilde{\mathbf{X}}^T \mathbf{S}^{(t)} \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T \mathbf{z}^{(t)}, \qquad \mathbf{z}^{(t)} = \mathbf{S}^{(t)}\tilde{\mathbf{X}}\mathbf{w}^{(t)} + (\mathbf{Y} - \boldsymbol{\pi}^{(t)}),$$

$$\boldsymbol{\pi}^{(t)} = [\pi_1^{(t)}, \pi_2^{(t)}, \ldots, \pi_N^{(t)}]^T, \quad \pi_i^{(t)} = \sigma(\tilde{\mathbf{x}}_i^T \mathbf{w}^{(t)}), \qquad \mathbf{S}^{(t)} = \mathrm{diag}(s_1^{(t)}, s_2^{(t)}, \ldots, s_N^{(t)}), \quad s_i^{(t)} = \pi_i^{(t)}(1 - \pi_i^{(t)})$$
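A minimal NumPy sketch of the IRLS loop above; the helper name fit_logistic_irls, the iteration count, and the synthetic data are assumptions, and no safeguards against perfectly separable data are included.

```python
# A minimal sketch (illustrative): iterative reweighted least squares (IRLS)
# for binary logistic regression, following the Newton-Raphson update above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_irls(X_tilde, Y, n_iters=10):
    """X_tilde: (N, d+1) augmented inputs, Y: (N,) labels in {0, 1}."""
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iters):
        pi = sigmoid(X_tilde @ w)                   # pi_i = sigma(x_i^T w)
        s = pi * (1.0 - pi)                         # diagonal of S
        z = s * (X_tilde @ w) + (Y - pi)            # z = S X w + (Y - pi)
        H = X_tilde.T @ (s[:, None] * X_tilde)      # H = X^T S X
        w = np.linalg.solve(H, X_tilde.T @ z)       # w = (X^T S X)^{-1} X^T z
    return w

# Tiny synthetic usage example with two overlapping 1-D classes.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
X_tilde = np.column_stack([np.ones(100), x])
Y = np.array([0] * 50 + [1] * 50)
print("IRLS weights:", fit_logistic_irls(X_tilde, Y))
```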

SLIDE 19

Iris Classification Example

Model: $y = \mathbf{w}^T \tilde{\mathbf{x}} = w_0 + w_1 x_1 + w_2 x_2$.

[Figure: Virginica and Versicolour samples plotted by sepal length in cm and petal width in cm, with separation boundaries for w optimised by the normal equations and for w optimised by the IRLS update.]

SLIDE 20

Multi-class Logistic Regression

  • Each sample has c binary output variables, each indicating whether the sample belongs to a class:

$$\mathbf{y} = [y_1, y_2, \ldots, y_c], \quad y_k \in \{0, 1\}$$

  • The probability of observing the output y given its input x is

$$p(\mathbf{y} \mid \mathbf{x}) = \theta(C_1 \mid \mathbf{x})^{y_1}\, \theta(C_2 \mid \mathbf{x})^{y_2} \cdots \theta(C_c \mid \mathbf{x})^{y_c} = \prod_{k=1}^{c} \theta(C_k \mid \mathbf{x})^{y_k}, \qquad \text{where } \sum_{k=1}^{c} \theta(C_k \mid \mathbf{x}) = 1$$

  • Likelihood function: $\text{Likelihood} = \prod_{i=1}^{N} p(\mathbf{y}_i \mid \mathbf{x}_i)$.
  • θ(Ck | x) is the probability that an observed sample x is from class k. Here we use the softmax function to estimate this probability from the linear model $a_k = \mathbf{w}_k^T \tilde{\mathbf{x}}$:

$$\theta(C_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_{j=1}^{c} \exp(a_j)}, \quad \text{where } a_k = \mathbf{w}_k^T \tilde{\mathbf{x}}.$$

SLIDE 21

Multi-class Logistic Regression

  • Likelihood function computed over N observed training samples:

$$L(\mathbf{W}) = \prod_{i=1}^{N} \prod_{k=1}^{c} \left(\frac{\exp(\mathbf{w}_k^T \tilde{\mathbf{x}}_i)}{\sum_{j=1}^{c} \exp(\mathbf{w}_j^T \tilde{\mathbf{x}}_i)}\right)^{y_{ik}}$$

  • Cross-entropy error function:

$$O(\mathbf{W}) = -\ln L(\mathbf{W}) = -\sum_{i=1}^{N} \sum_{k=1}^{c} y_{ik} \ln\left(\frac{\exp(\mathbf{w}_k^T \tilde{\mathbf{x}}_i)}{\sum_{j=1}^{c} \exp(\mathbf{w}_j^T \tilde{\mathbf{x}}_i)}\right)$$

  • Gradient of the error function with respect to the coefficient vector wk, writing $\pi_{ik} = \theta(C_k \mid \mathbf{x}_i)$:

$$\frac{\partial O(\mathbf{W})}{\partial \mathbf{w}_k} = \sum_{i=1}^{N} (\pi_{ik} - y_{ik})\, \tilde{\mathbf{x}}_i$$
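As an illustration, the sketch below trains a softmax (multi-class logistic regression) model by plain gradient descent using exactly this gradient; the data, learning rate, and function names are my own assumptions rather than part of the lecture.

```python
# A minimal sketch (illustrative): multi-class logistic (softmax) regression
# trained by gradient descent on the cross-entropy error above.
import numpy as np

def softmax(A):
    """Row-wise softmax of the activation matrix A = X W (shape (N, c))."""
    A = A - A.max(axis=1, keepdims=True)            # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_gd(X_tilde, Y, lr=0.1, n_iters=500):
    """X_tilde: (N, d+1) augmented inputs, Y: (N, c) one-hot labels."""
    N, c = Y.shape
    W = np.zeros((X_tilde.shape[1], c))             # one weight vector w_k per class
    for _ in range(n_iters):
        Pi = softmax(X_tilde @ W)                   # pi_ik = theta(C_k | x_i)
        grad = X_tilde.T @ (Pi - Y)                 # column k is sum_i (pi_ik - y_ik) x_i
        W -= lr * grad / N
    return W

# Tiny usage example with three Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(3), 30)
Y = np.eye(3)[labels]                               # one-hot encoding
X_tilde = np.column_stack([np.ones(len(X)), X])
W = fit_softmax_gd(X_tilde, Y)
print("training accuracy:", (softmax(X_tilde @ W).argmax(axis=1) == labels).mean())
```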

SLIDE 22

Linear Basis Function Model

  • So far we have worked with a linear combination of the input variables. This gives a linear model:

$$\hat{y} = \mathbf{w}^T \tilde{\mathbf{x}}, \quad \text{where } \mathbf{w} = [w_0, w_1, w_2, \ldots, w_d]^T \text{ and } \tilde{\mathbf{x}} = \begin{bmatrix} 1 \\ \mathbf{x} \end{bmatrix}.$$

  • Nonlinear model: assume the estimated output variable is a linear combination of fixed nonlinear functions (basis functions) of the input variables:

$$\hat{y} = w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \ldots + w_D \phi_D(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}),$$

$$\text{where } \mathbf{w} = [w_0, w_1, w_2, \ldots, w_D]^T \text{ and } \boldsymbol{\phi}(\mathbf{x}) = [1, \phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_D(\mathbf{x})]^T.$$

A basis function example (μi and σi are basis function parameters):

$$\phi_i(\mathbf{x}) = \exp\left(-\frac{\sum_{j=1}^{d}(x_j - \mu_{ij})^2}{2\sigma_i^2}\right) = \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^2}{2\sigma_i^2}\right).$$

Another basis function example: for a single input variable, $\boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^D]^T$. This is known as polynomial regression. The case D = 1 reduces to linear regression.

$\{\phi_i(\mathbf{x})\}_{i=1}^{D}$ can be viewed as a feature extractor.
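A short sketch of polynomial regression as a linear least squares fit on these basis features; the degree values echo the D = 1, 3, 7 examples on the following slides, while the data and helper names are illustrative assumptions.

```python
# A minimal sketch (illustrative): polynomial regression as a linear least
# squares model on the basis features phi(x) = [1, x, x^2, ..., x^D].
import numpy as np

def poly_features(x, D):
    """Map a 1-D input array x to the (N, D+1) design matrix [1, x, ..., x^D]."""
    return np.vander(x, N=D + 1, increasing=True)

def fit_poly(x, y, D):
    Phi = poly_features(x, D)
    return np.linalg.lstsq(Phi, y, rcond=None)[0]    # least-squares weights

# Fit a noisy sine curve with increasing polynomial degree D.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 25)
y = np.sin(x) + rng.normal(0, 0.1, size=x.shape)
for D in (1, 3, 7):
    w = fit_poly(x, y, D)
    y_hat = poly_features(x, D) @ w
    print(f"D={D}: training RMSE = {np.sqrt(np.mean((y - y_hat) ** 2)):.3f}")
```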

SLIDE 23

A Regression Example

Curve fitting task: construct a curve that has the best fit to a series of data points. Method: incorporate basis functions into a linear least squares model,

$$\hat{y} = \mathbf{w}^T \boldsymbol{\phi}(x), \qquad \boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^D]^T.$$

[Figure: training samples and the fitted regression curve for D = 1, D = 3, and D = 7.]

SLIDE 24

A Regression Example

Testing the fitted curve with new points:

$$\hat{y} = \mathbf{w}^T \boldsymbol{\phi}(x), \qquad \boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^D]^T.$$

[Figure: training samples, testing samples, the ground-truth curve, and the fitted regression curve for D = 1, D = 3, and D = 7.]

SLIDE 25

Iris Classification Example

Iris classification task: incorporate basis functions into the logistic regression model,

$$\boldsymbol{\phi}(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]^T.$$

[Figure: training samples and separation boundary.]

SLIDE 26

Summary

  • In this lecture, we have learned:
– the concept of likelihood,
– simple ways to build a likelihood function,
– the logistic regression model (IRLS algorithm),
– the linear basis function model (a nonlinear extension of a linear model).

  • In the next lecture, we will talk about support vector machines.