COMP24111: Machine Learning and Optimisation
Chapter 3: Logistic Regression
Dr. Tingting Mu
Email: tingting.mu@manchester.ac.uk

Outline
Understand the concept of likelihood.
Know some simple ways to build a likelihood function for …
What is the chance we …
Gaussian distribution with mean $\mu$, variance $\sigma^2$ and standard deviation $\sigma$:

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
The above figure is from https://kanbanize.com/blog/normal-gaussian-distribution-over-cycle-time/
[Figure: Gaussian densities $p(x)$ versus $x$ for $\mu=0,\sigma=1$; $\mu=0,\sigma=2$; $\mu=1,\sigma=1$.]
Standard deviation quantifies the amount of variation of a set of data values.
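As an aside (not from the original slides), a minimal NumPy sketch that evaluates the density above for the three parameter settings shown in the figure:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

x = np.linspace(-5.0, 10.0, 200)
for mu, sigma in [(0.0, 1.0), (0.0, 2.0), (1.0, 1.0)]:
    p = gaussian_pdf(x, mu, sigma)   # density curve p(x)
    print(mu, sigma, p.max())        # peak height is 1 / (sigma * sqrt(2*pi))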
Maximum likelihood with a Gaussian model: given $N$ independently drawn samples, the likelihood is the product $\prod_{i=1}^{N} N(x_i \mid \mu, \sigma^2)$, and taking the logarithm turns this product into the sum $\sum_{i=1}^{N} \ln N(x_i \mid \mu, \sigma^2)$.

For a linear model with Gaussian noise on the targets, maximising the likelihood is equivalent to minimising $\sum_{i=1}^{N}\left(y_i - \mathbf{w}^{T}\tilde{\mathbf{x}}_i\right)^2$. This is the sum-of-squares error function in Chapter 2.
Gaussian written with the precision $\beta$ (the inverse variance):

$$N(x \mid \mu, \beta^{-1}) = \frac{1}{\sqrt{2\pi\beta^{-1}}}\exp\left(-\frac{\beta(x-\mu)^2}{2}\right)$$
Multivariate Gaussian with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$:

$$N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}\exp\left(-\frac{(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}{2}\right)$$
Case 1: $\mu_1 = \mu_2 = 0$, $\sigma_1 = \sigma_2 = 1$, $\rho = 0$. Case 2: $\mu_1 = \mu_2 = 0$, $\sigma_1 = \sigma_2 = 1$, $\rho = 0.5$. Case 3: $\mu_1 = \mu_2 = 1$, $\sigma_1 = 0.2$, $\sigma_2 = 1$, $\rho = 0$.
[Figure: surface plots of the bivariate density $p(x_1, x_2)$ over $(x_1, x_2)$ for the three cases.]
Bivariate Gaussian written in terms of $\sigma_1$, $\sigma_2$ and the correlation $\rho$ between $x_1$ and $x_2$:

$$N\!\left(\begin{bmatrix}x_1\\x_2\end{bmatrix}\,\middle|\,\begin{bmatrix}\mu_1\\\mu_2\end{bmatrix},\begin{bmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{bmatrix}\right) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\!\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\right]\right)$$

Covariance measures the joint variability of two random variables: $\mathrm{cov}(x, y) = E\big[(x - E[x])(y - E[y])\big]$.
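As an aside (not from the original slides), a minimal NumPy sketch that evaluates the bivariate form above and cross-checks it against the general matrix form of the multivariate Gaussian:

import numpy as np

def bivariate_gaussian_pdf(x1, x2, mu1, mu2, sigma1, sigma2, rho):
    """Bivariate Gaussian density written with means, standard deviations and correlation rho."""
    z = ((x1 - mu1) ** 2 / sigma1 ** 2
         + (x2 - mu2) ** 2 / sigma2 ** 2
         - 2.0 * rho * (x1 - mu1) * (x2 - mu2) / (sigma1 * sigma2))
    norm = 2.0 * np.pi * sigma1 * sigma2 * np.sqrt(1.0 - rho ** 2)
    return np.exp(-z / (2.0 * (1.0 - rho ** 2))) / norm

def multivariate_gaussian_pdf(x, mu, Sigma):
    """General d-dimensional Gaussian density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Case 2 from the slide: mu1 = mu2 = 0, sigma1 = sigma2 = 1, rho = 0.5.
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
print(bivariate_gaussian_pdf(0.3, -0.2, 0.0, 0.0, 1.0, 1.0, 0.5))
print(multivariate_gaussian_pdf(np.array([0.3, -0.2]), np.zeros(2), Sigma))  # should match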
Generative model for binary classification: the class label behaves like a coin flip,

$$p = \theta_1^{y}\,\theta_2^{\,1-y} = \begin{cases}\theta_1, & \text{if } y = 1 \text{ (head)},\\ \theta_2, & \text{if } y = 0 \text{ (tail)}.\end{cases}$$

Given the class, the sample $\mathbf{x}$ is drawn from a Gaussian. Assume different classes have different mean vectors ($\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$), but the same covariance matrix $\boldsymbol{\Sigma}$.
Likelihood over $N$ training samples, with class prior $\alpha = p(y=1)$:

$$L(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{i=1}^{N}\left[\alpha\,N(\mathbf{x}_i \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})\right]^{y_i}\left[(1-\alpha)\,N(\mathbf{x}_i \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\right]^{1-y_i}$$

Taking the logarithm gives the objective to maximise:

$$O(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \sum_{i=1}^{N} y_i \ln\!\left[\alpha\,N(\mathbf{x}_i \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})\right] + \sum_{i=1}^{N}(1-y_i)\ln\!\left[(1-\alpha)\,N(\mathbf{x}_i \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\right]$$
Setting the derivatives of $O(\alpha, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$ to zero gives closed-form solutions:

$$\frac{\partial O}{\partial \alpha} = 0 \;\Rightarrow\; \alpha^* = \frac{N_1}{N}$$

$$\frac{\partial O}{\partial \boldsymbol{\mu}_1} = 0 \;\Rightarrow\; \boldsymbol{\mu}_1^* = \frac{1}{N_1}\sum_{i=1}^{N} y_i\,\mathbf{x}_i$$

$$\frac{\partial O}{\partial \boldsymbol{\mu}_2} = 0 \;\Rightarrow\; \boldsymbol{\mu}_2^* = \frac{1}{N_2}\sum_{i=1}^{N}(1-y_i)\,\mathbf{x}_i$$

$$\frac{\partial O}{\partial \boldsymbol{\Sigma}} = 0 \;\Rightarrow\; \boldsymbol{\Sigma}^* = \frac{N_1}{N}\boldsymbol{\Sigma}_1 + \frac{N_2}{N}\boldsymbol{\Sigma}_2, \quad \text{where } \boldsymbol{\Sigma}_C = \frac{1}{N_C}\sum_{i \in \text{Class } C}(\mathbf{x}_i - \boldsymbol{\mu}_C)(\mathbf{x}_i - \boldsymbol{\mu}_C)^{T},\; C = 1, 2$$

$\alpha^*$ is simply the fraction of the training samples in class 1. Each $\boldsymbol{\mu}_C^*$ is simply the average of the training samples in that class. $\boldsymbol{\Sigma}^*$ is a weighted average of the covariance matrices associated with each of the two classes.
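These closed-form estimates are easy to compute directly. Below is a minimal NumPy sketch, assuming X is an N-by-d array of inputs and y a 0/1 label vector (class 1 corresponds to y = 1); the function name and arguments are illustrative, not part of the slides:

import numpy as np

def fit_gaussian_generative(X, y):
    """Closed-form MLE for the shared-covariance Gaussian class-conditional model."""
    N = len(y)
    N1 = y.sum()
    N2 = N - N1
    alpha = N1 / N                                   # alpha* = N1 / N
    mu1 = (y[:, None] * X).sum(axis=0) / N1          # mean of class-1 samples
    mu2 = ((1 - y)[:, None] * X).sum(axis=0) / N2    # mean of class-0 samples
    d1 = X[y == 1] - mu1
    d2 = X[y == 0] - mu2
    Sigma1 = d1.T @ d1 / N1
    Sigma2 = d2.T @ d2 / N2
    Sigma = (N1 / N) * Sigma1 + (N2 / N) * Sigma2    # weighted average of class covariances
    return alpha, mu1, mu2, Sigma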
[Figure: training samples and separation boundary.]

Red region: $p(y = \text{class A}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*) < p(y = \text{class B}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*)$

Blue region: $p(y = \text{class A}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*) \geq p(y = \text{class B}, \mathbf{x} \mid \alpha^*, \boldsymbol{\mu}_1^*, \boldsymbol{\mu}_2^*, \boldsymbol{\Sigma}^*)$
Given class label $y \in \{0, 1\}$:

$$p(y \mid \mathbf{x}) = \theta(y{=}1 \mid \mathbf{x})^{y}\,\big[\theta(y{=}0 \mid \mathbf{x})\big]^{1-y} = \theta(y{=}1 \mid \mathbf{x})^{y}\,\big[1 - \theta(y{=}1 \mid \mathbf{x})\big]^{1-y}$$

$$\text{Likelihood} = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i)$$

$\theta(y{=}1 \mid \mathbf{x})$: given an observed sample $\mathbf{x}$, the probability it is from class 1.
$\theta(y{=}0 \mid \mathbf{x})$: given an observed sample $\mathbf{x}$, the probability it is from class 0.

We model $\theta(y{=}1 \mid \mathbf{x})$ as $\sigma(\mathbf{w}^{T}\tilde{\mathbf{x}})$, where $\mathbf{w}^{T}\tilde{\mathbf{x}}$ is a linear model as learned in Chapter 2 and $\sigma$ is called the logistic sigmoid function:

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$
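A tiny NumPy sketch of this model; the weights and input below are made up purely for illustration:

import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def class1_probability(w, x_tilde):
    """theta(y=1 | x) modelled as sigma(w^T x_tilde), with x_tilde = [1, x1, ..., xd]."""
    return sigmoid(w @ x_tilde)

w = np.array([-0.5, 1.2, 0.7])        # illustrative weights (bias first)
x_tilde = np.array([1.0, 0.4, -0.3])  # augmented input
p1 = class1_probability(w, x_tilde)
print(p1, 1.0 - p1)                   # probabilities of class 1 and class 0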
Plugging the sigmoid model into the likelihood:

$$L(\mathbf{w}) = \prod_{i=1}^{N}\sigma(\mathbf{w}^{T}\tilde{\mathbf{x}}_i)^{y_i}\left[1 - \sigma(\mathbf{w}^{T}\tilde{\mathbf{x}}_i)\right]^{1-y_i}$$

Minimise the negative log-likelihood (the cross-entropy error):

$$O(\mathbf{w}) = -\sum_{i=1}^{N} y_i \ln\sigma(\mathbf{w}^{T}\tilde{\mathbf{x}}_i) - \sum_{i=1}^{N}(1-y_i)\ln\left[1 - \sigma(\mathbf{w}^{T}\tilde{\mathbf{x}}_i)\right]$$

Its gradient is

$$\nabla O(\mathbf{w}) = \sum_{i=1}^{N}\left[\sigma(\mathbf{w}^{T}\tilde{\mathbf{x}}_i) - y_i\right]\tilde{\mathbf{x}}_i$$

Recap: $p(y \mid \mathbf{x}) = \theta(y{=}1 \mid \mathbf{x})^{y}\left[1-\theta(y{=}1 \mid \mathbf{x})\right]^{1-y}$, $\text{Likelihood} = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i)$, logistic sigmoid function: $\sigma(\mathbf{w}^{T}\tilde{\mathbf{x}}) = \dfrac{1}{1+\exp(-\mathbf{w}^{T}\tilde{\mathbf{x}})}$.

Remember the linear least squares model? There, $\nabla O(\mathbf{w}) = \sum_{i=1}^{N}\left(\tilde{\mathbf{x}}_i^{T}\mathbf{w} - y_i\right)\tilde{\mathbf{x}}_i$: the gradient has the same form, with the linear prediction replaced here by $\sigma(\mathbf{w}^{T}\tilde{\mathbf{x}}_i)$.
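The gradient above leads directly to a gradient-descent trainer. A minimal NumPy sketch, assuming X_tilde already contains the bias column and y is a 0/1 label vector (names and step size are illustrative, and the step size may need tuning for a given dataset):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_gradient_descent(X_tilde, y, eta=0.01, n_iters=1000):
    """Minimise the cross-entropy O(w) by gradient descent: w <- w - eta * grad O(w)."""
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iters):
        pi = sigmoid(X_tilde @ w)       # predicted probabilities sigma(w^T x_i)
        grad = X_tilde.T @ (pi - y)     # sum_i (sigma(w^T x_i) - y_i) x_i
        w -= eta * grad
    return w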
Gradient descent update: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\,\nabla O(\mathbf{w}^{(t)})$.

Newton-Raphson update: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1}\nabla O(\mathbf{w}^{(t)})$, where $\mathbf{H}$ is the Hessian matrix.

For the linear least squares model:

$$O(\mathbf{w}) = \frac{1}{2}\mathbf{Y}^{T}\mathbf{Y} - \mathbf{w}^{T}\tilde{\mathbf{X}}^{T}\mathbf{Y} + \frac{1}{2}\mathbf{w}^{T}\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}\mathbf{w}, \qquad \nabla O(\mathbf{w}) = -\tilde{\mathbf{X}}^{T}\mathbf{Y} + \tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}\mathbf{w}, \qquad \mathbf{H} = \nabla\nabla O(\mathbf{w}) = \tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}$$

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \left(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}\right)^{-1}\left(-\tilde{\mathbf{X}}^{T}\mathbf{Y} + \tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}\mathbf{w}^{(t)}\right) = \left(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}\right)^{-1}\tilde{\mathbf{X}}^{T}\mathbf{Y}$$

Find the optimal solution in one iteration!
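To see the one-iteration claim concretely, here is a small NumPy sketch on illustrative random data (variable names are not from the slides): starting from an arbitrary w, a single Newton-Raphson step lands exactly on the normal-equation solution.

import numpy as np

rng = np.random.default_rng(0)
X_tilde = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # augmented design matrix
Y = rng.normal(size=20)

w0 = rng.normal(size=3)                                   # arbitrary starting point
grad = -X_tilde.T @ Y + X_tilde.T @ X_tilde @ w0          # gradient of the quadratic objective
H = X_tilde.T @ X_tilde                                   # Hessian
w1 = w0 - np.linalg.solve(H, grad)                        # one Newton-Raphson step

w_star = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)  # normal-equation solution
print(np.allclose(w1, w_star))                            # True: one iteration is enough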
For logistic regression:

$$\nabla O(\mathbf{w}) = \sum_{i=1}^{N}(\pi_i - y_i)\tilde{\mathbf{x}}_i = \tilde{\mathbf{X}}^{T}(\boldsymbol{\pi} - \mathbf{Y}), \qquad \mathbf{H} = \sum_{i=1}^{N}\pi_i(1-\pi_i)\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^{T} = \tilde{\mathbf{X}}^{T}\mathbf{S}\tilde{\mathbf{X}}$$

Notations: $\pi_i = \sigma(\mathbf{w}^{T}\tilde{\mathbf{x}}_i)$ (scalar), $\boldsymbol{\pi} = [\pi_1, \pi_2, \ldots, \pi_N]^{T}$ (column vector), $\mathbf{S} = \mathrm{diag}\big(\pi_1(1-\pi_1), \pi_2(1-\pi_2), \ldots, \pi_N(1-\pi_N)\big)$ (diagonal matrix).

Newton-Raphson update:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1}\nabla O(\mathbf{w}^{(t)}) = \mathbf{w}^{(t)} - \left(\tilde{\mathbf{X}}^{T}\mathbf{S}\tilde{\mathbf{X}}\right)^{-1}\tilde{\mathbf{X}}^{T}(\boldsymbol{\pi} - \mathbf{Y}) = \left(\tilde{\mathbf{X}}^{T}\mathbf{S}\tilde{\mathbf{X}}\right)^{-1}\left[\tilde{\mathbf{X}}^{T}\mathbf{S}\tilde{\mathbf{X}}\mathbf{w}^{(t)} + \tilde{\mathbf{X}}^{T}(\mathbf{Y} - \boldsymbol{\pi})\right] = \left(\tilde{\mathbf{X}}^{T}\mathbf{S}\tilde{\mathbf{X}}\right)^{-1}\tilde{\mathbf{X}}^{T}\mathbf{z}, \quad \mathbf{z} = \mathbf{S}\tilde{\mathbf{X}}\mathbf{w}^{(t)} + \mathbf{Y} - \boldsymbol{\pi}$$

The logistic regression model optimised through the Newton-Raphson update is known as iterative reweighted least squares (IRLS):

$$\mathbf{w}^{(t+1)} = \left(\tilde{\mathbf{X}}^{T}\mathbf{S}^{(t)}\tilde{\mathbf{X}}\right)^{-1}\tilde{\mathbf{X}}^{T}\mathbf{z}^{(t)}, \qquad \mathbf{z}^{(t)} = \mathbf{S}^{(t)}\tilde{\mathbf{X}}\mathbf{w}^{(t)} + \mathbf{Y} - \boldsymbol{\pi}^{(t)}$$

$$\boldsymbol{\pi}^{(t)} = \left[\pi_1^{(t)}, \pi_2^{(t)}, \ldots, \pi_N^{(t)}\right]^{T}, \quad \pi_i^{(t)} = \sigma\!\left(\tilde{\mathbf{x}}_i^{T}\mathbf{w}^{(t)}\right), \qquad \mathbf{S}^{(t)} = \mathrm{diag}\!\left(s_1^{(t)}, s_2^{(t)}, \ldots, s_N^{(t)}\right), \quad s_i^{(t)} = \pi_i^{(t)}\left(1 - \pi_i^{(t)}\right)$$
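A compact NumPy sketch of the IRLS loop above, assuming X_tilde is the augmented design matrix (bias column included) and y is a 0/1 label vector; the names and iteration count are illustrative:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic_regression(X_tilde, y, n_iters=20):
    """Newton-Raphson / IRLS updates for logistic regression."""
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iters):
        pi = sigmoid(X_tilde @ w)                   # pi_i = sigma(x_i^T w)
        S = np.diag(pi * (1.0 - pi))                # diagonal weight matrix
        z = S @ X_tilde @ w + (y - pi)              # z = S X w + Y - pi
        w = np.linalg.solve(X_tilde.T @ S @ X_tilde, X_tilde.T @ z)
    return w

In practice one would also add a convergence test on w (and possibly a small ridge term when the weighted matrix is near-singular); the loop here simply mirrors the slide's formulas.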
[Figure: Virginica vs. Versicolour samples with the separation boundary given by the linear model $y = \mathbf{w}^{T}\tilde{\mathbf{x}} = w_0 + w_1 x_1 + w_2 x_2$.]
Multi-class case with one-hot label vector $\mathbf{y} = [y_1, y_2, \ldots, y_c]^{T}$, where $y_k \in \{0, 1\}$ and $\sum_{k=1}^{c} y_k = 1$:

$$p(\mathbf{y} \mid \mathbf{x}) = \theta(C_1 \mid \mathbf{x})^{y_1}\,\theta(C_2 \mid \mathbf{x})^{y_2}\cdots\theta(C_c \mid \mathbf{x})^{y_c}, \qquad \text{Likelihood} = \prod_{i=1}^{N} p(\mathbf{y}_i \mid \mathbf{x}_i)$$

$\theta(C_k \mid \mathbf{x})$: given an observed sample $\mathbf{x}$, the probability it is from class $k$. We model it by a softmax function:

$$\theta(C_k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^{T}\tilde{\mathbf{x}})}{\sum_{j=1}^{c}\exp(\mathbf{w}_j^{T}\tilde{\mathbf{x}})}$$

Here, we use the softmax function to estimate probabilities from the linear models $\mathbf{w}_k^{T}\tilde{\mathbf{x}}$.
Likelihood and negative log-likelihood (cross-entropy) for the multi-class model:

$$L(\mathbf{w}_1, \ldots, \mathbf{w}_c) = \prod_{i=1}^{N}\prod_{k=1}^{c}\left[\frac{\exp(\mathbf{w}_k^{T}\tilde{\mathbf{x}}_i)}{\sum_{j=1}^{c}\exp(\mathbf{w}_j^{T}\tilde{\mathbf{x}}_i)}\right]^{y_{ik}}$$

$$O(\mathbf{w}_1, \ldots, \mathbf{w}_c) = -\sum_{i=1}^{N}\sum_{k=1}^{c} y_{ik}\ln\frac{\exp(\mathbf{w}_k^{T}\tilde{\mathbf{x}}_i)}{\sum_{j=1}^{c}\exp(\mathbf{w}_j^{T}\tilde{\mathbf{x}}_i)}$$

$$\nabla_{\mathbf{w}_k} O = \sum_{i=1}^{N}\left[\theta(C_k \mid \mathbf{x}_i) - y_{ik}\right]\tilde{\mathbf{x}}_i$$
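A minimal NumPy sketch of the multi-class model, assuming W stores the weight vectors $\mathbf{w}_1, \ldots, \mathbf{w}_c$ as columns and Y_onehot is the N-by-c one-hot label matrix (names are illustrative):

import numpy as np

def softmax(scores):
    """Row-wise softmax of the linear scores."""
    scores = scores - scores.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_cross_entropy_and_grad(W, X_tilde, Y_onehot):
    """O(W) = -sum_i sum_k y_ik ln theta(C_k | x_i) and its gradient w.r.t. each w_k."""
    probs = softmax(X_tilde @ W)                # N x c matrix of theta(C_k | x_i)
    O = -np.sum(Y_onehot * np.log(probs))
    grad = X_tilde.T @ (probs - Y_onehot)       # column k holds sum_i (theta_ik - y_ik) x_i
    return O, grad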
A basis function example (Gaussian radial basis function):

$$\phi_i(\mathbf{x}) = \exp\left(-\frac{\sum_{j=1}^{d}(x_j - \mu_{ij})^2}{2\sigma_i^2}\right) = \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^2}{2\sigma_i^2}\right).$$

$\boldsymbol{\mu}_i$ and $\sigma_i$ are basis function parameters.

Another basis function example: for a single input variable, $\boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^D]^{T}$. This is known as polynomial regression. The case of $D = 1$ becomes linear regression.

The mapping $\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \ldots, \phi_D(\mathbf{x})]^{T}$ can be viewed as a feature extractor.
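A minimal NumPy sketch of the two basis-function examples above (array names are illustrative):

import numpy as np

def gaussian_basis(X, centres, sigmas):
    """phi_i(x) = exp(-||x - mu_i||^2 / (2 sigma_i^2)) for each centre mu_i."""
    dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)   # squared distances to centres
    return np.exp(-dists / (2.0 * sigmas[None, :] ** 2))

def polynomial_basis(x, D):
    """phi(x) = [1, x, x^2, ..., x^D] for a single input variable."""
    return np.vander(x, D + 1, increasing=True)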
Curve fitting task: construct a curve that has the best fit to a series of data points. Method: incorporate basis functions into a linear least squares model.
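A small illustrative sketch of this method on synthetic data (not the slide's dataset), fitting polynomial bases of degree D = 1, 3, 7 by linear least squares:

import numpy as np

# Illustrative 1-D curve-fitting data.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2.0 * np.pi * x) * 0.5 + 0.5 + rng.normal(scale=0.05, size=x.shape)

for D in (1, 3, 7):
    Phi = np.vander(x, D + 1, increasing=True)            # polynomial basis [1, x, ..., x^D]
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # linear least squares fit
    residual = np.mean((Phi @ w - y) ** 2)
    print(D, residual)                                    # training error shrinks as D grows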
[Figure: regression curves fitted with polynomial basis functions of degree D = 1, 3, 7, shown against the training samples.]
[Figure: the same fitted curves (D = 1, 3, 7) shown against training samples, testing samples and the ground truth.] Testing the fitted curve with new points.
Basis functions can also be used for classification, e.g. a feature vector containing quadratic terms such as $x_1^2$, $x_2^2$ and $x_1 x_2$.

[Figure: training samples and separation boundary.]