Lecture 9: Logistic Regression, Discriminative vs. Generative Classification, Linear Discriminant Functions
Aykut Erdem
October 2017, Hacettepe University
Administrative:
− Midterm exam will be held on November 6.
− Project proposals
slide by Barnabás Póczos & Aarti Singh
Naïve Bayes prediction for test data:

$$\hat{y} = \arg\max_{y}\; \underbrace{P(Y=y)}_{\text{class prior}}\; \prod_{i=1}^{d} \underbrace{P(X_i = x_i \mid Y=y)}_{\text{likelihood}}$$

We need to estimate these probabilities!
slide by Barnabás Póczos & Aarti Singh
Example: classifying a MEDLINE article by topic.
slide by Dan Jurafsky
Typical additional assumption: position in the document doesn't matter:
P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
− "Bag of words" model: the order of words on the page is ignored. The document is just a bag of i.i.d. words.
− Sounds really silly, but often works very well!

With a 50,000-word vocabulary, this leaves K(50,000 − 1) parameters to estimate.
slide by Barnabás Póczos & Aarti Singh
Worked example (multinomial Naïve Bayes with add-1 smoothing):

           Doc  Words                                 Class
Training   1    Chinese Beijing Chinese               c
           2    Chinese Chinese Shanghai              c
           3    Chinese Macao                         c
           4    Tokyo Japan Chinese                   j
Test       5    Chinese Chinese Chinese Tokyo Japan   ?

Estimators:

$$\hat{P}(c) = \frac{N_c}{N}, \qquad \hat{P}(w \mid c) = \frac{\mathrm{count}(w,c)+1}{\mathrm{count}(c)+|V|}$$

Priors: P(c) = 3/4, P(j) = 1/4

Conditional probabilities:
P(Chinese|c) = (5+1)/(8+6) = 6/14 = 3/7
P(Tokyo|c) = (0+1)/(8+6) = 1/14
P(Japan|c) = (0+1)/(8+6) = 1/14
P(Chinese|j) = (1+1)/(3+6) = 2/9
P(Tokyo|j) = (1+1)/(3+6) = 2/9
P(Japan|j) = (1+1)/(3+6) = 2/9

Choosing a class:
P(c|d5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001
→ predict class c.
slide by Dan Jurafsky
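A runnable sketch of this worked example in Python (variable and function names are illustrative; only the toy corpus above is assumed):

```python
from collections import Counter

# Toy corpus from the slide: (words, class)
train = [("Chinese Beijing Chinese".split(), "c"),
         ("Chinese Chinese Shanghai".split(), "c"),
         ("Chinese Macao".split(), "c"),
         ("Tokyo Japan Chinese".split(), "j")]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

classes = {c for _, c in train}
vocab = {w for words, _ in train for w in words}
N = len(train)

def posterior(c):
    """Unnormalized P(c|d): class prior times smoothed word likelihoods."""
    docs_c = [words for words, y in train if y == c]
    prior = len(docs_c) / N                       # P̂(c) = Nc / N
    counts = Counter(w for words in docs_c for w in words)
    total = sum(counts.values())
    score = prior
    for w in test_doc:                            # add-1 (Laplace) smoothing
        score *= (counts[w] + 1) / (total + len(vocab))
    return score

for c in sorted(classes):
    print(c, posterior(c))   # c: ~0.0003, j: ~0.0001 -> predict c
```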
Gaussian Naïve Bayes (GNB): model each class-conditional as a Gaussian,

$$P(X_i = x \mid Y = y_k) = N(x;\, \mu_{ik}, \sigma_{ik})$$

with a different mean and variance for each class k and each pixel (feature) i. Sometimes the variance is assumed to be independent of the class (σi instead of σik).
slide by Barnabás Póczos & Aarti Singh
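A minimal sketch of GNB parameter estimation under this model, assuming a NumPy feature matrix `X` and label vector `y` (all names are illustrative):

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate per-class, per-feature Gaussian parameters and class priors.

    X: (n_samples, d) feature matrix; y: (n_samples,) integer labels.
    Returns dicts mapping class k -> prior, mean vector, variance vector.
    """
    priors, means, variances = {}, {}, {}
    for k in np.unique(y):
        Xk = X[y == k]
        priors[k] = len(Xk) / len(X)          # P̂(Y = k)
        means[k] = Xk.mean(axis=0)            # μ_ik for every feature i
        variances[k] = Xk.var(axis=0) + 1e-9  # σ²_ik (tiny floor for stability)
    return priors, means, variances

def log_joint(x, k, priors, means, variances):
    """log P(Y=k) + Σ_i log N(x_i; μ_ik, σ_ik); predict via argmax over k."""
    var = variances[k]
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - means[k])**2 / var)
    return np.log(priors[k]) + log_lik
```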
slide by Aarti Singh & Barnabás Póczos
Gaussian class conditional densities
slide by Aarti Singh & Barnabás Póczos
Decision boundary:

$$\prod_{i=1}^{d} P(X_i \mid Y=0)\,P(Y=0) \;=\; \prod_{i=1}^{d} P(X_i \mid Y=1)\,P(Y=1)$$
slide by Aarti Singh & Barnabás Póczos
Decision boundary, in log-odds form:

$$\log\frac{P(Y=0)\,\prod_{i=1}^{d}P(X_i\mid Y=0)}{P(Y=1)\,\prod_{i=1}^{d}P(X_i\mid Y=1)} \;=\; \log\frac{1-\pi}{\pi} \;+\; \sum_{i=1}^{d}\log\frac{P(X_i\mid Y=0)}{P(X_i\mid Y=1)}$$

where $\pi = P(Y=1)$.
slide by Aarti Singh & Barnabás Póczos
slide by Aarti Singh & Barnabás Póczos
Decision boundary: a constant term plus a first-order term. For Gaussian class-conditionals with class-independent variances, the log-odds above reduces to exactly these two pieces, so the boundary is linear in X.
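To see where the two terms come from, here is a short derivation, assuming Gaussian class-conditionals $P(X_i \mid Y=y) = N(\mu_{iy}, \sigma_i)$ with class-independent variances:

$$\log\frac{P(X_i\mid Y=0)}{P(X_i\mid Y=1)} = \frac{(X_i-\mu_{i1})^2-(X_i-\mu_{i0})^2}{2\sigma_i^2} = \underbrace{\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i}_{\text{first-order term}} \;+\; \underbrace{\frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}}_{\text{constant term}}$$

Summing over i and adding $\log\frac{1-\pi}{\pi}$ gives a function that is linear in X.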
Decision Boundary

$X = (x_1, x_2)$, $P_1 = P(Y=0)$, $P_2 = P(Y=1)$,
$p_1(X) = p(X \mid Y=0) \sim N(M_1, \Sigma_1)$, $p_2(X) = p(X \mid Y=1) \sim N(M_2, \Sigma_2)$
slide by Aarti Singh & Barnabás Póczos
Logistic Regression assumes the following functional form for P(Y|X): the logistic function applied to a linear function of the data,

$$P(Y=1\mid X) = \frac{\exp\!\big(w_0+\sum_{i} w_i X_i\big)}{1+\exp\!\big(w_0+\sum_{i} w_i X_i\big)} = \frac{1}{1+\exp\!\big(-(w_0+\sum_{i} w_i X_i)\big)}$$

Logistic function (or sigmoid): $\sigma(z) = \dfrac{1}{1+e^{-z}}$

Features can be discrete or continuous!

slide by Aarti Singh & Barnabás Póczos
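As a concrete illustration, a minimal NumPy sketch of this functional form (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, w0):
    """P(Y=1|X) = sigmoid(w0 + X·w) for each row of X."""
    return sigmoid(w0 + X @ w)

def predict(X, w, w0):
    """Label 1 iff P(Y=1|X) > 0.5, i.e. iff w0 + X·w > 0."""
    return (predict_proba(X, w, w0) > 0.5).astype(int)
```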
With this functional form for P(Y|X), the decision boundary P(Y=0|X) = P(Y=1|X) is where $w_0+\sum_i w_i X_i = 0$: a linear decision boundary!

slide by Aarti Singh & Barnabás Póczos
In the more general multiclass case, logistic regression assumes the following functional form for P(Y|X):

$$P(Y=k\mid X) = \frac{\exp\!\big(w_{k0}+\sum_i w_{ki}X_i\big)}{1+\sum_{j=1}^{K-1}\exp\!\big(w_{j0}+\sum_i w_{ji}X_i\big)} \quad \text{for } k<K$$

$$P(Y=K\mid X) = \frac{1}{1+\sum_{j=1}^{K-1}\exp\!\big(w_{j0}+\sum_i w_{ji}X_i\big)} \quad \text{for } k=K \text{ (normalization, so no weights for this class)}$$

slide by Aarti Singh & Barnabás Póczos
But there is a problem… we don't have a model for P(X) or P(X|Y), only for P(Y|X).

We'll focus on binary classification:

slide by Aarti Singh & Barnabás Póczos
Discriminative philosophy: don't waste effort learning P(X); focus on P(Y|X), since that's all that matters for classification!

slide by Aarti Singh & Barnabás Póczos
Instead, maximize the conditional log-likelihood:

$$l(w) \equiv \ln \prod_{j} P(y^j \mid x^j, w)$$
slide by Aarti Singh & Barnabás Póczos
Recall the model:

$$P(Y=0\mid X) = \frac{1}{1+\exp\!\big(w_0+\sum_{i=1}^{d} w_i X_i\big)}, \qquad P(Y=1\mid X) = \frac{\exp\!\big(w_0+\sum_{i=1}^{d} w_i X_i\big)}{1+\exp\!\big(w_0+\sum_{i=1}^{d} w_i X_i\big)}$$

Using these, we can re-express the log of the conditional likelihood:

$$l(w) = \sum_j \Big[ y^j \ln P(Y=1\mid x^j,w) + (1-y^j)\ln P(Y=0\mid x^j,w) \Big] = \sum_j \Big[ y^j \big(w_0+\textstyle\sum_{i=1}^{d} w_i x_i^j\big) - \ln\big(1+\exp(w_0+\textstyle\sum_{i=1}^{d} w_i x_i^j)\big) \Big]$$

slide by Aarti Singh & Barnabás Póczos
Bad news: there is no closed-form solution to maximize l(w).
Good news: l(w) is a concave function of w! Concave functions are easy to optimize (unique maximum).

slide by Aarti Singh & Barnabás Póczos
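For the gradient steps used below, differentiating the expression for l(w) above gives a convenient form (a standard derivation):

$$\frac{\partial l(w)}{\partial w_i} = \sum_j x_i^j\Big(y^j - \hat{P}(Y^j=1\mid x^j, w)\Big)$$

i.e., a sum of prediction errors weighted by feature values.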
Gradient ascent (concave) / gradient descent (convex)

Gradient: $\nabla_w l(w) = \left[\frac{\partial l(w)}{\partial w_0},\ldots,\frac{\partial l(w)}{\partial w_d}\right]$

Update rule: $w_i^{(t+1)} = w_i^{(t)} + \eta\,\frac{\partial l(w)}{\partial w_i}\Big|_{w^{(t)}}$, with learning rate $\eta > 0$.

slide by Aarti Singh & Barnabás Póczos
Gradient ascent algorithm: iterate until change < ε

$$w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_j \big( y^j - \hat{P}(Y^j=1 \mid x^j, w^{(t)}) \big)$$

For i = 1, …, d, repeat:

$$w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_j x_i^j \big( y^j - \hat{P}(Y^j=1 \mid x^j, w^{(t)}) \big)$$

Here $\hat{P}(Y^j=1\mid x^j, w)$ predicts what the current weights think label Y should be. More sophisticated optimizers can also be used, e.g. Newton's method, conjugate gradient ascent, IRLS (see Bishop 4.3.3).

slide by Aarti Singh & Barnabás Póczos
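A minimal runnable sketch of this update loop, assuming a NumPy feature matrix `X` and binary label vector `y`; the per-example averaging of the gradient and the convergence test are implementation choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.1, eps=1e-6, max_iters=10000):
    """Batch gradient ascent on the conditional log-likelihood l(w)."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])     # prepend x0 = 1 so w[0] is w0
    w = np.zeros(d + 1)
    for _ in range(max_iters):
        p = sigmoid(Xb @ w)                  # current P(Y=1 | x, w) per example
        grad = Xb.T @ (y - p)                # ∂l/∂w_i = Σ_j x_i^j (y^j − p^j)
        w_new = w + eta * grad / n           # averaged step for stability
        if np.max(np.abs(w_new - w)) < eps:  # iterate until change < ε
            return w_new
        w = w_new
    return w
```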
slide by Aarti Singh & Barnabás Póczos
− Large η → fast convergence, but larger residual error; also possible oscillations
− Small η → slow convergence, but small residual error
Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) ↔ Set of Logistic Regression parameters
− But only in a special case!!! (GNB with class-independent variances)
− Optimize different functions! Obtain different solutions

slide by Aarti Singh & Barnabás Póczos
slide by Aarti Singh & Barnabás Póczos
Discriminative (LR) and generative (GNB) classifiers perform similarly when the Naïve Bayes conditional independence assumption holds. If the conditional independence assumption does NOT hold, the discriminative classifier outperforms the generative one.
slide by Aarti Singh & Barnabás Póczos
[Ng & Jordan, NIPS 2001]
slide by Aarti Singh & Barnabás Póczos
[Ng & Jordan, NIPS 2001]
Naïve Bayes (generative) requires only n = O(log d) examples to converge to its asymptotic error, whereas logistic regression (discriminative) requires n = O(d). This is because Naïve Bayes estimates its parameters independently, not jointly, from the training data.
slide by Aarti Singh & Barnabás Póczos
[Figure: classification error vs. number of training examples m on UCI datasets (pima, adult, boston (predict if > median price), ionosphere), comparing Naïve Bayes and logistic regression.]

Experiments on the UCI Machine Learning Repository: 15 datasets, 8 with continuous features, 7 with discrete features. More in the paper...
slide by Aarti Singh & Barnabás Póczos
slide by Aarti Singh & Barnabás Póczos
Naïve Bayes vs. logistic regression, in summary:
− LR is a linear classifier: the decision rule is a hyperplane
− LR is trained by maximizing the conditional likelihood: no closed-form solution, but l(w) is concave → global optimum with gradient ascent
− GNB with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function
− They make different assumptions. NB: features independent given class, an assumption on P(X|Y). LR: functional form of P(Y|X), no assumption on P(X|Y)
− GNB (usually) needs less data; LR (usually) gets to better solutions in the limit
A linear discriminant function:

$$y(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + w_0, \qquad \operatorname{sign}(a) = \begin{cases} +1, & a>0 \\ -1, & a<0 \end{cases}$$
slide by Ce Liu
For any point x on the decision surface, $y(\mathbf{x})=0$, so $\dfrac{\mathbf{w}^\top\mathbf{x}}{\|\mathbf{w}\|} = -\dfrac{w_0}{\|\mathbf{w}\|}$; the normal distance from the origin to the decision surface is therefore $-w_0/\|\mathbf{w}\|$.
[Figure (cf. Bishop Fig. 4.1): geometry of a linear discriminant in 2-D (x1, x2): decision surface y = 0 between regions R1 (y > 0) and R2 (y < 0), normal vector w, projection x⊥, signed distance y(x)/‖w‖ of a point x from the surface, and offset w0/‖w‖ of the surface from the origin.]
The decision surface is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w0. The signed perpendicular distance r of a general point x from the decision surface is given by y(x)/‖w‖.
slide by Ce Liu
Write $\mathbf{x} = \mathbf{x}_\perp + r\,\dfrac{\mathbf{w}}{\|\mathbf{w}\|}$. Then

$$\mathbf{w}^\top\mathbf{x} + w_0 = \mathbf{w}^\top\mathbf{x}_\perp + w_0 + r\|\mathbf{w}\| \;\Rightarrow\; y(\mathbf{x}) = r\|\mathbf{w}\| \;\Rightarrow\; r = \frac{y(\mathbf{x})}{\|\mathbf{w}\|}$$

(using $y(\mathbf{x}_\perp) = 0$). With augmented vectors, define $\tilde{\mathbf{w}} = (w_0, \mathbf{w})$ and $\tilde{\mathbf{x}} = (1, \mathbf{x})$, so that $\tilde{y}(\mathbf{x}) = \tilde{\mathbf{w}}^\top\tilde{\mathbf{x}}$.
slide by Ce Liu
[Figure (cf. Bishop Fig. 4.2): combining binary classifiers leaves ambiguous regions, marked "?". Left: one-versus-the-rest (C1 vs. not C1, C2 vs. not C2) gives regions R1, R2, R3 with an ambiguous region. Right: one-versus-one (C1/C2, C1/C3, C2/C3) also leaves an ambiguous region.]
slide by Ce Liu
A solution: use K linear discriminant functions $y_k(\mathbf{x}) = \mathbf{w}_k^\top\mathbf{x} + w_{k0}$ and assign x to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$.
slide by Ce Liu
Theorem. The decision regions of the K-class discriminant $y_k(\mathbf{x}) = \mathbf{w}_k^\top\mathbf{x} + w_{k0}$ are singly connected and convex.

Proof. Suppose two points $\mathbf{x}_A$ and $\mathbf{x}_B$ both lie inside decision region $R_k$. Any point $\hat{\mathbf{x}}$ on the line between $\mathbf{x}_A$ and $\mathbf{x}_B$ can be expressed as

$$\hat{\mathbf{x}} = \lambda\mathbf{x}_A + (1-\lambda)\mathbf{x}_B, \quad 0 \le \lambda \le 1.$$

By linearity,

$$y_k(\hat{\mathbf{x}}) = \lambda y_k(\mathbf{x}_A) + (1-\lambda)y_k(\mathbf{x}_B) > \lambda y_j(\mathbf{x}_A) + (1-\lambda)y_j(\mathbf{x}_B) = y_j(\hat{\mathbf{x}}) \quad (\forall j \neq k).$$

Therefore $\hat{\mathbf{x}} \in R_k$, and the region $R_k$ is singly connected and convex. ∎
slide by Ce Liu
[Figure (cf. Bishop Fig. 4.3): decision regions Ri, Rj, Rk of a multiclass linear discriminant, with points xA, xB inside Rk and a point x̂ on the line between them.]
If two points xA and xB both lie inside the same decision region Rk, then any point x that lies on the line connecting these two points must also lie in Rk, and hence the decision region must be singly connected and convex.
slide by Ce Liu
A way to view a linear classification model is in terms of dimensionality reduction: project the input onto one dimension, $y = \mathbf{w}^\top\mathbf{x}$, and adjust w so that the classes can be maximally separated in the projection.
[Figure (cf. Bishop Fig. 4.6): two-class data projected onto one dimension. Left: projection onto the difference of the class means. Right: Fisher's linear discriminant, which gives better separation.]
Class means:

$$\mathbf{m}_1 = \frac{1}{N_1}\sum_{n\in C_1}\mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{n\in C_2}\mathbf{x}_n$$
slide by Ce Liu
The squared separation of the projected means is

$$\big(\mathbf{w}^\top(\mathbf{m}_1-\mathbf{m}_2)\big)^2 = \mathbf{w}^\top(\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^\top\mathbf{w} = \mathbf{w}^\top S_B\,\mathbf{w}$$

where $S_B = (\mathbf{m}_1-\mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^\top$ is called the between-class covariance matrix.
Similarly, the within-class variance of the projected data is $\mathbf{w}^\top S_W\,\mathbf{w}$, where

$$S_W = \sum_{n\in C_1}(\mathbf{x}_n-\mathbf{m}_1)(\mathbf{x}_n-\mathbf{m}_1)^\top + \sum_{n\in C_2}(\mathbf{x}_n-\mathbf{m}_2)(\mathbf{x}_n-\mathbf{m}_2)^\top$$

is the within-class covariance matrix.
slide by Ce Liu
Fisher criterion: maximize the ratio

$$J(\mathbf{w}) = \frac{\text{between-class variance}}{\text{within-class variance}} = \frac{\mathbf{w}^\top S_B\,\mathbf{w}}{\mathbf{w}^\top S_W\,\mathbf{w}}$$

Using the quotient rule, for $f(x) = \frac{g(x)}{h(x)}$, $f'(x) = \frac{g'(x)h(x) - g(x)h'(x)}{h^2(x)}$, setting the derivative of J to zero gives

$$(\mathbf{w}^\top S_B\,\mathbf{w})\,S_W\mathbf{w} = (\mathbf{w}^\top S_W\,\mathbf{w})\,S_B\mathbf{w} = (\mathbf{w}^\top S_W\,\mathbf{w})\,(\mathbf{m}_2-\mathbf{m}_1)\big((\mathbf{m}_2-\mathbf{m}_1)^\top\mathbf{w}\big)$$

We only care about the direction of w, so the scalar factors are dropped. Therefore

$$\mathbf{w} \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$$
slide by Ce Liu
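A minimal sketch of Fisher's direction in NumPy, assuming arrays `X1` and `X2` holding one class's samples each (names are illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction: w ∝ S_W^{-1} (m2 - m1).

    X1, X2: (n1, d) and (n2, d) arrays of samples from the two classes.
    Returns the (unnormalized) projection direction w.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class covariance S_W = Σ (x-m1)(x-m1)^T + Σ (x-m2)(x-m2)^T
    D1, D2 = X1 - m1, X2 - m2
    S_W = D1.T @ D1 + D2.T @ D2
    return np.linalg.solve(S_W, m2 - m1)   # avoids explicitly inverting S_W

# Projecting onto w reduces the problem to 1-D: y = X @ w, then threshold y.
```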
Fisher's discriminant thus reduces the d-dimensional classification problem to 1D: classify by thresholding the projected value (adaptive thresholds lead to nonlinear classifiers). The final classifier has the form below, where the nonlinear activation function sign(·) is a step function.
$$y(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top\mathbf{x} + w_0), \qquad \operatorname{sign}(a) = \begin{cases} +1, & a>0 \\ -1, & a<0 \end{cases}$$
slide by Ce Liu