Lecture 9:
− Naïve Bayes Classifier (cont'd.)
− Logistic Regression
− Discriminative vs. Generative Classification
− Linear Discriminant Functions
Aykut Erdem
October 2016 Hacettepe University
Last time… Naïve Bayes Classifier
Given:
– Class prior P(Y)
– d conditionally independent features X1, …, Xd given the class label Y
– For each feature Xi, the conditional likelihood P(Xi | Y)

Naïve Bayes decision rule (discrete features):

y* = argmax_y P(Y = y) ∏_{i=1}^d P(Xi = xi | Y = y)

The NB prediction for test data plugs in the class prior and the likelihoods. We need to estimate these probabilities!

Estimators (relative frequencies from the training data):
– Class prior: P̂(Y = y) = #{examples with label y} / #{examples}
– Likelihood: P̂(Xi = x | Y = y) = #{examples with Xi = x and label y} / #{examples with label y}

slide by Barnabás Póczos & Aarti Singh
Last time… Naïve Bayes Algorithm for discrete features
slide by Barnabás Póczos & Aarti Singh

Last time… Text Classification

(figure: MeSH Subject Category Hierarchy; which category should a given MEDLINE article be assigned to?)

slide by Dan Jurafsky

Last time… Bag of words model
Typical additional assumption: position in the document doesn't matter: P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
– "Bag of words" model: the order of words on the page is ignored. The document is just a bag of i.i.d. words.
– Sounds really silly, but often works very well!

The probability of a document with words x1, x2, … given class y factorizes as ∏_i P(xi | y). With a 50,000-word vocabulary, this leaves K(50000 − 1) parameters to estimate.

slide by Barnabás Póczos & Aarti Singh
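To make the bag-of-words counting concrete, here is a minimal Python sketch (an illustration, not from the original slides); the whitespace tokenizer is a simplifying assumption:

from collections import Counter

def bag_of_words(document):
    # Lowercase, split on whitespace, and count occurrences.
    # Word order is discarded; only the counts survive.
    return Counter(document.lower().split())

print(bag_of_words("to be or not to be"))
# Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})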
The bag of words representation

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

The bag of words representation: using a subset of words

(figure: the same review with everything masked out except a subset of words: love, sweet, satirical, great, fun, whimsical, romantic, laughing, recommend, several, happy, again)

The bag of words representation

great      2
love       2
recommend  1
laugh      1
happy      1
…          …

slide by Dan Jurafsky
A worked example:

Doc  Words                                Class
Training:
 1   Chinese Beijing Chinese              c
 2   Chinese Chinese Shanghai             c
 3   Chinese Macao                        c
 4   Tokyo Japan Chinese                  j
Test:
 5   Chinese Chinese Chinese Tokyo Japan  ?

Priors: P(c) = 3/4, P(j) = 1/4

Smoothed estimates (add-one):

P̂(c) = Nc / N
P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|)

Conditional probabilities (|V| = 6):
P(Chinese | c) = (5 + 1) / (8 + 6) = 6/14 = 3/7
P(Tokyo | c)   = (0 + 1) / (8 + 6) = 1/14
P(Japan | c)   = (0 + 1) / (8 + 6) = 1/14
P(Chinese | j) = (1 + 1) / (3 + 6) = 2/9
P(Tokyo | j)   = (1 + 1) / (3 + 6) = 2/9
P(Japan | j)   = (1 + 1) / (3 + 6) = 2/9

Choosing a class:
P(c | d5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
P(j | d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001

Since 0.0003 > 0.0001, class c is chosen.

slide by Dan Jurafsky
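The arithmetic above can be checked with a short Python sketch (an illustration, not part of the original slides) that implements the multinomial Naïve Bayes estimates with add-one smoothing:

from collections import Counter

# Training documents and labels from the table above
train = [("Chinese Beijing Chinese", "c"), ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"), ("Tokyo Japan Chinese", "j")]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

labels = [y for _, y in train]
prior = {y: labels.count(y) / len(labels) for y in set(labels)}  # P(c)=3/4, P(j)=1/4

counts = {y: Counter() for y in prior}              # per-class word counts
for doc, y in train:
    counts[y].update(doc.split())
V = {w for doc, _ in train for w in doc.split()}    # vocabulary, |V| = 6

def likelihood(w, y):
    # Add-one smoothing: (count(w,c) + 1) / (count(c) + |V|)
    return (counts[y][w] + 1) / (sum(counts[y].values()) + len(V))

for y in sorted(prior):
    score = prior[y]
    for w in test_doc:
        score *= likelihood(w, y)
    print(y, round(score, 5))   # c: ~0.0003, j: ~0.0001 -> choose c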
Twenty Newsgroups results

Naïve Bayes: 89% accuracy

slide by Barnabás Póczos & Aarti Singh
What if features are continuous? (e.g., image pixel intensities)

Gaussian Naïve Bayes (GNB):

P(Xi = x | Y = k) = (1 / √(2π σik²)) exp(−(x − μik)² / (2σik²))

Different mean μik and variance σik² for each class k and each pixel i. Sometimes the variance is assumed to be independent of the class k, of the pixel i, or of both.

Estimating parameters: Y discrete, Xi continuous. Maximum likelihood estimates:

μ̂ik = (1 / #{j : yʲ = k}) ∑_{j : yʲ = k} xiʲ

σ̂ik² = (1 / #{j : yʲ = k}) ∑_{j : yʲ = k} (xiʲ − μ̂ik)²

where j indexes training images, xiʲ is the ith pixel in the jth training image, and k indexes classes.

slide by Barnabás Póczos & Aarti Singh
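A minimal Gaussian Naïve Bayes sketch of these MLE formulas (the synthetic data are an assumption for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)),    # class 0 samples
               rng.normal(2.0, 1.0, (100, 5))])   # class 1 samples
y = np.array([0] * 100 + [1] * 100)

classes = np.unique(y)
prior = {k: np.mean(y == k) for k in classes}        # class priors
mu = {k: X[y == k].mean(axis=0) for k in classes}    # per-class, per-feature means
var = {k: X[y == k].var(axis=0) for k in classes}    # per-class, per-feature variances

def log_posterior(x, k):
    # log P(Y=k) + sum_i log N(x_i; mu_ik, sigma_ik^2), up to a constant
    return np.log(prior[k]) - 0.5 * np.sum(np.log(2 * np.pi * var[k])
                                           + (x - mu[k]) ** 2 / var[k])

x_new = np.full(5, 1.5)
print(max(classes, key=lambda k: log_posterior(x_new, k)))   # predicts class 1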
Case Study: Classifying Mental States [Mitchell et al.]

Example: GNB for classifying mental states from fMRI images:
– ~1 mm resolution, ~2 images per sec., 15,000 voxels/image
– non-invasive, safe
– measures the Blood Oxygen Level Dependent (BOLD) response
– can track activation with precision and sensitivity

slide by Barnabás Póczos & Aarti Singh
Learned Naïve Bayes Models – Means for P(BrainActivity | WordCategory)

Pairwise classification accuracy: 78-99%, 12 participants

(figure: mean activation maps for "Tool words" vs. "Building words") [Mitchell et al.]

slide by Barnabás Póczos & Aarti Singh
What you should know…
– Naïve Bayes classifier
– Text classification
– Gaussian NB

Logistic Regression
Last time… Naïve Bayes

Gaussian Naïve Bayes (GNB): Gaussian class conditional densities. What does the decision boundary look like?

slide by Aarti Singh & Barnabás Póczos
GNB with equal variance is a Linear Classifier!

Decision boundary:

∏_{i=1}^d P(Xi | Y = 0) P(Y = 0) = ∏_{i=1}^d P(Xi | Y = 1) P(Y = 1)

Taking logs, with π = P(Y = 1):

log [ P(Y = 0) ∏_{i=1}^d P(Xi | Y = 0) ] / [ P(Y = 1) ∏_{i=1}^d P(Xi | Y = 1) ]
  = log((1 − π)/π) + ∑_{i=1}^d log [ P(Xi | Y = 0) / P(Xi | Y = 1) ] = 0

Substituting the Gaussian class conditionals with equal (class-independent) variances, the quadratic terms in Xi cancel, and the boundary reduces to a constant term plus a first-order term: a linear function of X.

slide by Aarti Singh & Barnabás Póczos
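The cancellation can be verified numerically; the sketch below (with made-up parameters) compares the GNB log-odds computed directly against the linear form w0 + wᵀx:

import numpy as np

mu0 = np.array([0.0, 1.0, -1.0])     # class-0 means (illustrative)
mu1 = np.array([2.0, 0.0, 1.0])      # class-1 means
sigma2 = np.array([1.0, 2.0, 0.5])   # shared, class-independent variances
pi = 0.4                              # pi = P(Y = 1)

# Expanding the Gaussians, the x^2 terms cancel, leaving an affine function:
w = (mu1 - mu0) / sigma2
w0 = np.log(pi / (1 - pi)) + np.sum((mu0 ** 2 - mu1 ** 2) / (2 * sigma2))

def log_odds(x):
    # log [P(Y=1)p(x|Y=1)] - log [P(Y=0)p(x|Y=0)]; normalizers cancel
    ll = lambda mu: -np.sum((x - mu) ** 2 / (2 * sigma2))
    return np.log(pi) + ll(mu1) - np.log(1 - pi) - ll(mu0)

x = np.array([0.3, -1.2, 2.0])
print(log_odds(x), w0 + w @ x)   # the two values agree: the boundary is linear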
Gaussian Naïve Bayes (GNB) Decision Boundary

X = (x1, x2), P1 = P(Y = 0), P2 = P(Y = 1)
p1(X) = p(X | Y = 0) ∼ N(M1, Σ1)
p2(X) = p(X | Y = 1) ∼ N(M2, Σ2)

slide by Aarti Singh & Barnabás Póczos
Generative vs. Discriminative Classifiers

Generative classifiers (e.g., Naïve Bayes) model P(Y) and P(X | Y), then use Bayes rule to obtain P(Y | X). Discriminative classifiers ask: why not estimate P(Y | X), or the decision boundary, directly?
Logistic Regression

Assumes the following functional form for P(Y | X):

P(Y = 1 | X) = 1 / (1 + exp(−(w0 + ∑_i wi Xi)))

This is the logistic function (or sigmoid), logit(z) = 1 / (1 + e^(−z)), applied to a linear function of the data. (figure: the sigmoid curve, z vs. logit(z))

Features can be discrete or continuous!

slide by Aarti Singh & Barnabás Póczos
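In code, this functional form is one line (a sketch with made-up weights):

import numpy as np

def sigmoid(z):
    # logistic function: squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w, w0):
    # P(Y = 1 | X): sigmoid applied to a linear function of the data
    return sigmoid(w0 + w @ x)

w, w0 = np.array([1.5, -2.0]), 0.5                # illustrative parameters
print(p_y1_given_x(np.array([1.0, 0.5]), w, w0))  # sigmoid(1.0) ≈ 0.73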
Logistic Regression is a Linear Classifier!

Assuming the functional form for P(Y | X) above, the decision boundary follows from comparing the two posteriors. Predict Y = 0 iff

P(Y = 0 | X) > P(Y = 1 | X), i.e. 1 > exp(w0 + ∑_i wi Xi), i.e. 0 > w0 + ∑_i wi Xi

so the decision boundary w0 + ∑_i wi Xi = 0 is a hyperplane (Linear Decision Boundary).

slide by Aarti Singh & Barnabás Póczos
Logistic Regression for more than 2 classes

Y ∈ {y1, …, yK}. In the more general case,

P(Y = yk | X) = exp(wk0 + ∑_i wki Xi) / (1 + ∑_{j=1}^{K−1} exp(wj0 + ∑_i wji Xi))   for k < K

P(Y = yK | X) = 1 / (1 + ∑_{j=1}^{K−1} exp(wj0 + ∑_i wji Xi))   for k = K (normalization, so no weights for this class)

slide by Aarti Singh & Barnabás Póczos
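A sketch of this K-class form (weights are illustrative; class K gets no weights and serves as the reference for normalization):

import numpy as np

def multiclass_lr(x, W, b):
    # P(Y = y_k | X) for k < K uses exp(w_k0 + sum_i w_ki x_i);
    # class K takes whatever probability mass remains.
    scores = np.exp(W @ x + b)             # one score per class k < K
    Z = 1.0 + scores.sum()                 # normalization constant
    return np.append(scores / Z, 1.0 / Z)  # last entry is class K

W = np.array([[1.0, -0.5],
              [0.2, 0.3]])                 # K = 3 classes, d = 2 features
b = np.array([0.1, -0.2])
p = multiclass_lr(np.array([0.5, 1.0]), W, b)
print(p, p.sum())                          # a valid distribution, sums to 1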
Training Logistic Regression

We'll focus on binary classification:

P(Y = 0 | X, W) = 1 / (1 + exp(w0 + ∑_i wi Xi))
P(Y = 1 | X, W) = exp(w0 + ∑_i wi Xi) / (1 + exp(w0 + ∑_i wi Xi))

How to learn the parameters w0, w1, …, wd? Given training data, a first idea is Maximum Likelihood Estimation:

W_MLE = argmax_W ∏_l P(X^l, Y^l | W)

But there is a problem… we don't have a model for P(X) or P(X | Y), only for P(Y | X). So instead use Maximum (Conditional) Likelihood Estimation:

W_MCLE = argmax_W ∏_l P(Y^l | X^l, W)

Discriminative philosophy: don't waste effort learning P(X); focus on P(Y | X), which is all that matters for classification!

slide by Aarti Singh & Barnabás Póczos
Expressing Conditional log Likelihood

l(W) = ∑_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ]

Y can take only the values 0 or 1, so only one of the two terms in the expression is non-zero for any given Y^l.

Using

P(Y = 0 | X) = 1 / (1 + exp(w0 + ∑_{i=1}^n wi Xi))
P(Y = 1 | X) = exp(w0 + ∑_{i=1}^n wi Xi) / (1 + exp(w0 + ∑_{i=1}^n wi Xi))

we can re-express the log of the conditional likelihood:

l(W) = ∑_l [ Y^l ln ( P(Y^l = 1 | X^l, W) / P(Y^l = 0 | X^l, W) ) + ln P(Y^l = 0 | X^l, W) ]
     = ∑_l [ Y^l (w0 + ∑_{i=1}^n wi Xi^l) − ln (1 + exp(w0 + ∑_{i=1}^n wi Xi^l)) ]

slide by Aarti Singh & Barnabás Póczos
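The final expression is easy to compute; a minimal sketch (the toy data are an assumption):

import numpy as np

def cond_log_likelihood(X, y, w, w0):
    # l(W) = sum_l [ y^l (w0 + w.x^l) - ln(1 + exp(w0 + w.x^l)) ]
    z = w0 + X @ w
    return np.sum(y * z - np.log1p(np.exp(z)))

X = np.array([[0.5, 1.0], [-1.0, 0.2], [2.0, -0.3]])  # 3 examples, 2 features
y = np.array([1, 0, 1])
print(cond_log_likelihood(X, y, w=np.array([0.3, -0.1]), w0=0.0))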
Maximizing Conditional log Likelihood

Bad news: no closed-form solution to maximize l(W).
Good news: l(W) is a concave function of W! Concave functions are easy to optimize (unique maximum).

slide by Aarti Singh & Barnabás Póczos
Optimizing concave/convex functions

Gradient Ascent (concave) / Gradient Descent (convex)

Gradient: ∇_W l(W) = [∂l(W)/∂w0, …, ∂l(W)/∂wd]
Update rule: ΔW = η ∇_W l(W), i.e. wi ← wi + η ∂l(W)/∂wi
Learning rate η > 0.

slide by Aarti Singh & Barnabás Póczos
Gradient Ascent for Logistic Regression

Gradient ascent algorithm: iterate until change < ε:

w0 ← w0 + η ∑_l [ Y^l − P̂(Y^l = 1 | X^l, W) ]
For i = 1, …, d:  wi ← wi + η ∑_l Xi^l [ Y^l − P̂(Y^l = 1 | X^l, W) ]
repeat

The term P̂(Y^l = 1 | X^l, W) predicts what the current weights think the label Y should be.

For faster convergence: e.g., Newton's method, conjugate gradient ascent, IRLS (see Bishop 4.3.3).

slide by Aarti Singh & Barnabás Póczos
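Putting the update rule into a complete training loop (a sketch; data, learning rate, and stopping constants are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=0.01, eps=1e-6, max_iters=10000):
    # Gradient ascent on l(W): w_i <- w_i + eta * sum_l x_i^l (y^l - P(Y=1|x^l))
    n, d = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])   # prepend constant 1 for the bias w0
    w = np.zeros(d + 1)
    for _ in range(max_iters):
        grad = Xa.T @ (y - sigmoid(Xa @ w))
        w_new = w + eta * grad
        if np.max(np.abs(w_new - w)) < eps:   # iterate until change < eps
            break
        w = w_new
    return w

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w = train_logistic(X, y)
preds = sigmoid(np.hstack([np.ones((100, 1)), X]) @ w) > 0.5
print(w, np.mean(preds == y))   # learned weights and training accuracy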
Effect of step-size η

Large η → fast convergence, but larger residual error; also possible oscillations.
Small η → slow convergence, but small residual error.

slide by Aarti Singh & Barnabás Póczos
Naïve Bayes vs. Logistic Regression

The set of Gaussian Naïve Bayes parameters (with feature variance independent of the class label) can be mapped to a set of Logistic Regression parameters.
– But only in a special case!!! (GNB with class-independent variances)
– In general the two optimize different functions and obtain different solutions.

slide by Aarti Singh & Barnabás Póczos
Consider Y Boolean, Xi continuous, X = <X1 … Xd>.

Number of parameters:
– NB: π, (μ1,y, μ2,y, …, μd,y), (σ²1,y, σ²2,y, …, σ²d,y): 4d + 1 in total
– LR: w0, w1, …, wd: d + 1 in total

Estimation method:
– NB: parameter estimates are uncoupled (closed form)
– LR: parameter estimates are coupled (iterative optimization)

slide by Aarti Singh & Barnabás Póczos
Generative vs. Discriminative

Given infinite data (asymptotically):
– If the conditional independence assumption holds, discriminative and generative NB perform similarly.
– If the conditional independence assumption does NOT hold, discriminative (logistic regression) outperforms generative NB.

[Ng & Jordan, NIPS 2001]

Given finite data (n data points, d features):

Naïve Bayes (generative) requires n = O(log d) to converge to its asymptotic error, whereas logistic regression (discriminative) requires n = O(d).

Why? "Independent class conditional densities": the parameters for each feature can be estimated independently, not jointly, from the training data.

slide by Aarti Singh & Barnabás Póczos
Naïve Bayes vs. Logistic Regression: Verdict

Both learn a linear decision boundary. Naïve Bayes makes more restrictive assumptions and has a higher asymptotic error, BUT it converges faster to its (less accurate) asymptotic error.
Experimental Comparison (Ng-Jordan '01)

UCI Machine Learning Repository: 15 datasets, 8 continuous features, 7 discrete features. (figures: test error vs. number of training examples m for Naïve Bayes and Logistic Regression on pima, adult, and boston (predict if > median price); more in the paper…)

slide by Aarti Singh & Barnabás Póczos
What you should know

– LR is a linear classifier: the decision rule is a hyperplane.
– LR is trained by maximizing the conditional likelihood: no closed-form solution, but l(W) is concave → global optimum with gradient ascent.
– GNB with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function.
– NB: features independent given class → assumption on P(X|Y). LR: functional form of P(Y|X), no assumption on P(X|Y).
– GNB (usually) needs less data; LR (usually) gets to better solutions in the limit.

slide by Aarti Singh & Barnabás Póczos
Linear Discriminant Functions

Linear Discriminant Function:

y(x) = wᵀx + w0

where w is called the weight vector, and w0 is a bias. The classifier is

C(x) = sign(wᵀx + w0)

where the step function sign(·) is defined as

sign(a) = +1 if a > 0; −1 if a < 0

slide by Ce Liu
Properties of Linear Discriminant Functions

– The decision surface y(x) = 0 is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w0: for any x on the surface, wᵀx/‖w‖ = −w0/‖w‖, so the distance from the origin to the decision surface is −w0/‖w‖.
– The perpendicular distance r of a general point x from the decision surface is given by y(x)/‖w‖.

To see this, write x = x⊥ + r w/‖w‖, where x⊥ is the projection of x onto the decision surface. Then

wᵀx = wᵀx⊥ + r wᵀw/‖w‖
wᵀx + w0 = wᵀx⊥ + w0 + r‖w‖
y(x) = r‖w‖   (since y(x⊥) = 0)
r = y(x)/‖w‖

With the augmented notation w̃ = (w0, w) and x̃ = (1, x), we can write ỹ(x) = w̃ᵀx̃.

slide by Ce Liu
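These properties are straightforward to check numerically (w and x below are made up for illustration):

import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0     # illustrative weight vector and bias

def y_lin(x):
    return w @ x + w0                  # y(x) = w^T x + w0

x = np.array([2.0, 2.0])
r = y_lin(x) / np.linalg.norm(w)       # signed perpendicular distance to surface
print(np.sign(y_lin(x)), r)            # y(x) = 9 > 0, so class +1; r = 9/5 = 1.8

# Augmented notation: w~ = (w0, w), x~ = (1, x), so y(x) = w~^T x~
w_aug, x_aug = np.append(w0, w), np.append(1.0, x)
print(w_aug @ x_aug)                   # same value: 9.0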
Multiple Classes: Simple Extension

– One-versus-the-rest: K − 1 classifiers, each separating points in class Ck from points not in Ck.
– One-versus-one: one binary classifier for every pair of classes.
– Both leave ambiguous regions of the input space. (figures: regions R1, R2, R3 with an ambiguous region "?", for the "C1 / not C1, C2 / not C2" case and for the pairwise "C1/C2, C1/C3, C2/C3" case)

slide by Ce Liu
Multiple Classes: K-Class Discriminant

Use K linear functions

yk(x) = wkᵀx + wk0

and classify according to

C(x) = k, if yk(x) > yj(x) ∀ j ≠ k

The decision boundary between classes Ck and Cj is given by yk(x) = yj(x), i.e. the hyperplane

(wk − wj)ᵀx + (wk0 − wj0) = 0

slide by Ce Liu
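A sketch of this K-class decision rule (parameters are illustrative):

import numpy as np

W = np.array([[1.0, 0.0],
              [-1.0, 1.0],
              [0.0, -1.0]])            # row k holds w_k (K = 3, d = 2)
w0 = np.array([0.0, 0.5, -0.5])        # biases w_k0

def classify(x):
    scores = W @ x + w0                # y_k(x) for every class at once
    return int(np.argmax(scores))      # C(x) = k iff y_k(x) > y_j(x), all j != k

print(classify(np.array([2.0, 0.0])))  # scores [2.0, -1.5, -0.5] -> class 0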
Property of the Decision Regions

Theorem: The decision regions of the K-class discriminant yk(x) = wkᵀx + wk0 are singly connected and convex.

Proof. Suppose two points xA and xB both lie inside decision region Rk. Any point x̂ on the line between xA and xB can be expressed as

x̂ = λxA + (1 − λ)xB,   0 ≤ λ ≤ 1

By linearity,

yk(x̂) = λyk(xA) + (1 − λ)yk(xB)
       > λyj(xA) + (1 − λ)yj(xB)   (∀ j ≠ k)
       = yj(x̂)   (∀ j ≠ k)

Therefore x̂ also lies in Rk, and hence the region Rk is singly connected and convex. (figure: regions Ri, Rj, Rk with points xA, xB, x̂)

In other words: if two points xA and xB both lie inside the same decision region Rk, then any point x̂ on the line connecting them must also lie in Rk.

slide by Ce Liu
Fisher's Linear Discriminant

A way to view a linear classification model is in terms of dimensionality reduction: project x down to one dimension,

y = wᵀx

choosing the direction w so that the classes can be maximally separated. (figures: projections onto the difference of the means vs. onto Fisher's linear discriminant)

Class means:

m1 = (1/N1) ∑_{n∈C1} xn,   m2 = (1/N2) ∑_{n∈C2} xn

slide by Ce Liu
What's a Good Projection?

Maximize the separation of the projected class means:

(wᵀ(m1 − m2))² = wᵀ(m1 − m2)(m1 − m2)ᵀw = wᵀSBw

where SB = (m1 − m2)(m1 − m2)ᵀ is called the between-class covariance matrix.

Minimize the within-class variance of the projected data, wᵀSWw, where

SW = ∑_{n∈C1} (xn − m1)(xn − m1)ᵀ + ∑_{n∈C2} (xn − m2)(xn − m2)ᵀ

slide by Ce Liu
Fisher's Linear Discriminant

J(w) = Between-class variance / Within-class variance = wᵀSBw / wᵀSWw

Maximize J(w) by setting its derivative to zero, using the quotient rule: for f(x) = g(x)/h(x),

f′(x) = (g′(x)h(x) − g(x)h′(x)) / h²(x)

This gives

(wᵀSBw) SWw = (wᵀSWw) SBw
(wᵀSBw) SWw = (wᵀSWw) (m2 − m1) ((m2 − m1)ᵀw)

We only care about the direction of w, so the scalar factors are dropped. Therefore

w ∝ SW⁻¹ (m2 − m1)

slide by Ce Liu
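The closed-form direction is two lines of NumPy; a sketch on synthetic two-class data (the data generation is an assumption):

import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0, 0], [1.0, 0.3], (100, 2))   # class 1 samples
X2 = rng.normal([2, 1], [1.0, 0.3], (100, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter: sum of the two classes' centered outer products
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(Sw, m2 - m1)         # w ∝ Sw^{-1} (m2 - m1)
w /= np.linalg.norm(w)
print(w)
print((X1 @ w).mean(), (X2 @ w).mean())  # projected class means separate well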
From Fisher's Linear Discriminant to Classifiers

Fisher's linear discriminant reduces the classification problem to 1D: project onto w, then choose a threshold on the projected value (more elaborate thresholds lead to nonlinear classifiers). The final classifier has the form

y(x) = sign(wᵀx + w0)

where the nonlinear activation function sign(·) is a step function:

sign(a) = +1 if a > 0; −1 if a < 0

slide by Ce Liu