Logistic Regression
Jia-Bin Huang Virginia Tech
Spring 2019
ECE-5424G / CS-5824
Administrative

Please start HW 1 early! Questions are welcome!

Two principles for estimating parameters
Maximum Likelihood Estimate (MLE): choose $\theta$ that maximizes the probability of the observed data
$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(\text{data} \mid \theta)$

Maximum a Posteriori (MAP) estimate: choose $\theta$ that is most probable given the prior and the data
$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid \text{data}) = \arg\max_{\theta} \frac{P(\text{data} \mid \theta)\, P(\theta)}{P(\text{data})}$
Slide credit: Tom Mitchell
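As a concrete illustration (not from the slides), here is a minimal Python sketch estimating the bias of a coin: the MLE is the empirical frequency, while a MAP estimate under an assumed Beta(a, b) prior pulls the estimate toward the prior mean. The flip data and hyperparameters are made up.

```python
import numpy as np

# Hypothetical data: 10 coin flips, 7 heads (1) and 3 tails (0).
flips = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])
n_heads, n = flips.sum(), len(flips)

# MLE: theta that maximizes P(data | theta) under a Bernoulli model.
theta_mle = n_heads / n                            # 0.7

# MAP with an assumed Beta(a, b) prior on theta:
# argmax_theta P(data | theta) P(theta) has the closed form below.
a, b = 2.0, 2.0
theta_map = (n_heads + a - 1) / (n + a + b - 2)    # 8/12 ~= 0.67, pulled toward 0.5

print(theta_mle, theta_map)
```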
Bayes rule:
$P(Y \mid X_1, \dots, X_n) = \frac{P(X_1, \dots, X_n \mid Y)\, P(Y)}{P(X_1, \dots, X_n)} \propto P(X_1, \dots, X_n \mid Y)\, P(Y)$

How many parameters?
$P(Y)$: need only 1 parameter (binary $Y$).
$P(X_i \mid Y)$: need about $2 \times n$ parameters for binary $X_i$ (versus the exponentially many needed to model the full $P(X_1, \dots, X_n \mid Y)$ without an independence assumption).

Naïve Bayes assumption: the $X_i$ are conditionally independent given $Y$, so
$P(Y = y_k \mid X_1, \dots, X_n) = \frac{P(Y = y_k)\, P(X_1, \dots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, \dots, X_n \mid Y = y_j)} = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$

$\hat{Y} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
Slide credit: Tom Mitchell
Example (Bayes rule + conditional independence):
$P(Y=1) = 0.4$, $P(Y=0) = 0.6$
$P(X_1=1 \mid Y=1) = 0.2$, $P(X_1=0 \mid Y=1) = 0.8$, $P(X_1=1 \mid Y=0) = 0.7$, $P(X_1=0 \mid Y=0) = 0.3$
$P(X_2=1 \mid Y=1) = 0.3$, $P(X_2=0 \mid Y=1) = 0.7$, $P(X_2=1 \mid Y=0) = 0.9$, $P(X_2=0 \mid Y=0) = 0.1$
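A minimal sketch (mine, not part of the slides) that plugs these example numbers into the naïve Bayes rule to score a hypothetical observation $X_1 = 1, X_2 = 0$:

```python
# Parameters from the example above.
p_y = {1: 0.4, 0: 0.6}
p_x1_given_y = {(1, 1): 0.2, (0, 1): 0.8, (1, 0): 0.7, (0, 0): 0.3}  # (x1, y) -> prob
p_x2_given_y = {(1, 1): 0.3, (0, 1): 0.7, (1, 0): 0.9, (0, 0): 0.1}  # (x2, y) -> prob

x1, x2 = 1, 0  # hypothetical test observation

# Unnormalized scores P(Y=y) * P(X1=x1|Y=y) * P(X2=x2|Y=y)
score = {y: p_y[y] * p_x1_given_y[(x1, y)] * p_x2_given_y[(x2, y)] for y in (0, 1)}

# Normalize to get P(Y=y | X1=x1, X2=x2)
z = sum(score.values())
posterior = {y: s / z for y, s in score.items()}
print(posterior)  # prediction: argmax_y posterior[y]
```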
Naïve Bayes algorithm (discrete $X_i$):

Training: estimate $\pi_k = P(Y = y_k)$; for each value $x_{ij}$ of each attribute $X_i$, estimate $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$.

Classify a new example $X^{\text{test}}$:
$\hat{Y} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{\text{test}} \mid Y = y_k) = \arg\max_{y_k} \pi_k \prod_i \theta_{ijk}$
Slide credit: Tom Mitchell
Maximum likelihood estimates:
$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}$
$\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}$
Slide credit: Tom Mitchell
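A minimal sketch of these count-based MLE estimates for binary attributes; the function names and the tiny dataset are assumptions for illustration:

```python
import numpy as np

def fit_naive_bayes_mle(X, y):
    """X: (m, n) array of binary attributes, y: (m,) array of binary labels."""
    pi = np.array([np.mean(y == k) for k in (0, 1)])            # P(Y = k)
    # theta[k, i] = P(X_i = 1 | Y = k), estimated by relative frequency
    theta = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
    return pi, theta

def predict_naive_bayes(X, pi, theta):
    # P(X_i | Y = k) is theta[k, i] if X_i = 1, else 1 - theta[k, i]
    lik = np.where(X[:, None, :] == 1, theta[None], 1.0 - theta[None])  # (m, 2, n)
    scores = pi[None, :] * lik.prod(axis=2)                             # (m, 2)
    return scores.argmax(axis=1)

# Tiny made-up dataset
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
pi, theta = fit_naive_bayes_mle(X, y)
print(predict_naive_bayes(X, pi, theta))
```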
Example: estimate the following from data (values left blank on the slide):
$P(G=1) = \_\_$, $P(G=0) = \_\_$
$P(S=1 \mid G=1) = \_\_$, $P(S=0 \mid G=1) = \_\_$, $P(S=1 \mid G=0) = \_\_$, $P(S=0 \mid G=0) = \_\_$
$P(D=1 \mid G=1) = \_\_$, $P(D=0 \mid G=1) = \_\_$, $P(D=1 \mid G=0) = \_\_$, $P(D=0 \mid G=0) = \_\_$
$P(F=1 \mid G=1) = \_\_$, $P(F=0 \mid G=1) = \_\_$, $P(F=1 \mid G=0) = \_\_$, $P(F=0 \mid G=0) = \_\_$
Naïve Bayes: $P(G \mid S, D, F) \propto P(G)\, P(S \mid G)\, P(D \mid G)\, P(F \mid G)$
Naïve Bayes often works well in practice even when the conditional independence assumption is violated [Domingos & Pazzani, 1996].
$P(Y = y_k \mid X_1, \dots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
Slide credit: Tom Mitchell
Problem: the MLE estimate of $P(X_i \mid Y = y_k)$ might be zero (for example, $X_i$ = birthdate with value Feb_4_1995, never seen for class $y_k$), and a single zero factor wipes out the entire product
$P(Y = y_k \mid X_1, \dots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$.
Slide credit: Tom Mitchell
MLE estimates:
$\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}$, \quad $\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij},\, Y = y_k\}}{\#D\{Y = y_k\}}$

MAP estimates (Dirichlet priors; equivalently, adding $\beta - 1$ "hallucinated" examples per value):
$\hat{\pi}_k = \frac{\#D\{Y = y_k\} + (\beta_k - 1)}{|D| + \sum_m (\beta_m - 1)}$, \quad $\hat{\theta}_{ijk} = \frac{\#D\{X_i = x_{ij},\, Y = y_k\} + (\beta_j - 1)}{\#D\{Y = y_k\} + \sum_{j'} (\beta_{j'} - 1)}$
Slide credit: Tom Mitchell
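For the special case $\beta = 2$ (add-one / Laplace smoothing for binary $X_i$), the smoothed estimate can be sketched as follows (helper name and data are made up):

```python
import numpy as np

def smoothed_theta(X, y, k, beta=2.0):
    """MAP estimate of P(X_i = 1 | Y = k) with a symmetric Dirichlet/Beta prior.

    beta = 2 corresponds to add-one (Laplace) smoothing for binary X_i."""
    Xk = X[y == k]
    # add (beta - 1) hallucinated examples for each of the 2 values of X_i
    return (Xk.sum(axis=0) + (beta - 1)) / (len(Xk) + 2 * (beta - 1))

X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
print(smoothed_theta(X, y, k=1))  # no zero probabilities, even for unseen values
```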
Gaussian naïve Bayes: for continuous $X_i$, model the class-conditional densities as Gaussians
$P(X_i = x \mid Y = y_k) = \frac{1}{\sqrt{2\pi \sigma_{ik}^2}} \exp\!\left(-\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$
Slide credit: Tom Mitchell
Training: estimate $\pi_k = P(Y = y_k)$; for each attribute $X_i$, estimate the class-conditional mean $\mu_{ik}$ and variance $\sigma_{ik}$.

Classify a new example $X^{\text{test}}$:
$\hat{Y} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{\text{test}} \mid Y = y_k) = \arg\max_{y_k} \pi_k \prod_i \mathcal{N}(X_i^{\text{test}}; \mu_{ik}, \sigma_{ik})$
Slide credit: Tom Mitchell
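A minimal Gaussian naïve Bayes sketch along these lines; function names, the variance floor, and the toy data are my assumptions, not the lecture's:

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate priors and per-class, per-attribute Gaussian parameters."""
    classes = np.unique(y)
    pi = np.array([np.mean(y == k) for k in classes])
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    sigma = np.array([X[y == k].std(axis=0) + 1e-9 for k in classes])  # avoid divide-by-zero
    return classes, pi, mu, sigma

def predict_gnb(X, classes, pi, mu, sigma):
    # log P(Y=k) + sum_i log N(x_i; mu_ik, sigma_ik), computed for every class
    log_lik = -0.5 * np.log(2 * np.pi * sigma[None] ** 2) \
              - (X[:, None, :] - mu[None]) ** 2 / (2 * sigma[None] ** 2)
    scores = np.log(pi)[None, :] + log_lik.sum(axis=2)
    return classes[scores.argmax(axis=1)]

X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.2, 0.7]])
y = np.array([0, 0, 1, 1])
print(predict_gnb(X, *fit_gnb(X, y)))
```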
$P(Y = y_k \mid X_1, \dots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
Classification example: is a tumor malignant? $y \in \{0\ \text{(No)},\, 1\ \text{(Yes)}\}$; input feature: tumor size. (Figure: malignancy vs. tumor size, with a linear fit $h_\theta(x) = \theta^\top x$.)
Slide credit: Andrew Ng
Logistic regression hypothesis: $h_\theta(x) = g(\theta^\top x)$, where $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid (logistic) function. (Figure: $g(z)$ plotted against $z$.)
Slide credit: Andrew Ng
Features: $x_0 = 1$ (intercept), $x_1 = \text{tumorSize}$.
Slide credit: Andrew Ng
$h_\theta(x) = g(\theta^\top x)$, \quad $g(z) = \frac{1}{1 + e^{-z}}$

Suppose we predict "$y = 1$" if $h_\theta(x) \ge 0.5$, i.e., $z = \theta^\top x \ge 0$; predict "$y = 0$" if $h_\theta(x) < 0.5$, i.e., $z = \theta^\top x < 0$. (Figure: $g(z)$ vs. $z = \theta^\top x$.)
Slide credit: Andrew Ng
E.g., $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$: predict $y = 1$ when $-3 + x_1 + x_2 \ge 0$, i.e., on one side of the line $x_1 + x_2 = 3$. (Figure axes: tumor size, age.)
Slide credit: Andrew Ng
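A quick sketch (mine) using the example parameters above, showing that thresholding the sigmoid at 0.5 reproduces the linear decision boundary $x_1 + x_2 = 3$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])        # [theta_0, theta_1, theta_2] from the example

def predict(x1, x2):
    h = sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)
    return int(h >= 0.5)                   # equivalent to x1 + x2 >= 3

print(predict(1.0, 1.0))  # 0: below the line x1 + x2 = 3
print(predict(2.5, 1.0))  # 1: on/above the line
```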
Non-linear decision boundaries: add polynomial features, e.g.
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$

E.g., $\theta_0 = -1$, $\theta_1 = 0$, $\theta_2 = 0$, $\theta_3 = 1$, $\theta_4 = 1$: predict $y = 1$ if $-1 + x_1^2 + x_2^2 \ge 0$, i.e., outside the circle $x_1^2 + x_2^2 = 1$.

Higher-order polynomial features give more complex boundaries:
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1^2 x_2 + \theta_5 x_1^2 x_2^2 + \theta_6 x_1^3 x_2 + \cdots)$
Slide credit: Andrew Ng
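A small sketch (not from the slides) of the circular-boundary example: map $(x_1, x_2)$ to $(1, x_1, x_2, x_1^2, x_2^2)$ and apply the same threshold rule:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # theta_0 ... theta_4 from the example

def predict_circle(x1, x2):
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    return int(sigmoid(theta @ features) >= 0.5)  # y = 1 outside the unit circle

print(predict_circle(0.5, 0.5))  # 0: inside x1^2 + x2^2 = 1
print(predict_circle(1.0, 1.0))  # 1: outside
```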
Logistic regression hypothesis:
$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}} = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n)}}$

Where does this form come from? What is $P(Y \mid X_1, X_2, \dots, X_n)$?
Slide credit: Tom Mitchell
$P(Y = 1 \mid X) = \frac{P(Y=1)\, P(X \mid Y=1)}{P(Y=1)\, P(X \mid Y=1) + P(Y=0)\, P(X \mid Y=0)}$  (applying Bayes rule)

$= \frac{1}{1 + \frac{P(Y=0)\, P(X \mid Y=0)}{P(Y=1)\, P(X \mid Y=1)}}$  (dividing by $P(Y=1)\, P(X \mid Y=1)$)

$= \frac{1}{1 + \exp\!\left(\ln \frac{P(Y=0)\, P(X \mid Y=0)}{P(Y=1)\, P(X \mid Y=1)}\right)}$  (applying $\exp(\ln(\cdot))$)

$= \frac{1}{1 + \exp\!\left(\ln \frac{1-\pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\right)}$  (conditional independence, with $\pi = P(Y=1)$)

Plugging in the Gaussian class-conditionals $P(x \mid y_k) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(x - \mu_{ik})^2}{2\sigma_i^2}\right)$:

$\sum_i \ln \frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)} = \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)$

So Gaussian naïve Bayes implies the logistic form:
$P(Y = 1 \mid X_1, X_2, \dots, X_n) = \frac{1}{1 + \exp\!\left(\theta_0 + \sum_i \theta_i X_i\right)}$

Slide credit: Tom Mitchell
Slide credit: Andrew Ng
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)})$
Slide credit: Andrew Ng
Logistic regression cost:
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

(Figure: $-\log(h_\theta(x))$ and $-\log(1 - h_\theta(x))$ plotted over $h_\theta(x) \in [0, 1]$.)
Slide credit: Andrew Ng
Equivalently, in a single expression:
$\text{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log\!\left(1 - h_\theta(x)\right)$
Slide credit: Andrew Ng
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\!\left(1 - h_\theta(x^{(i)})\right) \right]$

Fit parameters: $\min_\theta J(\theta)$. Prediction: given a new $x$, output $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$.
Slide credit: Andrew Ng
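A minimal numpy sketch (helper names are mine) of this cross-entropy cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """J(theta) for logistic regression. X: (m, n) with a leading column of 1s, y: (m,)."""
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1 - eps)               # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny made-up example
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]])
y = np.array([1, 1, 0])
theta = np.zeros(2)
print(cost(theta, X, y))   # log(2) ~= 0.693 for theta = 0
```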
Training data: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$

Maximum likelihood estimate (joint likelihood):
$\theta_{\text{MLE}} = \arg\max_\theta P_\theta\!\left(x^{(1)}, y^{(1)}, \dots, x^{(m)}, y^{(m)}\right) = \arg\max_\theta \prod_{i=1}^{m} P_\theta\!\left(x^{(i)}, y^{(i)}\right)$
Slide credit: Tom Mitchell
Logistic regression model:
$P(y = 1 \mid x; \theta) = \frac{1}{1 + e^{-\theta^\top x}}$, \quad $P(y = 0 \mid x; \theta) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$

Training data: $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$

Rather than the joint likelihood $\prod_i P_\theta(x^{(i)}, y^{(i)})$, logistic regression maximizes the conditional likelihood:
$\theta \leftarrow \arg\max_\theta \prod_{i=1}^{m} P_\theta\!\left(y^{(i)} \mid x^{(i)}\right)$
Slide credit: Tom Mitchell
Conditional log likelihood:
$\ell(\theta) = \log \prod_{i=1}^{m} P_\theta\!\left(y^{(i)} \mid x^{(i)}\right) = \sum_{i=1}^{m} \log P_\theta\!\left(y^{(i)} \mid x^{(i)}\right)$
$= \sum_{i=1}^{m} \left[ y^{(i)} \log P_\theta\!\left(y^{(i)} = 1 \mid x^{(i)}\right) + \left(1 - y^{(i)}\right) \log P_\theta\!\left(y^{(i)} = 0 \mid x^{(i)}\right) \right]$
$= \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta(x^{(i)})\right) \right]$
Maximizing $\ell(\theta)$ is therefore equivalent to minimizing the logistic regression cost:
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\!\left(1 - h_\theta(x^{(i)})\right) \right]$

Goal: $\min_\theta J(\theta)$
Gradient descent:
Repeat {
  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
} (simultaneously update all $\theta_j$)

For logistic regression,
$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

Good news: $J(\theta)$ is a convex function! Bad news: no analytical solution.
Slide credit: Andrew Ng
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\!\left(1 - h_\theta(x^{(i)})\right) \right]$

Goal: $\min_\theta J(\theta)$

Repeat {
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
} (simultaneously update all $\theta_j$)
Slide credit: Andrew Ng
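A compact batch gradient descent sketch of this update rule; the learning rate, iteration count, and toy data are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent on J(theta). X: (m, n) with a leading column of 1s."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # (1/m) sum_i (h - y) x_j
        theta -= alpha * grad                        # simultaneous update of all theta_j
    return theta

# Tiny made-up dataset: predict y from one feature
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])
theta = fit_logistic_gd(X, y)
print(theta, sigmoid(X @ theta).round(2))
```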
The update rule looks identical for linear regression and for logistic regression:

Repeat {
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
}

but the hypothesis differs: $h_\theta(x) = \theta^\top x$ for linear regression vs. $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$ for logistic regression.
Slide credit: Andrew Ng
Maximum conditional likelihood estimate:
$\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta\!\left(y^{(i)} \mid x^{(i)}\right)$

Maximum conditional a posteriori estimate (e.g., with a zero-mean Gaussian prior $P(\theta)$ on the weights):
$\theta_{\text{MCAP}} = \arg\max_\theta P(\theta) \prod_{i=1}^{m} P_\theta\!\left(y^{(i)} \mid x^{(i)}\right)$
Slide credit: Tom Mitchell
Gradient descent, without regularization:
$\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

With regularization (the extra $-\alpha \lambda \theta_j$ term shrinks the weights toward zero):
$\theta_j \leftarrow \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Slide credit: Andrew Ng
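The same training loop as before with the extra shrinkage term, as a sketch (the value of lambda and other settings are arbitrary; note that, as written on the slide, every $\theta_j$ including the intercept is regularized):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd_l2(X, y, alpha=0.1, lam=0.01, iters=5000):
    """Gradient descent with the extra -alpha*lam*theta_j shrinkage term."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * (lam * theta + grad)   # theta_j <- theta_j - a*lam*theta_j - a*grad_j
    return theta
```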
Multi-class classification: one-vs-all. Classes 1, 2, 3. (Figure: three groups of points in the $(x_1, x_2)$ plane.)

Train one binary classifier per class:
$h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$ for $i = 1, 2, 3$

(Figure: three binary sub-problems $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$, each separating one class from the rest.)
Slide credit: Andrew Ng
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$. On a new input $x$, predict the class $i$ that maximizes $h_\theta^{(i)}(x)$.
Slide credit: Andrew Ng
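A one-vs-all sketch (all names and the toy data are mine): train one binary logistic regression classifier per class and predict the class whose classifier gives the largest $h_\theta^{(i)}(x)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, iters=3000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

def one_vs_all(X, y, classes):
    # One logistic regression classifier per class: class i vs. the rest
    return {i: fit_binary(X, (y == i).astype(float)) for i in classes}

def predict_ova(X, thetas):
    scores = np.column_stack([sigmoid(X @ t) for t in thetas.values()])
    labels = np.array(list(thetas.keys()))
    return labels[scores.argmax(axis=1)]   # pick the class i maximizing h^(i)(x)

# Tiny made-up 3-class example with a bias column
X = np.array([[1, 0.0], [1, 1.0], [1, 4.0], [1, 5.0], [1, 9.0], [1, 10.0]])
y = np.array([1, 1, 2, 2, 3, 3])
thetas = one_vs_all(X, y, classes=(1, 2, 3))
print(predict_ova(X, thetas))
```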
Generative classifier. Ex: naïve Bayes.
Estimate $P(Y)$ and $P(X \mid Y)$. Prediction:
$\hat{y} = \arg\max_y P(Y = y)\, P(X = x \mid Y = y)$

Discriminative classifier. Ex: logistic regression.
Estimate $P(Y \mid X)$ directly (or a discriminant function, e.g., SVM). Prediction:
$\hat{y} = \arg\max_y P(Y = y \mid X = x)$
Further reading:
Tom Mitchell, Generative and discriminative classifiers: Naïve Bayes and logistic regression. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
Ng & Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Summary: logistic regression

Hypothesis: $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

Cost: $\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

Gradient descent update: $\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

Regularized update: $\theta_j \leftarrow \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

Multi-class (one-vs-all) prediction: choose the class $i$ with the largest $h_\theta^{(i)}(x)$, i.e., $\max_i h_\theta^{(i)}(x)$.