CSCI 4520 - Introduction to Machine Learning
Logistic Regression
Spring 2020
Mehdi Allahyari, Georgia Southern University
(slides borrowed from Tom Mitchell, Barnabás Póczos & Aarti Singh)
Linear Regression & Linear Classification
[Figure: training data plotted as Weight vs. Height]
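To make the regression setup concrete, here is a minimal least-squares fit on synthetic (height, weight) pairs; the numbers and variable names are illustrative, not taken from the slides:

```python
import numpy as np

# Synthetic (height, weight) pairs standing in for the plotted training data.
height = np.array([150, 160, 165, 170, 175, 180, 185], dtype=float)
weight = np.array([52, 58, 62, 66, 71, 77, 82], dtype=float)

# Fit weight ~ w1 * height + w0 by ordinary least squares.
A = np.vstack([height, np.ones_like(height)]).T
w1, w0 = np.linalg.lstsq(A, weight, rcond=None)[0]
print(f"weight ~ {w1:.2f} * height + {w0:.1f}")
```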
But arg max_Y P(X|Y) P(Y) = arg max_Y P(Y|X)
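The identity is one application of Bayes rule; a short worked version (my restatement of the step the slide asserts):

```latex
% Bayes rule: P(Y|X) = P(X|Y) P(Y) / P(X).
% P(X) does not depend on Y, so dividing by it leaves the arg max unchanged:
\[
\arg\max_Y P(Y \mid X)
  = \arg\max_Y \frac{P(X \mid Y)\, P(Y)}{P(X)}
  = \arg\max_Y P(X \mid Y)\, P(Y).
\]
```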
Derive the form of P(Y|X) for Gaussian P(X_i | Y = y_k), assuming σ_ik = σ_i (variance independent of the class k)
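A condensed sketch of that derivation for two classes, Y ∈ {0, 1}, following the standard Gaussian Naïve Bayes argument (notation matches the slides; the intermediate algebra is filled in by me):

```latex
% Start from Bayes rule and divide through by the numerator:
\[
P(Y{=}1 \mid X)
 = \frac{P(Y{=}1)\,P(X \mid Y{=}1)}{P(Y{=}1)\,P(X \mid Y{=}1) + P(Y{=}0)\,P(X \mid Y{=}0)}
 = \frac{1}{1 + \exp\!\Big(\ln\frac{P(Y{=}0)}{P(Y{=}1)}
   + \sum_i \ln\frac{P(X_i \mid Y{=}0)}{P(X_i \mid Y{=}1)}\Big)}.
\]
% With P(X_i | Y=y_k) = N(\mu_{ik}, \sigma_i), the quadratic terms X_i^2/(2\sigma_i^2)
% cancel (this is where \sigma_{ik} = \sigma_i is needed), leaving a term linear in X_i:
\[
\ln\frac{P(X_i \mid Y{=}0)}{P(X_i \mid Y{=}1)}
 = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i
   + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2},
\]
% so P(Y=1|X) has the logistic form
\[
P(Y{=}1 \mid X) = \frac{1}{1 + \exp\big(w_0 + \sum_i w_i X_i\big)},
\qquad
w_i = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}.
\]
```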
Gradient: ∇E[w] ≡ [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
Learning rate: η > 0
Update rule: w ← w − η ∇E[w]
Batch gradient: use the error E_D[w] over the entire training set D
Do until satisfied:
1. Compute the gradient ∇E_D[w]
2. Update w ← w − η ∇E_D[w]
Stochastic gradient: use the error E_d[w] over single examples d ∈ D
Do until satisfied, for each training example d in D:
1. Compute the gradient ∇E_d[w]
2. Update w ← w − η ∇E_d[w]
Stochastic approximates batch arbitrarily closely as η → 0. Stochastic can be much faster when D is very large. Intermediate approach: use the error over subsets of D. (Both update schemes are sketched in code below.)
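A minimal NumPy sketch of the two update schemes applied to logistic regression; all function and variable names are my own, and the toy data is synthetic:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000):
    """One weight update per pass, using the gradient over the whole set D."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)  # gradient of the negative log-likelihood
        w -= eta * grad / len(y)
    return w

def stochastic_gradient_descent(X, y, eta=0.1, n_epochs=100, seed=0):
    """One weight update per example, using the gradient of a single d in D."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            grad = (sigmoid(X[i] @ w) - y[i]) * X[i]
            w -= eta * grad
    return w

# Toy data: two Gaussian blobs, with a constant column appended as the bias feature.
rng = np.random.default_rng(0)
X0 = rng.normal(-1.0, 1.0, size=(100, 2))
X1 = rng.normal(+1.0, 1.0, size=(100, 2))
X = np.hstack([np.vstack([X0, X1]), np.ones((200, 1))])
y = np.concatenate([np.zeros(100), np.ones(100)])

print(batch_gradient_descent(X, y))
print(stochastic_gradient_descent(X, y))
```

The intermediate (mini-batch) approach simply replaces the single example d in the inner loop with a small random subset of D.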
Training classifiers involves estimating f: X → Y, or P(Y|X)
Generative classifiers (e.g., Naïve Bayes): assume a functional form for P(X|Y) and P(Y), estimate their parameters from training data, then use Bayes rule to compute P(Y|X)
Discriminative classifiers (e.g., Logistic Regression): assume a functional form for P(Y|X) and estimate its parameters directly from training data
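The two routes to the same quantity, written out (a compact restatement, not from the slides):

```latex
% Generative: model the joint distribution, then invert with Bayes rule.
\[
\text{estimate } P(X \mid Y),\; P(Y)
\quad\Rightarrow\quad
P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{\sum_{y'} P(X \mid Y{=}y')\,P(Y{=}y')}.
\]
% Discriminative: parameterize the conditional directly, e.g. logistic regression:
\[
P(Y{=}1 \mid X, W) = \frac{1}{1 + \exp\big(w_0 + \sum_i w_i X_i\big)}.
\]
```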
Which method works better if we have infinite training data, and…
Recall the two assumptions used to derive the form of LR from Gaussian Naïve Bayes:
1. X_i conditionally independent of X_k given Y (for i ≠ k)
2. P(X_i | Y = y_k) is Gaussian N(μ_ik, σ_i), i.e., σ_ik = σ_i
Consider three learning methods:
– GNB (assumption 1 only)
– GNB2 (assumptions 1 and 2)
– Logistic Regression
Which method works better if we have infinite training data, and...
[Ng & Jordan, 2002]
What if we have only finite training data?
They converge at different rates to their asymptotic (∞-data) error.
Let ε_{A,n} refer to the expected error of learning algorithm A after n training examples.
Let d be the number of features: <X_1 … X_d>.
So, GNB requires n = O(log d) examples to converge to its asymptotic error, but LR requires n = O(d).
[Ng & Jordan, 2002]
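An illustrative simulation of these convergence rates using scikit-learn's GaussianNB and LogisticRegression on synthetic Gaussian data; the setup is my own and is not a reproduction of Ng & Jordan's experiments:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 50                                  # number of features
mu = rng.normal(0, 1, size=d)           # class-conditional means: +mu vs. -mu

def sample(n):
    """Draw n labeled points from two Gaussian class-conditional distributions."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(0, 2, size=(n, d)) + np.where(y[:, None] == 1, mu, -mu)
    return X, y

X_test, y_test = sample(5000)
for n in [10, 30, 100, 300, 1000]:
    X_train, y_train = sample(n)
    gnb = GaussianNB().fit(X_train, y_train)
    lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"n={n:5d}  GNB err={1 - gnb.score(X_test, y_test):.3f}  "
          f"LR err={1 - lr.score(X_test, y_test):.3f}")
```

On runs like this, GNB's error typically flattens out after far fewer examples, while LR keeps improving as n grows, matching the O(log d) vs. O(d) picture above.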
[Figure: learning curves for Naïve Bayes and Logistic Regression (Ng & Jordan, 2002)]
Logistic Regression:
– Functional form follows from Naïve Bayes assumptions
– But training procedure picks parameters without making the conditional independence assumption
– MLE training: pick W to maximize P(Y | X, W)
– MAP training: pick W to maximize P(W | X, Y)
Gradient ascent/descent:
– General approach when closed-form solutions are unavailable
Generative vs. discriminative classifiers:
– Bias vs. variance tradeoff
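How MAP training corresponds to regularization, sketched for a zero-mean Gaussian prior on W (the Gaussian prior is the usual choice in this argument, not stated explicitly on these slides):

```latex
% MAP: maximize the posterior over W; by Bayes rule (dropping terms constant in W):
\[
W_{MAP} = \arg\max_W \ln P(W \mid X, Y)
        = \arg\max_W \Big[ \ln P(Y \mid X, W) + \ln P(W) \Big].
\]
% With a zero-mean Gaussian prior W ~ N(0, \sigma^2 I), the prior term becomes
% an L2 penalty, i.e. MLE plus regularization:
\[
W_{MAP} = \arg\max_W \Big[ \sum_l \ln P(y^l \mid x^l, W)
          - \frac{\lambda}{2}\,\|W\|^2 \Big],
\qquad \lambda = 1/\sigma^2.
\]
```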
– Solution differs because of the objective (loss) function
– NB: features independent given class → assumption on P(X|Y)
– LR: functional form of P(Y|X), no assumption on P(X|Y)
– Decision rule is a hyperplane
– No closed-form solution
– Concave → global optimum with gradient ascent
– Maximum conditional a posteriori corresponds to regularization
– GNB (usually) needs less data
– LR (usually) gets to better solutions in the limit