 
Logistic Regression
Machine Learning
Where are we?
We have seen the following ideas:
– Linear models
– Learning as loss minimization
– Bayesian learning criteria (MAP and MLE estimation)
– The Naïve Bayes classifier
This lecture
• Logistic regression
• Connection to Naïve Bayes
• Training a logistic regression classifier
• Back to loss minimization
Logistic Regression: Setup
• The setting
– Binary classification
– Inputs: Feature vectors $\mathbf{x} \in \mathbb{R}^d$
– Labels: $y \in \{-1, +1\}$
• Training data
– $S = \{(\mathbf{x}_i, y_i)\}$, $m$ examples
Classification, but…
The output $y$ is discrete valued ($-1$ or $1$).
Instead of predicting the output, let us try to predict $P(y = 1 \mid \mathbf{x})$.
Expand the hypothesis space to functions whose output lies in $[0, 1]$:
• Original problem: $\mathbb{R}^d \to \{-1, 1\}$
• Modified problem: $\mathbb{R}^d \to [0, 1]$
• Effectively make the problem a regression problem
Many hypothesis spaces are possible.
The Sigmoid function
The hypothesis space for logistic regression: all functions of the form
$$\mathbf{x} \mapsto \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$
That is, a linear function composed with a sigmoid function (the logistic function) $\sigma$:
$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$
What is the domain and the range of the sigmoid function?
This is a reasonable choice. We will see why later.
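As a minimal sketch of this hypothesis space, the snippet below composes a linear function with the sigmoid using only the Python standard library. The helper names (`sigmoid`, `hypothesis`) and the example weights are illustrative assumptions, not taken from the lecture.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z < 0, so exp(z) is at most 1
    return e / (1.0 + e)

def hypothesis(w, x):
    """A linear function w^T x composed with the sigmoid."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return sigmoid(z)

w = [2.0, -1.0]   # illustrative weights (assumption)
x = [0.5, 1.0]    # chosen so that w^T x = 0
print(hypothesis(w, x))  # 0.5, since sigmoid(0) = 1/2
```

The sigmoid accepts any real number and, mathematically, returns a value strictly between 0 and 1. In floating point it saturates to exactly 0.0 or 1.0 for inputs of very large magnitude.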
The Sigmoid function
[Figure: plot of $\sigma(z)$ against $z$]
The Sigmoid function
What is its derivative with respect to $z$?
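For the logistic function $\sigma(z) = 1/(1 + \exp(-z))$, differentiating gives the standard identity $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$. The sketch below checks that identity against a central finite difference. The function names and the step size `h` are illustrative choices.

```python
import math

def sigmoid(z):
    """Logistic function; fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Closed-form derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, h=1e-5):
    """Numerical derivative estimate (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-2.0, 0.0, 3.0):
    # The two columns should agree to many decimal places.
    print(z, sigmoid_grad(z), central_difference(sigmoid, z))
```

The derivative peaks at $z = 0$, where it equals $1/4$, and decays toward zero in both tails.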
Predicting probabilities
According to the logistic regression model, we have
$$P(y = +1 \mid \mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$
$$P(y = -1 \mid \mathbf{x}; \mathbf{w}) = 1 - \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(\mathbf{w}^T \mathbf{x})}$$
Or equivalently
$$P(y \mid \mathbf{x}; \mathbf{w}) = \frac{1}{1 + \exp(-y\, \mathbf{w}^T \mathbf{x})}$$
Note that we are directly modeling $P(y \mid \mathbf{x})$ rather than $P(\mathbf{x} \mid y)$ and $P(y)$.
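The compact form $P(y \mid \mathbf{x}; \mathbf{w}) = \sigma(y\, \mathbf{w}^T \mathbf{x})$ covers both labels at once. The following sketch evaluates it for $y = +1$ and $y = -1$ and confirms that the two probabilities sum to one. The weights and the input are made-up values for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob(y, w, x):
    """P(y | x; w) for a label y in {-1, +1} under the logistic model."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return sigmoid(y * z)

w = [1.0, -2.0]   # illustrative weights (assumption)
x = [3.0, 1.0]    # w^T x = 1
p_pos = prob(+1, w, x)
p_neg = prob(-1, w, x)
print(p_pos, p_neg, p_pos + p_neg)
```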
Predicting a label with logistic regression
• Compute $P(y = 1 \mid \mathbf{x}; \mathbf{w})$
• If this is greater than half, predict 1, else predict -1
– What does this correspond to in terms of $\mathbf{w}^T \mathbf{x}$?
– Prediction = $\mathrm{sgn}(\mathbf{w}^T \mathbf{x})$
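Because $\sigma(z) > 1/2$ exactly when $z > 0$, thresholding the probability at one half and taking the sign of $\mathbf{w}^T \mathbf{x}$ give the same label. The sketch below compares the two rules on a few hand-picked points. The names and data are illustrative, and ties ($\mathbf{w}^T \mathbf{x} = 0$) are sent to $-1$ to match the "greater than half" rule.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def predict_by_probability(w, x):
    """Predict +1 if P(y = 1 | x; w) exceeds one half, else -1."""
    return 1 if sigmoid(dot(w, x)) > 0.5 else -1

def predict_by_sign(w, x):
    """Predict with the sign of w^T x (zero is mapped to -1)."""
    return 1 if dot(w, x) > 0 else -1

w = [1.0, -1.0]  # illustrative weights (assumption)
points = [[1.0, 2.0], [-1.0, 0.5], [0.0, 0.0], [3.0, -1.0]]
labels_prob = [predict_by_probability(w, x) for x in points]
labels_sign = [predict_by_sign(w, x) for x in points]
print(labels_prob, labels_sign)
```

In floating point the two rules can disagree for scores extremely close to zero, where the sigmoid rounds to exactly 0.5, so the sign test is the more direct way to predict.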
Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function:
$$\log \frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})} = \mathbf{w}^T \mathbf{x}$$
Here, the P's represent the Naïve Bayes posterior distribution, and $\mathbf{w}$ can be used to calculate the priors and the likelihoods. That is, $P(y = 1 \mid \mathbf{x}, \mathbf{w})$ is computed using $P(\mathbf{x} \mid y = 1, \mathbf{w})$ and $P(y = 1 \mid \mathbf{w})$.
But we also know that
$$P(y = +1 \mid \mathbf{x}, \mathbf{w}) = 1 - P(y = -1 \mid \mathbf{x}, \mathbf{w})$$
Substituting in the above expression, we get
$$P(y = +1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$
That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs.
Naïve Bayes is a generative model. Logistic Regression is the discriminative version.
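The substitution step can be written out explicitly. This is a short worked derivation, taking the log-odds in favor of the label $+1$ and writing $p$ as shorthand (an introduced symbol) for $P(y = +1 \mid \mathbf{x}, \mathbf{w})$:

```latex
\begin{align*}
\log \frac{p}{1 - p} &= \mathbf{w}^T \mathbf{x}
  && \text{since } P(y = -1 \mid \mathbf{x}, \mathbf{w}) = 1 - p \\
\frac{p}{1 - p} &= \exp(\mathbf{w}^T \mathbf{x}) \\
p &= \frac{\exp(\mathbf{w}^T \mathbf{x})}{1 + \exp(\mathbf{w}^T \mathbf{x})}
   = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}
   = \sigma(\mathbf{w}^T \mathbf{x})
\end{align*}
```

So a linear log-odds function and a sigmoid-shaped posterior are two views of the same model.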
This lecture
• Logistic regression
• Connection to Naïve Bayes
• Training a logistic regression classifier
– First: Maximum likelihood estimation
– Then: Adding priors → Maximum a Posteriori estimation
• Back to loss minimization
Maximum likelihood estimation
Let's get back to the problem of learning.
• Training data
– $S = \{(\mathbf{x}_i, y_i)\}$, $m$ examples
• What we want
– Find a $\mathbf{w}$ such that $P(S \mid \mathbf{w})$ is maximized
– We know that our examples are drawn independently and are identically distributed (i.i.d.)
– How do we proceed?
Maximum likelihood estimation
$$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$$
The usual trick: convert products to sums by taking the log. Recall that this works only because log is an increasing function, so the maximizer will not change.
Equivalent to solving
$$\max_{\mathbf{w}} \sum_{i=1}^{m} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$$
But (by definition) we know that
$$P(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sigma(y_i \mathbf{w}^T \mathbf{x}_i) = \frac{1}{1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)}$$
Equivalent to solving
$$\max_{\mathbf{w}} \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right)$$
The goal: Maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.
Equivalent to: Training a linear classifier by minimizing the logistic loss.
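Maximizing the log-likelihood is the same as minimizing the summed logistic loss $\sum_i \log(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i))$. The sketch below minimizes it with plain batch gradient descent on a tiny, made-up dataset. Gradient descent, the learning rate, the step count, and the toy data are all assumptions for illustration, not the lecture's prescribed training procedure.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))

def logistic_loss(w, data):
    """Negative log-likelihood: sum_i log(1 + exp(-y_i w^T x_i))."""
    return sum(log1p_exp(-y * dot(w, x)) for x, y in data)

def gradient(w, data):
    """Gradient of the loss: sum_i -y_i * sigmoid(-y_i w^T x_i) * x_i."""
    g = [0.0] * len(w)
    for x, y in data:
        c = -y * sigmoid(-y * dot(w, x))
        for j, xj in enumerate(x):
            g[j] += c * xj
    return g

def train(data, dim, lr=0.1, steps=200):
    """Batch gradient descent starting from the zero vector."""
    w = [0.0] * dim
    for _ in range(steps):
        g = gradient(w, data)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

# Toy examples (x, y); the last feature is a constant 1 acting as a bias term.
data = [
    ([2.0, 1.0, 1.0], +1),
    ([1.5, 2.0, 1.0], +1),
    ([-1.0, -1.5, 1.0], -1),
    ([-2.0, -0.5, 1.0], -1),
]
w = train(data, dim=3)
print(logistic_loss([0.0, 0.0, 0.0], data), logistic_loss(w, data))
```

On linearly separable data like this toy set, the unregularized loss can be pushed arbitrarily close to zero by scaling up the weights, so the maximum likelihood solution is not finite. That is one motivation for adding a prior on the weights.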
Maximum a posteriori estimation
We could also add a prior on the weights.
Suppose each weight in the weight vector is drawn independently from the normal distribution with zero mean and standard deviation $\sigma$ (here $\sigma$ is the standard deviation, not the sigmoid function):
$$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{2\sigma^2}\right)$$
MAP estimation for logistic regression
Let us work through this procedure again to see what changes.
What is the goal of MAP estimation? (In maximum likelihood, we maximized the likelihood of the data.)
To maximize the posterior probability of the model given the data (i.e., to find the most probable model, given the data):
$$P(\mathbf{w} \mid S) \propto P(S \mid \mathbf{w})\, P(\mathbf{w})$$
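Taking logs of the posterior shows what changes relative to maximum likelihood. The following is a sketch of that step, assuming the standard zero-mean Gaussian density with variance $\sigma^2$ for each weight and dropping terms that do not depend on $\mathbf{w}$:

```latex
\begin{align*}
\arg\max_{\mathbf{w}} P(\mathbf{w} \mid S)
  &= \arg\max_{\mathbf{w}} \; \log P(S \mid \mathbf{w}) + \log P(\mathbf{w}) \\
  &= \arg\max_{\mathbf{w}} \; \sum_{i=1}^{m} -\log\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right)
     - \frac{1}{2\sigma^2} \mathbf{w}^T \mathbf{w}
\end{align*}
```

Under this assumption the Gaussian prior contributes a squared-norm penalty on the weights, with a smaller $\sigma$ giving a stronger pull toward zero. The exact constant in front of $\mathbf{w}^T \mathbf{w}$ depends on how the density is parameterized.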