Machine Learning - MT 2016 8. Classification: Logistic Regression



  1. Machine Learning - MT 2016
8. Classification: Logistic Regression
Varun Kanade, University of Oxford, November 2, 2016

  2. Logistic Regression
Logistic Regression is actually a classification method. In its simplest form it is a binary (two classes) classification method.
◮ Today’s Lecture: We’ll denote the two classes by 0 and 1
◮ Next Week: Sometimes it’s more convenient to call them −1 and +1
◮ Ultimately, the choice is just for mathematical convenience
It is a discriminative method. We only model: p(y | w, x)

  3. Logistic Regression (LR)
◮ LR builds on a linear model composed with a sigmoid function: p(y | w, x) = Bernoulli(sigmoid(w · x))
◮ Z ∼ Bernoulli(θ) means Z = 1 with probability θ, and Z = 0 with probability 1 − θ
◮ Recall that the sigmoid function is defined by: sigmoid(t) = 1 / (1 + e^(−t))
[Figure: plot of sigmoid(t) for t ∈ [−4, 4], rising from near 0 to near 1]
◮ As we did in the case of linear models, we assume x_0 = 1 for all datapoints, so we do not need to handle the bias term w_0 separately
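A minimal NumPy sketch of this definition (the helper name `sigmoid` and the example values are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(t):
    """sigmoid(t) = 1 / (1 + e^(-t)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-t))

# Matches the plotted shape: near 0 at t = -4, 0.5 at t = 0, near 1 at t = 4.
print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # ~[0.018, 0.5, 0.982]
```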

  4. Prediction Using Logistic Regression
Suppose we have estimated the model parameters w ∈ R^D. For a new datapoint x_new, the model gives us the probability
p(y_new = 1 | x_new, w) = sigmoid(w · x_new) = 1 / (1 + exp(−w · x_new))
In order to make a prediction we can simply use a threshold at 1/2:
y_new = I(sigmoid(w · x_new) ≥ 1/2) = I(w · x_new ≥ 0)
The class boundary is linear (a separating hyperplane).
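A sketch of this prediction rule; the weight vector below is hypothetical and only illustrates the equivalence I(sigmoid(w · x) ≥ 1/2) = I(w · x ≥ 0):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict(w, x_new):
    """Return the class probability and the hard 0/1 prediction.

    Thresholding sigmoid(w . x) at 1/2 is the same as thresholding
    w . x at 0, so the decision boundary is a hyperplane.
    """
    score = np.dot(w, x_new)
    return sigmoid(score), int(score >= 0)

# Hypothetical weights; x_new[0] = 1 is the bias feature from the slides.
w = np.array([-1.0, 2.0])
x_new = np.array([1.0, 0.8])
print(predict(w, x_new))  # (sigmoid(0.6) ~ 0.646, 1)
```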

  5. Prediction Using Logistic Regression
[Figure: illustration of logistic regression predictions; image not reproduced in this transcript]

  6. Likelihood of Logistic Regression
Data D = {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^D and y_i ∈ {0, 1}. Let us denote the sigmoid function by σ. We can write the likelihood of observing the data given model parameters w as:
p(y | X, w) = ∏_{i=1}^N σ(w^T x_i)^{y_i} · (1 − σ(w^T x_i))^{1 − y_i}
Let us denote µ_i = σ(w^T x_i). We can write the negative log-likelihood as:
NLL(y | X, w) = −∑_{i=1}^N (y_i log µ_i + (1 − y_i) log(1 − µ_i))
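A direct NumPy translation of this negative log-likelihood (the clipping guard against log(0) is an added implementation detail, not part of the slide's formula):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nll(w, X, y, eps=1e-12):
    """NLL(y | X, w) = -sum_i (y_i log mu_i + (1 - y_i) log(1 - mu_i))."""
    mu = sigmoid(X @ w)               # mu_i = sigmoid(w . x_i), for all i at once
    mu = np.clip(mu, eps, 1.0 - eps)  # keep log() away from 0
    return -np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
```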

  7. Likelihood of Logistic Regression
Recall that µ_i = σ(w^T x_i) and the negative log-likelihood is
NLL(y | X, w) = −∑_{i=1}^N (y_i log µ_i + (1 − y_i) log(1 − µ_i))
Let us focus on a single datapoint; its contribution to the negative log-likelihood is
NLL(y_i | x_i, w) = −(y_i log µ_i + (1 − y_i) log(1 − µ_i))
This is basically the cross-entropy between y_i and µ_i. If y_i = 1, then:
◮ As µ_i → 1, NLL(y_i | x_i, w) → 0
◮ As µ_i → 0, NLL(y_i | x_i, w) → ∞
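A quick numerical check of these limits for a point with y_i = 1, where the contribution reduces to −log µ_i (the sample values of µ_i are made up for illustration):

```python
import numpy as np

# Contribution of one datapoint with y_i = 1 is -log(mu_i).
for mu in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(mu, -np.log(mu))
# As mu -> 1 the loss tends to 0; as mu -> 0 it blows up.
```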

  8. Maximum Likelihood Estimate for LR
Recall that µ_i = σ(w^T x_i) and the negative log-likelihood is
NLL(y | X, w) = −∑_{i=1}^N (y_i log µ_i + (1 − y_i) log(1 − µ_i))
We can take the gradient with respect to w:
∇_w NLL(y | X, w) = ∑_{i=1}^N x_i (µ_i − y_i) = X^T (µ − y)
And the Hessian is given by H = X^T S X, where S is a diagonal matrix with S_ii = µ_i (1 − µ_i).
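The gradient and Hessian translate almost line-for-line into NumPy; a sketch (the broadcasting trick for X^T S X, which avoids forming the diagonal matrix S explicitly, is my choice, not the slide's):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient(w, X, y):
    """grad NLL = X^T (mu - y)."""
    mu = sigmoid(X @ w)
    return X.T @ (mu - y)

def hessian(w, X):
    """H = X^T S X with S = diag(mu_i (1 - mu_i))."""
    mu = sigmoid(X @ w)
    s = mu * (1.0 - mu)
    return X.T @ (X * s[:, None])  # scales row i of X by S_ii, then multiplies
```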

  9. Iteratively Re-Weighted Least Squares (IRLS)
Depending on the dimension, we can apply Newton’s method to estimate w. Let w_t be the parameters after t Newton steps. The gradient and Hessian are given by:
g_t = X^T (µ_t − y) = −X^T (y − µ_t)
H_t = X^T S_t X
The Newton update rule is:
w_{t+1} = w_t − H_t^{−1} g_t
        = w_t + (X^T S_t X)^{−1} X^T (y − µ_t)
        = (X^T S_t X)^{−1} X^T S_t (X w_t + S_t^{−1} (y − µ_t))
        = (X^T S_t X)^{−1} X^T S_t z_t
where z_t = X w_t + S_t^{−1} (y − µ_t). Then w_{t+1} is the solution of the following weighted least squares problem:
minimise ∑_{i=1}^N S_{t,ii} (z_{t,i} − w^T x_i)^2
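Putting the pieces together, a sketch of the IRLS loop under the update rule above (the iteration count, the clipping of S_ii away from zero, and the toy data are pragmatic additions of mine, not part of the derivation):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def irls(X, y, n_iters=20):
    """Newton's method / IRLS for logistic regression.

    Each step solves the weighted least squares problem
    minimise sum_i S_ii (z_i - w . x_i)^2, with working
    responses z = X w + S^{-1} (y - mu).
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        mu = sigmoid(X @ w)
        s = np.clip(mu * (1.0 - mu), 1e-10, None)  # diagonal of S_t
        z = X @ w + (y - mu) / s                   # z_t
        Xs = X * s[:, None]                        # rows of X scaled by S_ii
        w = np.linalg.solve(X.T @ Xs, Xs.T @ z)    # (X^T S X) w = X^T S z
    return w

# Hypothetical toy data; x[:, 0] = 1 is the bias column.
# Note: on perfectly separable data the MLE diverges; this set is not separable.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
print(irls(X, y, n_iters=10))
```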

  10. Multiclass Logistic Regression
Multiclass logistic regression is also a discriminative classifier. Let the inputs be x ∈ R^D and y ∈ {1, ..., C}. There are parameters w_c ∈ R^D for every class c = 1, ..., C. We’ll put these together in a matrix W that is D × C. The multiclass logistic model is given by:
p(y = c | x, W) = exp(w_c^T x) / ∑_{c'=1}^C exp(w_{c'}^T x)

  11. Multiclass Logistic Regression
The multiclass logistic model is given by:
p(y = c | x, W) = exp(w_c^T x) / ∑_{c'=1}^C exp(w_{c'}^T x)
Recall the softmax function. Softmax maps a set of numbers to a probability distribution with mode at the maximum:
softmax([a_1, ..., a_C]^T) = [e^{a_1}/Z, ..., e^{a_C}/Z]^T, where Z = ∑_{c=1}^C e^{a_c}
The multiclass logistic model is simply:
p(y | x, W) = softmax([w_1^T x, ..., w_C^T x]^T)
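A sketch of the softmax and the resulting class probabilities (the max-shift is a standard numerical-stability trick not mentioned on the slide, and the weight matrix below is hypothetical):

```python
import numpy as np

def softmax(a):
    """Map scores a_1..a_C to a probability distribution."""
    a = a - np.max(a)  # shift for numerical stability; does not change the result
    e = np.exp(a)
    return e / e.sum()

def multiclass_probs(W, x):
    """p(y = c | x, W) = softmax([w_1 . x, ..., w_C . x]); W is D x C."""
    return softmax(W.T @ x)

# Hypothetical 2-feature, 3-class example.
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 1.0,  0.0]])
x = np.array([1.0, 2.0])
print(multiclass_probs(W, x))  # a length-3 vector summing to 1
```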

  12. Multiclass Logistic Regression
[Figure: illustration of multiclass logistic regression; image not reproduced in this transcript]

  13. Summary: Logistic Regression
◮ Logistic Regression is a (binary) classification method
◮ It is a discriminative model
◮ Extension to multiclass by replacing the sigmoid with the softmax
◮ Can derive Maximum Likelihood Estimates using Convex Optimisation
◮ See Chap 8.3 in Murphy (for multiclass), but we’ll revisit it as a form of neural network

  14. Next Week
◮ Support Vector Machines
◮ Kernel Methods
◮ Revise Linear Programming and Convex Optimisation
