

SLIDE 1

Machine Learning

Logistic Regression

Hamid R. Rabiee

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1/

SLIDE 2

Agenda

• Probabilistic classification
• Introduction to logistic regression
• Binary logistic regression
• Logistic regression: decision surface
• Logistic regression: ML estimation
• Logistic regression: gradient descent
• Logistic regression: multi-class
• Logistic regression: regularization
• Logistic regression vs. Naïve Bayes

SLIDE 3

Probabilistic Classification

• Generative probabilistic classification (previous lecture)
  • Motivation: assume a distribution for each class and try to find the parameters of those distributions.
  • Cons: need to assume distributions; need to fit many parameters.

• Discriminative approach: logistic regression (focus of today)
  • Motivation: like least squares, but assume the logistic form y(x) = σ(wᵀx); classify based on whether y(x) > 0.5.
  • Technique: gradient descent.

SLIDE 4

Introduction to Logistic Regression

• Logistic regression represents the probability of category i using a linear function of the input variables.
• The name comes from the logit transformation:
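As a sketch of that transformation for the binary case, writing p = P(Y = 1 | x):

$$\operatorname{logit}(p) \;=\; \ln\frac{p}{1-p} \;=\; w_0 + \sum_{i=1}^{n} w_i x_i$$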

SLIDE 5

Binary Logistic Regression

• Logistic regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed in the case where Y is boolean is:

$$P(Y=1 \mid X) \;=\; \frac{\exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}{1 + \exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)} \qquad (1)$$

$$P(Y=0 \mid X) \;=\; \frac{1}{1 + \exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)} \qquad (2)$$

• Notice that equation (2) follows directly from equation (1), because the sum of these two probabilities must equal 1.

SLIDE 6

Binary Logistic Regression

• We only need one set of parameters, w = ⟨w0, w1, ..., wn⟩, since P(Y=0|X) = 1 − P(Y=1|X).
• Sigmoid (logistic) function: σ(z) = 1 / (1 + e^(−z)).
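A minimal Python sketch of this model (the names `sigmoid` and `predict_proba` are illustrative, not from the course):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """P(Y=1 | x, w) = sigmoid(w0 + sum_i w_i x_i), matching equation (1);
    w = [w0, w1, ..., wn], x = [x1, ..., xn]."""
    return sigmoid(w[0] + np.dot(w[1:], x))
```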

SLIDE 7

Adapted from slides of John Whitehead

Logistic Regression vs. Linear Regression

SLIDE 8

Logistic Regression: Decision Surface

• Given a logistic regression with weights w and an input x:
• Decision surface: P(Y=1 | x, w) = constant, i.e. w0 + Σi wi xi = constant.
• Decision surfaces are therefore linear functions of x.
• Decision making on Y: predict Y = 1 when P(Y=1 | x, w) > 0.5, i.e. when w0 + Σi wi xi > 0; predict Y = 0 otherwise.
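A minimal sketch of this decision rule in Python (`predict_label` is an illustrative name):

```python
import numpy as np

def predict_label(w, x):
    """Predict Y = 1 exactly when P(Y=1 | x, w) > 0.5; since the sigmoid
    crosses 0.5 at z = 0, this reduces to the sign of the linear score."""
    score = w[0] + np.dot(w[1:], x)   # w0 + sum_i w_i x_i
    return 1 if score > 0 else 0
```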

SLIDE 9

Computing the Likelihood in Detail

 We can re-express the log of the conditional likelihood as:

$$
\begin{aligned}
l(\mathbf{w}) &= \sum_{l}\Big[\, y^{l}\,\ln P(y^{l}=1 \mid \mathbf{x}^{l}, \mathbf{w}) + (1-y^{l})\,\ln P(y^{l}=0 \mid \mathbf{x}^{l}, \mathbf{w}) \Big] \\
&= \sum_{l}\Big[\, y^{l}\,\ln \frac{P(y^{l}=1 \mid \mathbf{x}^{l}, \mathbf{w})}{P(y^{l}=0 \mid \mathbf{x}^{l}, \mathbf{w})} + \ln P(y^{l}=0 \mid \mathbf{x}^{l}, \mathbf{w}) \Big] \\
&= \sum_{l}\Big[\, y^{l}\,\mathbf{w}^{\top}\mathbf{x}^{l} - \ln\!\big(1 + \exp(\mathbf{w}^{\top}\mathbf{x}^{l})\big) \Big]
\end{aligned}
$$

(with the bias absorbed by setting x0 = 1, so that wᵀx = w0 + Σi wi xi)
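A sketch of this expression in Python (illustrative names; X holds one example per row, with the bias handled separately, and y holds 0/1 labels):

```python
import numpy as np

def log_likelihood(w, X, y):
    """Conditional log-likelihood l(w) = sum_l [ y_l * z_l - ln(1 + exp(z_l)) ],
    where z_l = w0 + w . x_l."""
    z = w[0] + X @ w[1:]
    # np.logaddexp(0, z) computes ln(1 + exp(z)) in a numerically stable way
    return np.sum(y * z - np.logaddexp(0, z))
```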

SLIDE 10

Logistic regression: ML estimation

• The conditional log likelihood l(w) is concave in w.
• There is no closed-form solution.
• What are concave and convex functions, and how do we optimize them?

SLIDE 11

Optimizing Concave/Convex Functions

• Maximizing a concave function f is the same as minimizing the convex function −f.
• Gradient ascent (concave) / gradient descent (convex).

SLIDE 12

Gradient Ascent / Gradient Descent

• For a function f(w):
  • If f is concave: gradient ascent rule.
  • If f is convex: gradient descent rule.
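These rules take the standard form, with learning rate η:

$$\mathbf{w} \leftarrow \mathbf{w} + \eta\,\nabla_{\mathbf{w}} f(\mathbf{w}) \quad \text{(gradient ascent, concave } f)$$

$$\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla_{\mathbf{w}} f(\mathbf{w}) \quad \text{(gradient descent, convex } f)$$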

SLIDE 13

Logistic regression: Gradient descent

• Iteratively updating the weights in this fashion increases the likelihood each round.
• We eventually reach the maximum.
• We are near the maximum when changes in the weights are small.
• Thus, we can stop when the sum of the absolute values of the weight differences is less than some small number (see the sketch below).
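A minimal sketch of this procedure, gradient ascent on the concave log-likelihood with the stopping rule just described (`fit_logistic`, `eta`, and `tol` are illustrative names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, tol=1e-6, max_iter=10_000):
    """Gradient ascent on the conditional log-likelihood.
    X: (m, n) matrix of examples, y: 0/1 labels of length m."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend a bias column of 1s
    w = np.zeros(n + 1)                    # initial weight vector
    for _ in range(max_iter):
        p = sigmoid(Xb @ w)                # P(Y=1 | x, w) for every example
        grad = Xb.T @ (y - p)              # gradient of the log-likelihood
        w_new = w + eta * grad             # ascent step (likelihood is concave)
        if np.sum(np.abs(w_new - w)) < tol:  # weight changes are small: stop
            return w_new
        w = w_new
    return w
```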

SLIDE 14

Logistic regression: multi-class

• In the two-class case we used the logistic sigmoid.
• For multiclass, we work with the soft-max function (a.k.a. softmax) instead of the logistic sigmoid:

$$P(Y=k \mid \mathbf{x}) \;=\; \frac{\exp(\mathbf{w}_k^{\top}\mathbf{x})}{\sum_{j}\exp(\mathbf{w}_j^{\top}\mathbf{x})}$$
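A small Python sketch of the soft-max (not from the slides); subtracting the maximum score is a common trick to keep exp() from overflowing:

```python
import numpy as np

def softmax(scores):
    """Turns a vector of class scores (w_k . x for each class k)
    into the probabilities P(Y = k | x)."""
    shifted = scores - np.max(scores)  # shift for numerical stability
    e = np.exp(shifted)
    return e / np.sum(e)
```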

SLIDE 15

Logistic Regression: Regularization

• Overfitting the training data is a problem that can arise in logistic regression, especially when the data has very high dimension and is sparse.
• One approach to reducing overfitting is regularization, in which we create a modified "penalized log likelihood function" that penalizes large values of w.
• The derivative of this penalized log likelihood function is similar to our earlier derivative, with one additional penalty term, which gives us the modified gradient ascent rule:

$$\mathbf{w} \leftarrow \arg\max_{\mathbf{w}} \;\sum_{l} \ln P(y^{l} \mid \mathbf{x}^{l}, \mathbf{w}) \;-\; \frac{\lambda}{2}\,\lVert\mathbf{w}\rVert^{2}$$

$$\frac{\partial l(\mathbf{w})}{\partial w_i} \;=\; \sum_{l} x_i^{l}\,\big(y^{l} - \hat{P}(y^{l}=1 \mid \mathbf{x}^{l}, \mathbf{w})\big) \;-\; \lambda w_i$$

$$w_i \leftarrow w_i \;+\; \eta \sum_{l} x_i^{l}\,\big(y^{l} - \hat{P}(y^{l}=1 \mid \mathbf{x}^{l}, \mathbf{w})\big) \;-\; \eta\,\lambda\, w_i$$
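A sketch of one such modified update step in Python (illustrative names; `lam` stands in for λ, and whether the bias w0 is penalized is a design choice):

```python
import numpy as np

def regularized_step(w, X, y, eta=0.1, lam=0.01):
    """One gradient step: the usual likelihood gradient
    plus the -eta * lam * w term from the L2 penalty."""
    p = 1.0 / (1.0 + np.exp(-(w[0] + X @ w[1:])))  # P(Y=1 | x, w)
    grad = np.concatenate([[np.sum(y - p)], X.T @ (y - p)])
    return w + eta * grad - eta * lam * w  # here the bias is penalized too
```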

SLIDE 16

Logistic Regression vs. Naïve Bayes

• In general, NB and LR make different assumptions:
  • NB: features independent given the class, i.e. an assumption on P(X|Y).
  • LR: a functional form for P(Y|X); no assumption on P(X|Y).
• LR is a linear classifier:
  • its decision rule is a hyperplane.
• LR is optimized by conditional likelihood:
  • no closed-form solution;
  • concave, so gradient ascent finds the global optimum.

SLIDE 17

Logistic Regression vs. Naïve Bayes

• Consider Y and all Xi boolean, X = ⟨X1, ..., Xn⟩.
• Number of parameters:
  • NB: 2n + 1
  • LR: n + 1
  • (For example, with n = 100 features, NB estimates 201 parameters while LR estimates 101.)
• Estimation method:
  • NB parameter estimates are uncoupled.
  • LR parameter estimates are coupled.

SLIDE 18

Logistic Regression vs. Gaussian Naïve Bayes

• When the GNB modeling assumptions do not hold, logistic regression and GNB typically learn different classifier functions.
• While logistic regression is consistent with the Naïve Bayes assumption that the input features Xi are conditionally independent given Y, it is not rigidly tied to this assumption as Naïve Bayes is.
• GNB parameter estimates converge toward their asymptotic values in order log(n) examples, where n is the dimension of X. Logistic regression parameter estimates converge more slowly, requiring order n examples.

SLIDE 19

Summary

• Logistic regression learns the conditional probability distribution P(y|x).
• Local search: it begins with an initial weight vector and modifies it iteratively to maximize an objective function.
• The objective function is the conditional log likelihood of the data, so the algorithm seeks the probability distribution P(y|x) that is most likely given the data.

SLIDE 20

Any Questions?

End of Lecture 9

Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1/