

SLIDE 1

Machine Learning

Logistic Regression

1

SLIDE 2

Where are we?

We have seen the following ideas

– Linear models
– Learning as loss minimization
– Bayesian learning criteria (MAP and MLE estimation)
– The Naïve Bayes classifier

2

SLIDE 3

This lecture

  • Logistic regression
  • Connection to Naïve Bayes
  • Training a logistic regression classifier
  • Back to loss minimization

3


SLIDE 5

Logistic Regression: Setup

  • The setting

– Binary classification
– Inputs: Feature vectors x ∈ ℝd
– Labels: y ∈ {-1, +1}

  • Training data

– S = {(xi, yi)}, m examples

5

SLIDE 6

Classification, but…

The output y is discrete valued (-1 or +1). Instead of predicting the output, let us try to predict P(y = 1 | x). Expand the hypothesis space to functions whose output is in [0, 1].

  • Original problem: ℝd → {-1, +1}
  • Modified problem: ℝd → [0, 1]
  • Effectively make the problem a regression problem

Many hypothesis spaces possible

6


SLIDE 8

The Sigmoid function

The hypothesis space for logistic regression: all functions of the form σ(wTx). That is, a linear function composed with the sigmoid (logistic) function σ, where σ(z) = 1 / (1 + exp(-z)).

What is the domain and the range of the sigmoid function?

This is a reasonable choice. We will see why later.

8
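For concreteness, here is a minimal Python sketch of this hypothesis space (the helper names `sigmoid` and `predict_prob` are my own, not from the course); the comments also answer the domain/range question above:

```python
import numpy as np

def sigmoid(z):
    """The logistic function sigma(z) = 1 / (1 + exp(-z)).
    Domain: all real z. Range: the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_prob(w, x):
    """A logistic regression hypothesis: P(y = 1 | x; w) = sigma(w^T x)."""
    return sigmoid(np.dot(w, x))
```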


SLIDE 11

The Sigmoid function

[Figure: plot of the sigmoid function σ(z) as a function of z]

11

SLIDE 12

The Sigmoid function

12

What is its derivative with respect to z?
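The answer is not in the extracted text, so for completeness, the standard identity (easily verified from the definition of σ):

```latex
\frac{d\sigma}{dz}
  = \frac{d}{dz}\,\bigl(1 + e^{-z}\bigr)^{-1}
  = \frac{e^{-z}}{\bigl(1 + e^{-z}\bigr)^{2}}
  = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
```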


SLIDE 14

Predicting probabilities

According to the logistic regression model, we have P(y = 1 | x; w) = σ(wTx) = 1 / (1 + exp(-wTx)).

14



SLIDE 18

Predicting probabilities

According to the logistic regression model, we have P(y = 1 | x; w) = σ(wTx). Or equivalently, for a label y ∈ {-1, +1}: P(y | x; w) = σ(y wTx) = 1 / (1 + exp(-y wTx)).

18

Note that we are directly modeling P(y | x) rather than P(x | y) and P(y).
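A small sketch of these two equivalent forms (building on the hypothetical `sigmoid` helper above; y is assumed to be -1 or +1):

```python
def prob_of_label(w, x, y):
    """P(y | x; w) = sigma(y * w^T x) for y in {-1, +1}.
    For y = +1 this is sigma(w^T x); for y = -1 it equals 1 - sigma(w^T x)."""
    return sigmoid(y * np.dot(w, x))
```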


SLIDE 20

Predicting a label with logistic regression

  • Compute P(y =1 | x; w)
  • If this is greater than half, predict 1 else predict -1

– What does this correspond to in terms of wTx?
– Prediction = sgn(wTx) (see the sketch below)

20
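A quick sketch of the prediction rule (again using the hypothetical helpers above), showing why thresholding the probability at 1/2 is the same as taking the sign of wTx:

```python
def predict_label(w, x):
    """Predict +1 if P(y = 1 | x; w) > 0.5, else -1.
    Since sigma(w^T x) > 0.5 exactly when w^T x > 0, this is sgn(w^T x)."""
    return 1 if predict_prob(w, x) > 0.5 else -1
```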

SLIDE 21

This lecture

  • Logistic regression
  • Connection to Naïve Bayes
  • Training a logistic regression classifier
  • Back to loss minimization

21

SLIDE 22

Naïve Bayes and Logistic regression

Remember that the naïve Bayes decision is a linear function. Here, the P's represent the naïve Bayes posterior distribution, and w can be used to calculate the priors and the likelihoods. That is, P(y = 1 | w, x) is computed using P(x | y = 1, w) and P(y = 1 | w).

22

log [P(y = -1 | x, w) / P(y = +1 | x, w)] = -wTx

SLIDE 23

Naïve Bayes and Logistic regression

But we also know that P(y = +1 | x, w) = 1 - P(y = -1 | x, w).

23


SLIDE 24

Naïve Bayes and Logistic regression

Substituting in the above expression, we get

24

P(y = +1 | x, w) = σ(wTx) = 1 / (1 + exp(-wTx))
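Spelling out the substitution (my working; it simply rearranges the two displayed equations):

```latex
\log\frac{1 - P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = +1 \mid \mathbf{x}, \mathbf{w})} = -\mathbf{w}^{T}\mathbf{x}
\;\;\Rightarrow\;\;
\frac{1}{P(y = +1 \mid \mathbf{x}, \mathbf{w})} = 1 + e^{-\mathbf{w}^{T}\mathbf{x}}
\;\;\Rightarrow\;\;
P(y = +1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^{T}\mathbf{x})
```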

SLIDE 25

Naïve Bayes and Logistic regression


25


That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs. Naïve Bayes is a generative model; logistic regression is the discriminative version.

SLIDE 26

This lecture

  • Logistic regression
  • Connection to Naïve Bayes
  • Training a logistic regression classifier

– First: Maximum likelihood estimation
– Then: Adding priors → Maximum a Posteriori estimation

  • Back to loss minimization

26

SLIDE 27

Maximum likelihood estimation

Let’s get back to the problem of learning

  • Training data

– S = {(xi, yi)}, m examples

  • What we want

– Find a w such that P(S | w) is maximized
– We know that our examples are drawn independently and are identically distributed (i.i.d.)
– How do we proceed?

27

SLIDE 28

Maximum likelihood estimation

28

argmax_w P(S | w) = argmax_w ∏_{i=1}^{m} P(yi | xi, w)

The usual trick: Convert products to sums by taking the log. Recall that this works only because log is an increasing function, so the maximizer will not change.

SLIDE 29

Maximum likelihood estimation

29

Equivalent to solving

max_w Σ_{i=1}^{m} log P(yi | xi, w)

SLIDE 30

Maximum likelihood estimation

30

But (by definition) we know that

P(yi | xi, w) = σ(yi wTxi) = 1 / (1 + exp(-yi wTxi))

SLIDE 31

Maximum likelihood estimation

31

Equivalent to solving

max_w Σ_{i=1}^{m} -log(1 + exp(-yi wTxi))

SLIDE 32

Maximum likelihood estimation

32

The goal: Maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.

SLIDE 33

Maximum likelihood estimation

33

Equivalent to: Training a linear classifier by minimizing the logistic loss.
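As a sketch (not the course's code), the objective on SLIDE 31 can be written down directly; `X` is an m-by-d matrix of feature vectors and `y` a vector of labels in {-1, +1}, both illustrative names:

```python
def log_likelihood(w, X, y):
    """Sum over examples of log P(y_i | x_i, w) = -log(1 + exp(-y_i * w^T x_i)).
    Maximizing this is the same as minimizing the summed logistic loss."""
    margins = y * (X @ w)                       # y_i * w^T x_i for every example
    return -np.sum(np.log1p(np.exp(-margins)))
```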

SLIDE 34

Maximum a posteriori estimation

34

We could also add a prior on the weights. Suppose each weight in the weight vector is drawn independently from the normal distribution with zero mean and standard deviation σ:

p(w) = ∏_{j=1}^{d} p(wj) = ∏_{j=1}^{d} (1 / (σ √(2π))) exp(-wj² / σ²)

SLIDE 35

MAP estimation for logistic regression

35

Let us work through this procedure again to see what changes.

SLIDE 36

MAP estimation for logistic regression

36

What is the goal of MAP estimation? (In maximum likelihood, we maximized the likelihood of the data.)

SLIDE 37

MAP estimation for logistic regression

37

To maximize the posterior probability of the model given the data (i.e., to find the most probable model, given the data):

P(w | S) ∝ P(S | w) P(w)

SLIDE 38

MAP estimation for logistic regression

38

Learning by solving

argmax_w P(w | S) = argmax_w P(S | w) P(w)

SLIDE 39

MAP estimation for logistic regression

39

Take log to simplify:

max_w [ log P(S | w) + log P(w) ]

SLIDE 40

MAP estimation for logistic regression

40

We have already expanded out the first term:

log P(S | w) = Σ_{i=1}^{m} -log(1 + exp(-yi wTxi))

SLIDE 41

MAP estimation for logistic regression

41

Expand the log prior:

log p(w) = Σ_{j=1}^{d} -wj² / σ² + constants

SLIDE 42

MAP estimation for logistic regression

42

max_w Σ_{i=1}^{m} -log(1 + exp(-yi wTxi)) + Σ_{j=1}^{d} -wj² / σ² + constants

SLIDE 43

MAP estimation for logistic regression

43

max_w Σ_{i=1}^{m} -log(1 + exp(-yi wTxi)) - (1/σ²) wTw

SLIDE 44

MAP estimation for logistic regression

44

max_w Σ_{i=1}^{m} -log(1 + exp(-yi wTxi)) - (1/σ²) wTw

Maximizing a negative function is the same as minimizing the function.

SLIDE 45

Learning a logistic regression classifier

45

Learning a logistic regression classifier is equivalent to solving

min_w Σ_{i=1}^{m} log(1 + exp(-yi wTxi)) + (1/σ²) wTw

SLIDE 46

Learning a logistic regression classifier

46

Where have we seen this before?

SLIDE 47

Learning a logistic regression classifier

47

The first question in the homework: Write down the stochastic gradient descent algorithm for this (a sketch follows below). Historically, other training algorithms also exist; in particular, you might run into LBFGS.
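A minimal stochastic gradient descent sketch for this objective; this is my own illustration under simple assumptions (fixed learning rate, the regularizer's gradient split evenly across the m examples, `sigmoid` as defined earlier), not the homework solution:

```python
def sgd_logistic_regression(X, y, sigma2=1.0, lr=0.1, epochs=100, seed=0):
    """SGD for: min_w  sum_i log(1 + exp(-y_i w^T x_i)) + (1/sigma2) * w^T w."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * (X[i] @ w)
            grad_loss = -y[i] * X[i] * (1.0 - sigmoid(margin))  # gradient of log(1 + exp(-margin))
            grad_reg = (2.0 / sigma2) * w / m                    # per-example share of the l2 term's gradient
            w -= lr * (grad_loss + grad_reg)
    return w
```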

SLIDE 48

Logistic regression is…

  • A classifier that predicts the probability that the label is +1 for a particular input
  • The discriminative counterpart of the naïve Bayes classifier
  • A discriminative classifier that can be trained via MAP or MLE estimation
  • A discriminative classifier that minimizes the logistic loss over the training set

48

SLIDE 49

This lecture

  • Logistic regression
  • Connection to Naïve Bayes
  • Training a logistic regression classifier
  • Back to loss minimization

49

SLIDE 50

Learning as loss minimization

  • The setup

– Examples x drawn from a fixed, unknown distribution D
– Hidden oracle classifier f labels examples
– We wish to find a hypothesis h that mimics f

  • The ideal situation

– Define a function L that penalizes bad hypotheses
– Learning: Pick a function h ∈ H to minimize expected loss (written out below)

  • Instead, minimize empirical loss on the training set

50

But distribution D is unknown
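Written out (my notation, since the slide's formulas are not in the extracted text), the two criteria referenced above are:

```latex
\text{Expected loss:}\quad \min_{h \in H}\; \mathbb{E}_{x \sim D}\bigl[\,L\bigl(h(x), f(x)\bigr)\,\bigr]
\qquad\qquad
\text{Empirical loss:}\quad \min_{h \in H}\; \frac{1}{m}\sum_{i=1}^{m} L\bigl(h(x_i), y_i\bigr)
```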

SLIDE 51

Empirical loss minimization

Learning = minimize empirical loss on the training set

51

Is there a problem here?

SLIDE 52

Empirical loss minimization

Learning = minimize empirical loss on the training set

We need something that biases the learner towards simpler hypotheses

  • Achieved using a regularizer, which penalizes complex hypotheses

52

Is there a problem here?

Overfitting!

SLIDE 53

Regularized loss minimization

  • Learning: minimize regularized empirical loss (sketched below)
  • With linear classifiers: the loss depends on w through yi wTxi (sketched below)
  • What is a loss function?

– Loss functions should penalize mistakes
– We are minimizing average loss over the training data

  • What is the ideal loss function for classification?

53

(using l2 regularization)
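A hedged sketch of the objectives referenced above (my notation; λ is a regularization weight, and the l2 regularizer wTw matches the one derived on the MAP slides):

```latex
\min_{h \in H}\; \frac{1}{m}\sum_{i=1}^{m} L\bigl(h(x_i), y_i\bigr) + \lambda\,\mathrm{reg}(h)
\qquad\text{with linear classifiers:}\qquad
\min_{\mathbf{w}}\; \frac{1}{m}\sum_{i=1}^{m} L\bigl(y_i\,\mathbf{w}^{T}\mathbf{x}_i\bigr) + \lambda\,\mathbf{w}^{T}\mathbf{w}
```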

SLIDE 54

The 0-1 loss

Penalize classification mistakes between true label y and prediction y’

  • For linear classifiers, the prediction y’ = sgn(wTx)

– Mistake if y wTx ≤ 0

Minimizing 0-1 loss is intractable. Need surrogates

54

SLIDE 55

The loss function zoo

Many loss functions exist

– Perceptron loss
– Hinge loss (SVM)
– Exponential loss (AdaBoost)
– Logistic loss (logistic regression)

(These are compared as functions of the margin y wTx in the sketch below.)

55
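A small sketch (mine, not from the slides) of these surrogates, each written as a function of the margin z = y wTx, with the 0-1 loss included for reference:

```python
def zero_one_loss(z):
    """1 if the margin z = y * w^T x is non-positive (a mistake), else 0."""
    return np.where(z <= 0, 1.0, 0.0)

def perceptron_loss(z):
    return np.maximum(0.0, -z)           # penalizes only misclassified points, linearly

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)      # SVM: also penalizes correct but low-margin points

def exponential_loss(z):
    return np.exp(-z)                    # AdaBoost

def logistic_loss(z):
    return np.log1p(np.exp(-z))          # logistic regression
```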

SLIDES 56-63

The loss function zoo

[Figure: the zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic regression losses plotted together, shown progressively and then zoomed out, and zoomed out even more.]