Lecture #11: Logistic Regression - Part II (Data Science 1: CS 109A, STAT 121A, AC 209A, E-109A)


slide-1
SLIDE 1

Lecture #11: Logistic Regression - Part II

Data Science 1 CS 109A, STAT 121A, AC 209A, E-109A Pavlos Protopapas Kevin Rader Margo Levine Rahul Dave

slide-2
SLIDE 2

Lecture Outline

Logistic Regression: a Brief Review Classification Boundaries Regularization in Logistic Regression Multinomial Logistic Regression Bayes Theorem and Misclassification Rates ROC Curves

2

slide-3
SLIDE 3

Logistic Regression: a Brief Review

3


slide-6
SLIDE 6

Multiple Logistic Regression

Earlier we saw the general form of simple logistic regression, meaning when there is just one predictor used in the model. What was the model statement (in terms of linear predictors)?

log( P(Y = 1) / (1 − P(Y = 1)) ) = β0 + β1X

Multiple logistic regression is a generalization to multiple predictors. More specifically, we can define a multiple logistic regression model to predict P(Y = 1) as such:

log( P(Y = 1) / (1 − P(Y = 1)) ) = β0 + β1X1 + β2X2 + ... + βpXp

where there are p predictors: X = (X1, X2, ..., Xp).

Note: statisticians are often lazy and use the notation log to mean ln (the text does this). We will write log10 if that is what we mean.

4
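As an illustration of fitting such a model (the data below are simulated for this sketch, not the course's dataset), sklearn can fit a multiple logistic regression; a very large C effectively turns off its default penalty so the fit is close to plain maximum likelihood:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulate two predictors and a binary response from a known logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
log_odds = -1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]   # beta0, beta1, beta2
p = 1 / (1 + np.exp(-log_odds))
y = rng.binomial(1, p)

# Large C ~ no regularization; coefficients estimate beta0 and (beta1, beta2).
model = LogisticRegression(C=1e9).fit(X, y)
print(model.intercept_, model.coef_)
```

The fitted coefficients are on the log-odds scale, matching the model statement above.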

slide-7
SLIDE 7

Interpreting Multiple Logistic Regression: an Example

Let’s get back to the NFL data. We are attempting to predict whether a play results in a TD based on location (yard line) and whether the play was a pass. The simultaneous effect of these two predictors can be brought into one model. Recall from earlier we had the following estimated models:

log( P(Y = 1) / (1 − P(Y = 1)) ) = −7.425 + 0.0626 · Xyard

log( P(Y = 1) / (1 − P(Y = 1)) ) = −4.061 + 1.106 · Xpass

The results for the multiple logistic regression model are on the next slide.

5
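As a quick check of interpretation, exponentiating a logistic regression coefficient gives an odds ratio. Using the slide's estimates (the 10-yard-line comparison is our own illustration, not from the slides):

```python
import math

# Coefficients from the two single-predictor models on the slide.
beta_yard = 0.0626   # change in log-odds of a TD per yard line
beta_pass = 1.106    # pass vs. non-pass change in log-odds of a TD

# Odds ratio for passes vs. non-passes.
print(math.exp(beta_pass))        # ~3.02: odds of a TD about 3x higher on passes

# Odds ratio for a 10-yard-line difference in field position.
print(math.exp(10 * beta_yard))
```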

slide-8
SLIDE 8

Interpreting Multiple Logistic Regression: an Example

6

slide-9
SLIDE 9

Some questions

  • 1. Write down the complete model. Break this down into the model to predict log-odds of a touchdown based on the yard line for passes and the same model for non-passes. How is this different from the previous model (without interaction)?
  • 2. Estimate the odds ratio of a TD comparing passes to non-passes.
  • 3. Is there any evidence of multicollinearity in this model?
  • 4. Is there any confounding in this problem?

7

slide-10
SLIDE 10

Interactions in Multiple Logistic Regression

Just like in linear regression, interaction terms can be considered in logistic regression. An interaction term is incorporated into the model the same way, and the interpretation is very similar (on the log-odds scale of the response, of course). Write down the model for the NFL data with the two predictors plus the interaction term.

8

slide-11
SLIDE 11

Interpreting Multiple Logistic Regression with Interaction: an Example

9

slide-12
SLIDE 12

Some questions

  • 1. Write down the complete model. Break this down into the model to predict log-odds of a touchdown based on the yard line for passes and the same model for non-passes. How is this different from the previous model (without interaction)?
  • 2. Use this model to estimate the probability of a touchdown for a pass at the 20 yard line. Do the same for a run at the 20 yard line.
  • 3. Use this model to estimate the probability of a touchdown for a pass at the 99 yard line. Do the same for a run at the 99 yard line.
  • 4. Is this a stronger model than the previous one? How would we check?

10

slide-13
SLIDE 13

Classification Boundaries

11

slide-14
SLIDE 14

Classification

Recall that we could attempt to purely classify each observation based on whether the estimated P(Y = 1) from the model was greater than 0.5. When dealing with ‘well-separated’ data, logistic regression can work well in performing classification. We saw a 2-D plot last time which had two predictors, X1 and X2, and depicted the classes as different colors. A similar one is shown on the next slide.

12

slide-15
SLIDE 15

2D Classification in Logistic Regression: an Example

13


slide-18
SLIDE 18

2D Classification in Logistic Regression: an Example

Would a logistic regression model perform well in classifying the observations in this example? What would be a good logistic regression model to classify these points? Based on these predictors, two separate logistic regression models were considered, based on polynomials of different order in X1 and X2 and their interactions. The ‘circles’ represent the boundary for classification. How can the classification boundary be calculated for a logistic regression?

14

slide-19
SLIDE 19

2D Classification in Logistic Regression: an Example

In the previous plot, which classification boundary performs better? How can you tell? How would you make this determination in an actual data example? We could determine the misclassification rates in left-out validation or test set(s).

15


slide-21
SLIDE 21

Regularization in Logistic Regression

16


slide-24
SLIDE 24

Regularization in Linear Regression

Based on the likelihood framework, a loss function can be determined from the likelihood function. We saw in linear regression that maximizing the log-likelihood is equivalent to minimizing the sum of squared errors:

arg min ∑_{i=1}^n (yi − ŷi)² = arg min ∑_{i=1}^n ( yi − (β0 + β1x1i + ... + βpxpi) )²

And a regularization approach adds a penalty factor to this equation, which for Ridge Regression becomes:

arg min [ ∑_{i=1}^n ( yi − (β0 + ∑_{j=1}^p βj xji) )² + λ ∑_{j=1}^p βj² ]

This penalty shrinks the estimates towards zero, and has the analogue of using a Normal prior in the Bayesian paradigm.

17
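The ridge-penalized loss can be checked numerically. A minimal sketch (toy simulated data; `ridge_loss` is an illustrative helper, not course code) that leaves the intercept unpenalized:

```python
import numpy as np

# Toy data from a known linear model.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

def ridge_loss(beta0, beta, lam):
    # Sum of squared errors plus lambda times the sum of squared slopes;
    # the intercept beta0 is not penalized.
    resid = y - (beta0 + X @ beta)
    return np.sum(resid**2) + lam * np.sum(beta**2)

# Larger lambda penalizes large coefficients more heavily.
print(ridge_loss(0.0, np.array([1.0, -2.0, 0.5]), lam=0.0))
print(ridge_loss(0.0, np.array([1.0, -2.0, 0.5]), lam=10.0))
```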


slide-26
SLIDE 26

Loss function in Logistic Regression

A similar approach can be used in logistic regression. Here, maximizing the log-likelihood is equivalent to minimizing the following loss function:

arg min [ − ∑_{i=1}^n ( yi log(p̂i) + (1 − yi) log(1 − p̂i) ) ]

where p̂i = exp(β0 + ∑_{j=1}^p βj xji) / ( 1 + exp(β0 + ∑_{j=1}^p βj xji) ).

Why is this a good loss function to minimize? Where does this come from? The log-likelihood for independent Yi ∼ Bern(pi).

18
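This loss is easy to compute directly. A small sketch with made-up responses and fitted probabilities (illustrative values only):

```python
import numpy as np

def neg_log_likelihood(y, p_hat):
    # The logistic loss: negative Bernoulli log-likelihood at fitted probabilities.
    return -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.8, 0.6, 0.1])
print(neg_log_likelihood(y, p_hat))   # ~1.168; lower is a better fit
```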

slide-27
SLIDE 27

Regularization in Logistic Regression

A penalty factor can then be added to this loss function and results in a new loss function that penalizes large values of the parameters:

arg min [ − ∑_{i=1}^n ( yi log(p̂i) + (1 − yi) log(1 − p̂i) ) + λ ∑_{j=1}^p βj² ]

The result is just like in linear regression: shrinkage of the parameter estimates towards zero. In practice, the intercept is usually not part of the penalty factor, and is thus not shrunk towards zero. Note: the sklearn package uses a different tuning parameter: instead of λ it uses a constant that is essentially C = 1/λ.

19
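To see the sklearn C = 1/λ convention in action, compare a weakly and a strongly penalized fit on simulated data (a sketch, not course code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated data: only the first two of five predictors matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
p = 1 / (1 + np.exp(-(X @ np.array([3.0, -3.0, 0.0, 0.0, 0.0]))))
y = rng.binomial(1, p)

# Small C means large lambda, i.e. strong shrinkage towards zero.
weak = LogisticRegression(C=100.0).fit(X, y)
strong = LogisticRegression(C=0.01).fit(X, y)
print(np.abs(weak.coef_).sum(), np.abs(strong.coef_).sum())
```

The strongly penalized fit has noticeably smaller coefficients, just as the penalty term predicts.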

slide-28
SLIDE 28

Regularization in Logistic Regression: an Example

Let’s see how this plays out in an example in logistic regression.

20

slide-29
SLIDE 29

Regularization in Logistic Regression: an Example

21

slide-30
SLIDE 30

Regularization in Logistic Regression: an Example

22


slide-32
SLIDE 32

Regularization in Logistic Regression: an Example

Just like in linear regression, the shrinkage factor must be chosen. How should we go about doing this?

Through building multiple training and test sets (through k-fold or random subsets), we can select the best shrinkage factor to mimic out-of-sample prediction. How could we measure how well each model fits the test set? We could measure this based on the proposed loss function!

23

slide-33
SLIDE 33

Multinomial Logistic Regression

24

slide-34
SLIDE 34

Logistic Regression for predicting more than 2 Classes

There are several extensions to standard logistic regression when the response variable Y has more than 2 categories. The two most common are :

  • 1. ordinal logistic regression
  • 2. multinomial logistic regression.

Ordinal logistic regression is used when the categories have a specific hierarchy (like class year: Freshman, Sophomore, Junior, Senior; or a 7-point rating scale from strongly disagree to strongly agree). Multinomial logistic regression is used when the categories have no inherent order (like eye color: blue, green, brown, hazel, etc.).

25


slide-36
SLIDE 36

Multinomial Logistic Regression

This is the most common approach to estimating a nominal (not ordinal) categorical variable that has more than 2 classes. The first approach sets one of the categories in the response variable as the reference group, and then fits separate logistic regression models to predict the other cases based off of the reference group. For example, we could attempt to predict a student’s concentration:

y = 1 if Computer Science (CS), 2 if Statistics, 3 otherwise

from predictors x1 = number of psets per week and x2 = time spent in Lamont Library.

26

slide-37
SLIDE 37

Multinomial Logistic Regression (cont.)

We could select the y = 3 case as the reference group (other concentration), and then fit two separate models: a model to predict y = 1 (CS) from y = 3 (others) and a separate model to predict y = 2 (Stat) from y = 3 (others). Ignoring interactions, how many parameters would need to be estimated? How could these models be used to estimate the probability of an individual falling in each concentration?

27

slide-38
SLIDE 38

One vs. Rest (ovr) Logistic Regression (cont.)

The default multiclass logistic regression model is called the ’One vs. Rest’ approach. If there are 3 classes, then 3 separate logistic regressions are fit, where the probability of each category is predicted over the rest of the categories combined. So for the concentration example, 3 models would be fit:

  • 1. a first model would be fit to predict CS from (Stat and Others) combined
  • 2. a second model would be fit to predict Stat from (CS and Others) combined
  • 3. a third model would be fit to predict Others from (CS and Stat) combined

An example to predict play call from the NFL data follows...

28

slide-39
SLIDE 39

OVR Logistic Regression in Python

29
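The Python code from the original slide is not preserved in this transcript. As a stand-in, here is a minimal sketch of the one-vs-rest approach with sklearn, on simulated data (not the NFL play-call data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three classes; OvR fits one binary logistic regression per class.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

print(len(ovr.estimators_))        # 3 fitted binary models, one per class
print(ovr.predict_proba(X[:2]))    # per-class probability estimates, shape (2, 3)
```

Prediction picks the class whose binary model gives the largest estimated probability.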

slide-40
SLIDE 40

Classification for more than 2 Categories

When there are more than 2 categories in the response variable, there is no guarantee that P(Y = k) ≥ 0.5 for any one category. So any classifier based on logistic regression will instead have to select the group with the largest estimated probability. The classification boundaries are then much more difficult to determine. We will not get into the algorithm for drawing these in this class.

30

slide-41
SLIDE 41

Bayes Theorem and Misclassification Rates

31

slide-42
SLIDE 42

Bayes’ Theorem

We defined conditional probability as:

P(B|A) = P(B ∩ A) / P(A)

And using the fact that P(B ∩ A) = P(A|B)P(B), we get Bayes’ Theorem:

P(B|A) = P(A|B)P(B) / P(A)

Another version of Bayes’ Theorem is found by substituting the Law of Total Probability (LOTP) into the denominator:

P(B|A) = P(A|B)P(B) / [ P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ) ]

Where have we seen Bayes’ Theorem before? Why do we care?

32

slide-43
SLIDE 43

Diagnostic Testing

In the diagnostic testing paradigm, one cares about whether the result of a test (like a classification test) matches the truth (the true class that an observation belongs to). The simplest version of this is trying to detect disease (D+ vs. D−) based on a diagnostic test (T+ vs. T−).

Medical examples of this include various screening tests: breast cancer screening through (i) self-examination and (ii) mammography, prostate cancer screening through (iii) PSA tests, and colorectal cancer screening through (iv) colonoscopies. These tests are a little controversial because of the poor predictive probability of the tests.

33

slide-44
SLIDE 44

Diagnostic Testing (cont.)

Bayes’ theorem can be rewritten for diagnostic tests:

P(D+|T+) = P(T+|D+)P(D+) / [ P(T+|D+)P(D+) + P(T+|D−)P(D−) ]

These probability quantities can then be defined as:

▶ Sensitivity: P(T+|D+)
▶ Specificity: P(T−|D−)
▶ Prevalence: P(D+)
▶ Positive Predictive Value: P(D+|T+)
▶ Negative Predictive Value: P(D−|T−)

How do positive and negative predictive values relate? Be careful...

34

slide-45
SLIDE 45

Diagnostic Testing (cont.)

We mentioned that these tests are a little controversial because of their poor predictive probability. When will these tests have poor positive predictive probability? When the disease is not very prevalent, the number of ‘false positives’ will overwhelm the number of true positives. For example, PSA screening for prostate cancer has a sensitivity of about 90% and a specificity of about 97% for some age groups (men in their fifties), but prevalence is about 0.1%. What is the positive predictive probability for this diagnostic test?

35
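Plugging the slide's PSA numbers into Bayes' theorem makes the point concrete:

```python
# PPV via Bayes' theorem for the PSA screening numbers on the slide.
sensitivity = 0.90   # P(T+ | D+)
specificity = 0.97   # P(T- | D-)
prevalence = 0.001   # P(D+)

ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
print(round(ppv, 4))   # ~0.0292: under 3% of positive tests are true positives
```

Despite high sensitivity and specificity, the rare disease means false positives dominate.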



slide-48
SLIDE 48

Why do we care?

As data scientists, why do we care about diagnostic testing from the medical world? (Hint: it’s not just because Kevin is a trained biostatistician!) Because classification can be thought of as a diagnostic test. Let Yi = k be the event that observation i truly belongs to category k, and let Ŷi = k be the event that we predict it to be in class k. Then Bayes’ rule states that our Positive Predictive Value for classification is:

P(Yi = k | Ŷi = k) = P(Ŷi = k | Yi = k)P(Yi = k) / [ P(Ŷi = k | Yi = k)P(Yi = k) + P(Ŷi = k | Yi ≠ k)P(Yi ≠ k) ]

Thus the probability of a predicted outcome truly being in a specific group depends on what? The proportion of observations in that class!

36


slide-50
SLIDE 50

Error in Classification

There are 2 major types of error in classification problems based on a binary outcome. They are:

▶ False positives: incorrectly predicting Ŷ = 1 when truly Y = 0.
▶ False negatives: incorrectly predicting Ŷ = 0 when truly Y = 1.

The results of a classification algorithm are often summarized in two ways: a confusion table, sometimes called a contingency table or a 2x2 table (more generally a kxk table), and a receiver operating characteristic (ROC) curve.

37

slide-51
SLIDE 51

Confusion table

When a classification algorithm (like logistic regression) is used, the results can be summarized in a kxk table as such:

                         True Republican Status
                         Yes        No
Predicted    Yes         487        288
Republican   No          218        314

The table above was a classification based on a logistic regression model to predict political party (Dem. vs. Rep.) based on 3 predictors: X1 = whether the respondent believes abortion should be legal, X2 = income (logged), and X3 = years of education. What are the false positive and false negative rates for this classifier?

38
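Using the counts from the confusion table above, the two error rates can be computed directly (a quick sketch, treating ‘Republican’ as the positive class):

```python
# Counts from the confusion table on the slide.
tp, fp = 487, 288   # predicted Republican: truly yes / truly no
fn, tn = 218, 314   # predicted not Republican: truly yes / truly no

fpr = fp / (fp + tn)   # false positive rate = 288 / 602
fnr = fn / (fn + tp)   # false negative rate = 218 / 705
print(round(fpr, 3), round(fnr, 3))   # 0.478 0.309
```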


slide-54
SLIDE 54

Bayes’ Classifier Choice

A classifier’s error rates can be tuned to modify this table. How? The choice of the Bayes’ classifier level will modify the characteristics of this table. If we thought it was more important to predict Republicans correctly (lower false positive rate), what could we do for our Bayes’ classifier level? We could instead classify Y = 1 based on whether P̂(Y = 1) ≥ π, and we could choose π to be some level other than 0.5. Let’s see what the table looks like if π were 0.28 or 0.52 instead (why such strange numbers?).

39
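A sketch of how a threshold π reclassifies the same fitted probabilities (the probabilities below are made up for illustration, not from the party model):

```python
import numpy as np

# Hypothetical fitted probabilities from a logistic regression.
p_hat = np.array([0.15, 0.30, 0.45, 0.55, 0.70, 0.90])

# Lowering pi classifies more observations as positive; raising it, fewer.
for pi in (0.28, 0.50, 0.52):
    y_pred = (p_hat >= pi).astype(int)
    print(pi, int(y_pred.sum()), "predicted positive")
```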

slide-56
SLIDE 56

Other Confusion table

Based on π = 0.28:

                         True Republican Status
                         Yes        No
Predicted    Yes         247        528
Republican   No          80         452

What has improved? What has worsened? Based on π = 0.52:

                         True Republican Status
                         Yes        No
Predicted    Yes         627        148
Republican   No          388        144

Which should we choose? Why?

40

slide-57
SLIDE 57

ROC Curves

41

slide-58
SLIDE 58

ROC Curves

The ROC curve illustrates the trade-off, over all possible thresholds, between the two types of error (or correct classification). The vertical axis displays the true positive rate (sensitivity) and the horizontal axis the false positive rate (1 − specificity). What is the shape of an ideal ROC curve? See the next slide for an example.

42


slide-60
SLIDE 60

ROC Curve Example

43

slide-61
SLIDE 61

ROC Curve for measuring classifier performance

The overall performance of a classifier, calculated over all possible thresholds, is given by the area under the ROC curve (’AUC’). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier. What is the worst-case scenario for AUC? What is the best case? What is the AUC if we independently just flip a coin to perform classification? This AUC can then be used to compare various approaches to classification: logistic regression, LDA (to come), kNN, etc...

44
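A minimal sketch (simulated data, in-sample for brevity) of computing an ROC curve and AUC with sklearn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Simulate a binary response with real signal from two predictors.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
p = 1 / (1 + np.exp(-(2 * X[:, 0] - X[:, 1])))
y = rng.binomial(1, p)

# Fitted probabilities, then the ROC points over all thresholds.
p_hat = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, p_hat)
auc = roc_auc_score(y, p_hat)
print(auc)   # AUC: 0.5 is a coin flip, 1.0 is a perfect classifier
```

In practice, the AUC should be computed on held-out data to fairly compare classifiers.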

slide-62
SLIDE 62

ROC Curve for measuring classifier preformance

The overall performance of a classifier, calculated over all possible thresholds, is given by the area under the ROC curve (’AUC’). An ideal ROC curve will hug the top left corner, so the larger the AUC the better the classifier. What is the worst case scenario for AUC? What is the best case? What is AUC if we independently just flip a coin to perform classification? This AUC then can be use to compare various approaches to classification: Logistic regression, LDA (to come), kNN, etc...

44