SLIDE 1

ST 430/514 Introduction to Regression Analysis/Statistics for Management and the Social Sciences II

Special Topics

Some complex model-building problems can be handled using the linear regression approach covered up to this point: for example, piecewise regression, including piecewise linear regression and spline regression. Others require more general nonlinear approaches: for example, logistic and probit regression for binary responses.

1 / 15 Special Topics Introduction

SLIDE 2

Logistic Regression

Linear regression methods are used to evaluate the impact of various factors on a response. When the response Y is binary (0 or 1), linear methods have problems. Because E(Y |x) = P(Y = 1|x), the linear regression model E(Y |x) = β0 + β1x will often predict probabilities that are either negative or greater than 1.


SLIDE 3

The most common alternative is based on modeling the log odds. Write

π(x) = P(Y = 1|x),

so that

π(x) / (1 − π(x)) = the odds,

and

log [π(x) / (1 − π(x))] = the log odds ratio, or just the log odds.

In the logistic regression model, we assume

log [π(x) / (1 − π(x))] = β0 + β1x1 + · · · + βkxk.


SLIDE 4

Solving for π(x), we find

P(Y = 1|x) = π(x) = exp(β0 + β1x1 + · · · + βkxk) / [1 + exp(β0 + β1x1 + · · · + βkxk)].

Consequently

P(Y = 0|x) = 1 − π(x) = 1 / [1 + exp(β0 + β1x1 + · · · + βkxk)].

As a function of any xj, π(x) changes smoothly from 0 to 1. It is increasing if βj > 0, and decreasing if βj < 0.
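As a quick sketch (not from the slides), these properties are easy to check numerically in R; `plogis()` is R's built-in logistic cdf and agrees with the exp/(1 + exp) expression:

```r
# Sketch (not from the slides): pi(x) as a function of the linear
# predictor eta = beta0 + beta1*x1 + ... + betak*xk
pi_of <- function(eta) exp(eta) / (1 + exp(eta))

eta <- seq(-5, 5, by = 0.5)
p <- pi_of(eta)

stopifnot(all(p > 0 & p < 1))         # probabilities never leave (0, 1)
stopifnot(all(diff(p) > 0))           # increasing in eta (the beta_j > 0 case)
stopifnot(all.equal(p, plogis(eta)))  # matches R's built-in logistic cdf
```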


SLIDE 5

The function

F(x) = exp(x) / (1 + exp(x))

is the cdf of the logistic distribution:

curve(exp(x)/(1 + exp(x)), from = -5, to = 5)

It is similar to the cdf of the normal distribution with the matching variance (π^2/3):

curve(pnorm(x, 0, sqrt(pi^2/3)), add = TRUE, col = "red")


SLIDE 6

Interpreting the parameters

The coefficient βj measures the change in the log odds associated with a change of +1 in xj, so e^βj is the proportional change in the odds (an odds ratio) associated with the same change. When xj is an indicator variable, e^βj is often interpreted as the relative risk that Y = 1 for the group where xj = 1, relative to the group where xj = 0; strictly it is an odds ratio, which approximates the relative risk only when Y = 1 is rare.
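A small simulated illustration (a sketch; the data and the coefficient value 0.8 here are invented, not the slides' example):

```r
# exp(beta_j) as an odds ratio, checked on simulated data
set.seed(1)
x <- rbinom(500, 1, 0.5)                   # indicator predictor
y <- rbinom(500, 1, plogis(-1 + 0.8 * x))  # true log odds: -1 + 0.8 x
fit <- glm(y ~ x, family = binomial)
exp(coef(fit)["x"])  # estimated odds for x = 1 relative to x = 0
```

Here exp(0.8) ≈ 2.2 is the true odds ratio, and the estimate should land near it.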


SLIDE 7

Example: fraud detection

Data are credit card transactions. The response is Y, where

Y = 1 if the transaction is fraudulent, and Y = 0 otherwise.

The predictors are information about the card holder (credit limit, etc.) and about the transaction (amount, etc.). The fitted π̂(x) can be used to predict the probability that a new transaction will prove to be fraudulent.


SLIDE 8

Estimation The usual approach to estimating β0, β1, . . . , βk is by maximum likelihood. It is implemented in proc logistic and proc genmod in SAS, and in the glm() function in R. The names “genmod” and “glm” are abbreviations of generalized linear model, of which logistic regression is a particular case.


SLIDE 9

Example: collusive bidding in Florida road construction.

bids <- read.table("Text/Exercises&Examples/ROADBIDS.txt", header = TRUE)
pairs(bids)

Using glm() is very similar to using lm():

g <- glm(STATUS ~ NUMBIDS + DOTEST, bids, family = binomial)
summary(g)

The argument family = binomial specifies that the response, STATUS, has the binomial (strictly, the Bernoulli) distribution.


SLIDE 10

The output is also similar to that of lm(). Note that instead of a column of t-values, there is a column of z-values. Like a t-value, a z-value is the ratio of a parameter estimate to its standard error. The label indicates that you test the significance of the parameter using the normal distribution, not the t-distribution.
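For instance (a hypothetical z-value, since the actual summary(g) output is not reproduced on the slide), the p-value comes from the standard normal distribution:

```r
# p-value for a glm coefficient: two-sided normal tail probability
z <- 2.17          # hypothetical z-value = estimate / standard error
2 * pnorm(-abs(z)) # about 0.03
```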


SLIDE 11

Because this is not a least squares fit, there are no sums of squares; deviance plays a similar role. For example, to test the utility of the model, use the statistic

Null deviance − Residual deviance = 21.756,

which, under H0 : β1 = β2 = 0, is χ²-distributed with 30 − 28 = 2 degrees of freedom. P(χ²₂ ≥ 21.756) = 1.9 × 10⁻⁵, so we reject H0.
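The tail probability quoted above can be computed directly in R:

```r
# Model-utility test: difference in deviances against chi-squared with 2 df
stat <- 21.756                           # null deviance - residual deviance
pchisq(stat, df = 2, lower.tail = FALSE) # about 1.9e-05
```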


SLIDE 12

You also use deviance to compare nested models, such as the first-order model

log [π(x) / (1 − π(x))] = β0 + β1x1 + β2x2

against the complete second-order model

log [π(x) / (1 − π(x))] = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2².

summary(glm(STATUS ~ NUMBIDS * DOTEST + I(NUMBIDS^2) + I(DOTEST^2),
            bids, family = binomial))


SLIDE 13

To test H0 : β3 = β4 = β5 = 0, the test statistic is

(Residual deviance for reduced model) − (Residual deviance for complete model).

Under H0, this statistic has the χ²-distribution with 28 − 25 = 3 degrees of freedom. Here we have 22.843 − 13.820 = 9.023, which we compare with the χ²₃-distribution. We find P(χ²₃ ≥ 9.023) = .029, so we would reject H0 at α = .05 but not at α = .01. That is, there is some evidence that we need the second-order terms.
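This tail probability can also be computed directly; with the two fitted model objects in hand, anova(reduced, complete, test = "Chisq") reports the same comparison:

```r
# Nested-model deviance test from the slide, df = 28 - 25 = 3
stat <- 22.843 - 13.820                  # = 9.023
pchisq(stat, df = 3, lower.tail = FALSE) # about 0.029
```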


SLIDE 14

Prediction Suppose that a new auction has 4 bidders, and the difference between the winning bid and the engineer’s estimate is 30%. What is the probability that the auction was collusive?

predict(g, data.frame(NUMBIDS = 4, DOTEST = 30),
        type = "response", se.fit = TRUE)

The predicted probability is .85, but the standard error of .13 shows that it is not estimated very precisely. If you do not specify type = "response", the prediction is on the scale of the log odds, not the probability itself.


SLIDE 15

Do not use the standard error of the predicted probability to construct a confidence interval! You can use a confidence interval for the log odds to construct a corresponding confidence interval for the probability:

p <- predict(g, data.frame(NUMBIDS = 4, DOTEST = 30), se.fit = TRUE)
logOdds <- p$fit + qnorm(c(.025, .5, .975)) * p$se.fit
exp(logOdds) / (1 + exp(logOdds))
