S01 - Logistic Regression, STAT 401 (Engineering), Iowa State University
SLIDE 1

S01 - Logistic Regression

STAT 401 (Engineering) - Iowa State University

April 23, 2018

SLIDE 2

Linear regression

The linear regression model is
$$Y_i \overset{ind}{\sim} N(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + \beta_1 X_{i,1} + \cdots + \beta_p X_{i,p},$$
where $Y_i$ is continuous and the $X_i$ are continuous or categorical (indicator variables).

What if $Y_i$ is binary or a count (of the number of successes out of some total)?

SLIDE 3

Logistic regression

Logistic regression

Let $Y_i = 1$ if observation $i$ is a "success" and $Y_i = 0$ otherwise, and let $X_i$ be an explanatory variable that affects the probability of success $\theta_i$ for observation $i$. Then a logistic regression model is
$$Y_i \overset{ind}{\sim} \mathrm{Ber}(\theta_i) \quad \text{and} \quad \mathrm{logit}(\theta_i) = \log\left(\frac{\theta_i}{1-\theta_i}\right) = \beta_0 + \beta_1 X_i,$$
where the logistic function of $X_i$ is
$$\theta_i = f(X_i) = \frac{e^{\beta_0+\beta_1 X_i}}{1+e^{\beta_0+\beta_1 X_i}} = \frac{1}{1+e^{-(\beta_0+\beta_1 X_i)}}.$$
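As a side note (not on the slide), base R already provides the logistic function and its inverse, so the identity above can be checked numerically; the values of b0, b1, and x below are arbitrary illustrative choices.

b0 <- -1; b1 <- 0.5; x <- 1.2                   # assumed values for illustration
theta <- plogis(b0 + b1*x)                      # logistic function: 1/(1 + exp(-(b0 + b1*x)))
qlogis(theta)                                   # logit recovers b0 + b1*x = -0.4
all.equal(theta, 1/(1 + exp(-(b0 + b1*x))))     # TRUE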

SLIDE 4

Logistic regression

library(dplyr)    # for %>% and mutate()
library(ggplot2)

d <- expand.grid(b0 = c(-1,0,1), b1 = c(-1,0,1), x = seq(-4,4,by=0.1)) %>%
  mutate(theta = 1/(1+exp(-(b0+b1*x))),
         beta0 = as.factor(b0),
         beta1 = as.factor(b1))
ggplot(d, aes(x, theta, color=beta0, linetype=beta1, group=interaction(beta0,beta1))) +
  geom_line() + theme_bw() +
  labs(x="Explanatory variable (x)", y="Probability of success")

[Figure: logistic curves of θ against x for β0, β1 ∈ {−1, 0, 1}; x-axis: Explanatory variable (x), y-axis: Probability of success.]

SLIDE 5

Logistic regression

Interpretation

When $X_i = 0$,
$$E[Y_i \mid X_i = 0] = \theta_i = \frac{1}{1 + e^{-\beta_0}},$$
thus $\beta_0$ determines the probability of success when the explanatory variable is zero.

The odds of success when $X_1 = x$ are
$$\frac{\theta_1}{1-\theta_1} = e^{\beta_0 + \beta_1 x}.$$

The odds of success when $X_2 = x + 1$ are
$$\frac{\theta_2}{1-\theta_2} = e^{\beta_0 + \beta_1 (x+1)} = e^{\beta_0 + \beta_1 x + \beta_1}.$$

Thus, the multiplicative change in the odds for a one-unit increase in $x$ is
$$\frac{\theta_2/(1-\theta_2)}{\theta_1/(1-\theta_1)} = \frac{e^{\beta_0+\beta_1 x+\beta_1}}{e^{\beta_0+\beta_1 x}} = e^{\beta_1}.$$
This is also referred to as an odds ratio.
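A quick numeric illustration of this odds ratio, using assumed coefficient values (b0 = -1.5, b1 = 0.05, not taken from any fitted model):

b0 <- -1.5; b1 <- 0.05
odds_at_x   <- exp(b0 + b1*10)    # odds of success when x = 10
odds_at_xp1 <- exp(b0 + b1*11)    # odds of success when x = 11
odds_at_xp1 / odds_at_x           # multiplicative change in the odds ...
exp(b1)                           # ... equals exp(b1), the odds ratio (about 1.051)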

SLIDE 6

Logistic regression Example

Lung cancer due to smoking

To prove a causal relationship between lung cancer and smoking, there should be clear evidence of a dose response between lung cancer and smoking. But since lung cancer is binary, we need to compare the proportion of individuals who have lung cancer to those who don't amongst individuals who smoke about the same amount.

To investigate the causes of lung cancer, researchers conducted a case-control study where the 49 cases of individuals with lung cancer were matched with 98 controls from a population of residents having the same general age structure. (In case-control studies, the intercept does not have our standard interpretation because it is determined by our sampling.)

SLIDE 7

Logistic regression Example

lung_cancer <- Sleuth3::case2002 %>%
  mutate(`Lung Cancer` = ifelse(LC == "NoCancer", "No", "Yes"),
         `Cigarettes Per Day` = CD)
ggplot(lung_cancer, aes(`Cigarettes Per Day`, `Lung Cancer`)) +
  geom_jitter(height=0.1) + theme_bw()

[Figure: jittered scatterplot of Lung Cancer (No/Yes) against Cigarettes Per Day.]

SLIDE 8

Logistic regression Example

Analysis

m <- glm(`Lung Cancer` == "Yes" ~ `Cigarettes Per Day`, data = lung_cancer, family = "binomial")
summary(m)

Call:
glm(formula = `Lung Cancer` == "Yes" ~ `Cigarettes Per Day`, family = "binomial",
    data = lung_cancer)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5148  -0.9688  -0.7166   1.3449   1.8603

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)          -1.53541    0.37707  -4.072 4.66e-05 ***
`Cigarettes Per Day`  0.05113    0.01939   2.637  0.00836 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 179.62  on 145  degrees of freedom
AIC: 183.62

Number of Fisher Scoring iterations: 4
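Although not shown on the slide, the estimated slope converts directly to an odds ratio from the fitted model:

exp(coef(m))   # exp(0.05113) is roughly 1.052 for the `Cigarettes Per Day` slope:
               # each additional cigarette per day multiplies the odds of lung
               # cancer by about 1.05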

SLIDE 9

Logistic regression Example

ggplot(lung_cancer, aes(`Cigarettes Per Day`, 1*(`Lung Cancer` == "Yes"))) +
  geom_jitter(height=0.1) +
  stat_smooth(method="glm", se=FALSE, method.args = list(family="binomial")) +
  theme_bw() +
  scale_y_continuous(breaks=c(0,1), labels=c("No","Yes")) +
  labs(y = "Lung Cancer")

[Figure: jittered data (Lung Cancer No/Yes against Cigarettes Per Day) with the fitted logistic regression curve.]

SLIDE 10

Logistic regression Example

Grouping

Often data are grouped:

lung_cancer_grouped <- lung_cancer %>%
  group_by(`Cigarettes Per Day`) %>%
  summarize(`Number of individuals` = n(),
            `Number with lung cancer` = sum(`Lung Cancer` == "Yes"),
            `Number without lung cancer` = sum(`Lung Cancer` == "No"),
            `Proportion with lung cancer` = `Number with lung cancer`/`Number of individuals`)
lung_cancer_grouped

# A tibble: 19 x 5, with columns `Cigarettes Per Day`, `Number of individuals`,
# `Number with lung cancer`, `Number without lung cancer`, and
# `Proportion with lung cancer`; one row per observed value of cigarettes per day
# (printout abridged).

SLIDE 11

Logistic regression Example

Binomial distribution

The probability mass function of the binomial distribution is
$$P(Y = y) = \binom{n}{y}\,\theta^y(1-\theta)^{n-y}, \qquad y = 0, 1, 2, \ldots, n.$$
Properties: $E[Y] = n\theta$ and $V[Y] = n\theta(1-\theta)$.
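These moments can be verified numerically; a small sketch (not part of the slides) for n = 10 and θ = 0.3:

n <- 10; theta <- 0.3
y <- 0:n
p <- dbinom(y, n, theta)
sum(y * p)                   # 3   = n*theta
sum((y - n*theta)^2 * p)     # 2.1 = n*theta*(1 - theta)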

xx <- 0:10
plot(xx, dbinom(xx, 10, .3),
     main = "Probability mass function for Bin(10,.3)",
     xlab = "y", ylab = "P(Y=y)", pch = 19)

[Figure: probability mass function for Bin(10, 0.3); x-axis: y, y-axis: P(Y = y).]

SLIDE 12

Logistic regression Example

Logistic regression for grouped data

Let $Y_i$ be the number of successes out of $n_i$ attempts in group $i$. Then a logistic regression model is
$$Y_i \overset{ind}{\sim} \mathrm{Bin}(n_i, \theta_i), \qquad \mathrm{logit}(\theta_i) = \log\left(\frac{\theta_i}{1-\theta_i}\right) = \beta_0 + \beta_1 X_i,$$
where $Y_i$ is an integer from 0 to $n_i$ and Bin refers to the binomial distribution.
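One way to see that glm() recovers these parameters from grouped data is a small simulation; the coefficients, group sizes, and covariate values below are made up for illustration.

set.seed(1)
b0 <- -2; b1 <- 0.1                        # assumed true coefficients
x  <- rep(seq(0, 40, by = 5), each = 3)    # group-level explanatory variable
n  <- rpois(length(x), 20) + 1             # attempts per group
theta <- 1/(1 + exp(-(b0 + b1*x)))         # group success probabilities
y  <- rbinom(length(x), size = n, prob = theta)
coef(glm(cbind(y, n - y) ~ x, family = "binomial"))   # roughly (-2, 0.1)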

SLIDE 13

Logistic regression Example

Logistic regression in R

m <- glm(cbind(`Number with lung cancer`, `Number without lung cancer`) ~ `Cigarettes Per Day`,
         data = lung_cancer_grouped, family = "binomial")
summary(m)

Call:
glm(formula = cbind(`Number with lung cancer`, `Number without lung cancer`) ~
    `Cigarettes Per Day`, family = "binomial", data = lung_cancer_grouped)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5148  -1.0253  -0.5070   0.3305   1.7922

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)          -1.53541    0.37707  -4.072 4.66e-05 ***
`Cigarettes Per Day`  0.05113    0.01939   2.637  0.00836 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28.651  on 18  degrees of freedom
Residual deviance: 21.141  on 17  degrees of freedom
AIC: 48.879

Number of Fisher Scoring iterations: 4

confint(m)
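Exponentiating the profile-likelihood limits from confint(m) gives a confidence interval on the odds-ratio scale; a usage sketch:

exp(confint(m))   # second row: a 95% interval for the odds ratio per additional
                  # cigarette per day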

SLIDE 14

Logistic regression Multiple explanatory variables

Effect of birdkeeping on lung cancer

The data set we have been analyzing was actually constructed to investigate the relationship between birdkeeping and lung cancer. But since we know smoking increases the probability of developing lung cancer, we want to control for the effect of smoking when assessing the effect of birdkeeping. Thus, we will run a logistic regression with both smoking and birdkeeping to determine the effect of birdkeeping on lung cancer.

SLIDE 15

Logistic regression Multiple explanatory variables

Summarize data

lung_cancer_bird <- Sleuth3::case2002 %>%
  group_by(CD, BK) %>%
  summarize(y = sum(LC == "LungCancer"),
            n = n(),
            p = y/n)
lung_cancer_bird

# A tibble: 30 x 5
# Groups:   CD [?]
      CD BK         y     n     p
   <int> <fct>  <int> <int> <dbl>
 1     0 Bird       1     8 0.125
 2     0 NoBird     0     9 0
 3     1 NoBird     0     2 0
 4     2 NoBird     0     2 0
 5     3 NoBird     1     1 1.00
 6     4 Bird       0     1 0
 7     4 NoBird     0     1 0
 8     5 Bird       0     1 0
 9     5 NoBird     0     1 0
10     6 Bird       1     1 1.00
# ... with 20 more rows

SLIDE 16

Logistic regression Multiple explanatory variables

Visualize data

ggplot(lung_cancer_bird, aes(CD, p, size=n, color=BK, shape=BK)) +
  geom_point() + theme_bw() +
  labs(x="Cigarettes per day", y="Proportion with lung cancer")

[Figure: proportion with lung cancer against cigarettes per day; point size = n, color/shape = BK (Bird, NoBird).]

SLIDE 17

Logistic regression Multiple explanatory variables

Model

Let $Y_i$ be the number of successes out of $n_i$ attempts in group $i$ with explanatory variables $X_{i,1}$ and $X_{i,2}$. Then a logistic regression model is
$$Y_i \overset{ind}{\sim} \mathrm{Bin}(n_i, \theta_i), \qquad \mathrm{logit}(\theta_i) = \log\left(\frac{\theta_i}{1-\theta_i}\right) = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2}.$$

The interpretation is:

  • The probability of success is $\frac{1}{1+e^{-\beta_0}}$ when all explanatory variables are zero (except in a case-control study).
  • The odds ratio for a one-unit increase in $X_{i,1}$, holding all other explanatory variables constant, is $e^{\beta_1}$.
  • The odds ratio for a one-unit increase in $X_{i,2}$, holding all other explanatory variables constant, is $e^{\beta_2}$.

SLIDE 18

Logistic regression Multiple explanatory variables

Logistic regression with multiple explanatory variables

# LC is binary
summary(m <- glm(cbind(y, n-y) ~ CD + BK, data = lung_cancer_bird, family = "binomial"))

Call:
glm(formula = cbind(y, n - y) ~ CD + BK, family = "binomial", data = lung_cancer_bird)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7167  -0.9555  -0.5413   0.4025   2.1594

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.94683    0.41319  -2.291 0.021935 *
CD           0.05838    0.02087   2.797 0.005157 **
BKNoBird    -1.45760    0.38856  -3.751 0.000176 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 53.386  on 29  degrees of freedom
Residual deviance: 30.612  on 27  degrees of freedom
AIC: 66.07

Number of Fisher Scoring iterations: 4
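A follow-up sketch (not on the slide) for reading these estimates on the odds scale; BKNoBird compares non-birdkeepers to birdkeepers:

exp(coef(m))   # exp(0.05838)  is roughly 1.06: odds ratio per additional cigarette per day
               # exp(-1.45760) is roughly 0.23: non-birdkeepers have about 0.23 times the
               # odds of lung cancer of birdkeepers (birdkeepers have roughly 4.3 times
               # the odds), holding cigarettes per day constant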

SLIDE 19

Logistic regression Multiple explanatory variables

nd <- expand.grid(CD = 0:45, BK = c("Bird","NoBird"))
pd <- cbind(nd, data.frame(p = predict(m, newdata = nd, type = "response")))
ggplot() +
  geom_point(data = lung_cancer_bird, aes(CD, p, size=n, color=BK, shape=BK)) +
  geom_line(data = pd, aes(CD, p, color=BK, linetype=BK)) +
  theme_bw() +
  labs(x="Cigarettes per day", y="Proportion with lung cancer")

[Figure: observed proportions (points, size = n) and fitted probabilities (lines) of lung cancer against cigarettes per day, by BK (Bird, NoBird).]
