An Introduction to Logistic Regression
Emily Hector
University of Michigan
June 19, 2019
Modeling Data

Considerations when modeling data:
- Type of outcome: continuous, binary, counts, ...
- Dependence structure of outcomes: independent observations, or correlated observations / repeated measures
- Number of covariates and potential confounders
- Controlling for confounders that could lead to spurious results
- Sample size
- Linear regression is the type of regression we use for a continuous outcome.
- Logistic regression is the type of regression we use for a binary outcome.

Outline:
- Bernoulli distribution
- Linear regression
The Bernoulli distribution:
- Y ~ Bernoulli(p) takes values in {0, 1}, e.g. a coin toss.
- Y = 1 for a success, Y = 0 for a failure.
- p = probability of success, i.e. p = P(Y = 1); e.g. p = 1/2 = P(heads).
- Mean is p, variance is p(1 − p).
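As a quick sanity check (an illustrative sketch, not part of the slides), simulating Bernoulli draws recovers the stated mean and variance:

```python
import random

random.seed(19)

p = 0.5          # probability of success, e.g. P(heads) for a fair coin
n = 100_000      # number of simulated tosses
tosses = [1 if random.random() < p else 0 for _ in range(n)]

mean = sum(tosses) / n                           # should be close to p
var = sum((t - mean) ** 2 for t in tosses) / n   # should be close to p * (1 - p)
```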
- When do we use linear regression?

[Figure: scatter plot of a continuous outcome Y against X with a fitted regression line.]

- How can this model break down?
Trying linear regression on a binary outcome:
- Yi | Xi ~ Bernoulli(pXi)
- E(Yi) = β0 + β1 Xi = pXi

Do the linear regression assumptions hold?
- Linear relationship between X and Y?
- Normally distributed errors?
- Constant variance of Y?
- Is p̂ constrained to [0, 1]?

[Figure: binary outcome Y (0/1) plotted against X with a fitted regression line.]
- The relationship between X and Y is not linear.
- The response Y is not normally distributed.
- The variance of a Bernoulli random variable depends on its mean, so the constant-variance assumption fails.
- Fitted values of Y may fall outside [0, 1], since linear models produce fitted values in (−∞, ∞).
- Instead of modeling Y, model p = P(Y = 1).
- Use a function that constrains the fitted probabilities to lie in [0, 1].
The logistic regression model:
- Let Y be a binary outcome and X a covariate/predictor.
- We are interested in modeling px = P(Y = 1 | X = x), i.e. the probability of success at a given value x of the covariate.
- Model the log odds as linear in X: log[pX / (1 − pX)] = β0 + β1 X.
- Equivalently, pX = e^(β0 + β1 X) / (1 + e^(β0 + β1 X)).
- Since lim_{x→−∞} e^x / (1 + e^x) = 0 and lim_{x→+∞} e^x / (1 + e^x) = 1, we have 0 ≤ px ≤ 1.
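A short numerical check (illustrative only) that the inverse logit e^x / (1 + e^x) stays between 0 and 1:

```python
import math

def inv_logit(x):
    """Inverse logit: maps the linear predictor to a probability in (0, 1)."""
    return math.exp(x) / (1 + math.exp(x))

# approaches 0 as x -> -infinity, equals 0.5 at x = 0, approaches 1 as x -> +infinity
probs = [inv_logit(x) for x in (-10, 0, 10)]
```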
Likelihood and estimation:
- Assume the Yi | Xi ~ Bernoulli(pxi) are independent.
- Binomial likelihood: L(px | Y, X) = ∏_{i=1}^{N} pxi^{yi} (1 − pxi)^{1 − yi}.
- Binomial log-likelihood: ℓ = Σ_{i=1}^{N} { yi log[pxi / (1 − pxi)] + log(1 − pxi) }.
- Logistic regression log-likelihood: ℓ(β0, β1) = Σ_{i=1}^{N} { yi (β0 + β1 xi) − log(1 + e^(β0 + β1 xi)) }.
- There is no closed-form solution for the maximum likelihood estimates of β0 and β1.
- Numerical maximization techniques are required.
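Since the log-likelihood has no closed-form maximizer, it is maximized numerically. Below is a minimal gradient-ascent sketch (the toy data and step size are my own illustrations, not from the slides); statistical software typically uses Newton-type methods such as iteratively reweighted least squares instead:

```python
import math

# toy data (hypothetical): success tends to become more likely as x grows
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0,   0,   1,   0,   1,   1]

def log_lik(b0, b1):
    # sum_i [ y_i * (b0 + b1*x_i) - log(1 + e^(b0 + b1*x_i)) ]
    return sum(yi * (b0 + b1 * xi) - math.log(1 + math.exp(b0 + b1 * xi))
               for xi, yi in zip(x, y))

# simple gradient ascent on the log-likelihood
b0, b1, lr = 0.0, 0.0, 0.02
for _ in range(10_000):
    p = [math.exp(b0 + b1 * xi) / (1 + math.exp(b0 + b1 * xi)) for xi in x]
    b0 += lr * sum(yi - pi for yi, pi in zip(y, p))                  # d ell / d b0
    b1 += lr * sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))    # d ell / d b1
```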
- pX / (1 − pX) is called the odds of success.
- log[pX / (1 − pX)] is the log odds, which the logistic regression model sets equal to β0 + β1 X.
- Since p ∈ [0, 1], the log odds log[p/(1 − p)] ranges over (−∞, ∞).
- So while linear regression estimates anything in (−∞, +∞), logistic regression estimates a proportion in [0, 1].
- The odds of an event are defined as P(Y = 1) / [1 − P(Y = 1)].
For a 2×2 table with a events and b non-events in one group, and c events and d non-events in the other, the odds ratio is

OR = [a/(a+b) ÷ b/(a+b)] / [c/(c+d) ÷ d/(c+d)] = (a/b) / (c/d) = ad / bc.
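The 2×2 formula reduces to ad/bc; a tiny helper (with made-up counts) illustrates it:

```python
def odds_ratio(a, b, c, d):
    """OR from a 2x2 table: (a/b) / (c/d) = (a*d) / (b*c).
    a, b = events / non-events in group 1; c, d = events / non-events in group 2."""
    return (a * d) / (b * c)

# hypothetical counts: 20/80 events in group 1 vs 10/90 in group 2
example_or = odds_ratio(20, 80, 10, 90)  # (20/80) / (10/90) = 2.25
```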
- Odds ratios (OR) can be useful for comparisons.
- Suppose we have a trial to see if an intervention T reduces the odds of an adverse event relative to placebo.
- The OR (intervention vs. placebo) describes the benefit of intervention T:
  - OR < 1: the intervention is better than the placebo, since the odds of the event are lower under the intervention.
  - OR = 1: there is no difference between the intervention and the placebo.
  - OR > 1: the intervention is worse than the placebo, since the odds of the event are higher under the intervention.
Interpreting β0:
- β0 is the log of the odds of success at zero values for all covariates.
- e^β0 / (1 + e^β0) is the probability of success at zero values for all covariates.
- The interpretation of e^β0 / (1 + e^β0) depends on the sampling of the dataset:
  - Population cohort: disease prevalence at X = x.
  - Case-control: ratio of cases to controls at X = x.
Interpreting β1: for a one-unit increase in X, the odds ratio is

[pX+1 / (1 − pX+1)] / [pX / (1 − pX)] = e^β1.

- If β1 = 0, there is no association between changes in X and changes in the odds of success.
- If β1 > 0, there is a positive association between X and p.
- If β1 < 0, there is a negative association between X and p.
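We can verify numerically that the odds ratio for a one-unit increase in X equals e^β1 at any X (the coefficients below are arbitrary illustrations):

```python
import math

b0, b1 = -0.5, 0.7  # arbitrary illustrative coefficients

def p(x):
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

def odds(x):
    return p(x) / (1 - p(x))

# the ratio odds(x + 1) / odds(x) equals e^b1, regardless of x
ratios = [odds(xv + 1) / odds(xv) for xv in (-2.0, 0.0, 3.0)]
```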
- OR > 1: positive relationship; as X increases, the probability of Y = 1 increases.
- OR < 1: negative relationship; as X increases, the probability of Y = 1 decreases.
- OR = 1: no association; exposure (X = 1) does not affect the odds of the outcome.
- The OR is the ratio of the odds of success under two settings:

OR = [p1 / (1 − p1)] / [p2 / (1 − p2)].

- OR = 1 when p1 = p2.
- The interpretation of odds ratios depends on the underlying probabilities of success.

[Figure: odds ratios (solid lines) and log odds ratios (dashed lines) plotted against the probability of success p1; OR = 1 corresponds to log(OR) = 0.]
A simulated example:
- Let X1 be a continuous variable and X2 an indicator variable.
- Set β0 = −0.5, β1 = 0.7, β2 = 2.5.
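A simulation following this setup (my own sketch; the distribution of X1 and the group sizes are assumptions, not from the slides) shows the indicator X2 sharply raising the success probability:

```python
import math
import random

random.seed(2019)
b0, b1, b2 = -0.5, 0.7, 2.5  # coefficients from the slide

def inv_logit(eta):
    return math.exp(eta) / (1 + math.exp(eta))

# X1 continuous (here uniform on [0, 2]), X2 a 0/1 indicator
data = []
for _ in range(20_000):
    x1 = random.uniform(0.0, 2.0)
    x2 = random.randint(0, 1)
    y = 1 if random.random() < inv_logit(b0 + b1 * x1 + b2 * x2) else 0
    data.append((x1, x2, y))

# empirical P(Y = 1) within each X2 group
p_hat = {g: sum(y for _, x2, y in data if x2 == g) /
            sum(1 for _, x2, y in data if x2 == g)
         for g in (0, 1)}
```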
Example: the Western Collaborative Group Study (WCGS)
- The WCGS was a prospective cohort study of 3524 men.
- Follow-up for CHD incidence was terminated in 1969.
- 3154 men were CHD-free at baseline.
- 275 men developed CHD during the study.
- The estimated probability that a person in WCGS develops CHD is 275/3154 ≈ 0.087.
- This is an unadjusted estimate that does not account for other covariates.
- How do we use logistic regression to determine factors associated with CHD?
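The unadjusted estimate is simply the sample proportion (numbers from the slide):

```python
cases, at_risk = 275, 3154   # CHD cases among the men who were CHD-free at baseline
p_hat = cases / at_risk      # unadjusted estimated probability of developing CHD
```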
- The effect of hypertension (HT) is significant (p = 3.68 × 10^−6).
- The odds of developing CHD are 1.92 times higher for those with hypertension than for those without.
- Yes, CHD risk is significantly associated with increased age.
- The OR = exp(0.0744) = 1.08; 95% C.I. (1.05, 1.10).
- For a 1-year increase in age, the log odds of a CHD event increase by 0.0744.
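The quoted OR is just the exponentiated coefficient:

```python
import math

b_age = 0.0744               # fitted log-odds increase per 1-year increase in age
or_age = math.exp(b_age)     # odds ratio per year of age, about 1.08
```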
- (0.6518 − 0.5684)/0.6518 = 12.8%.
- Since the change in effect size is > 10%, we would consider the additional variable a confounder and retain it in the model.
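The percent change in the coefficient (values from the slides) can be checked directly:

```python
unadjusted, adjusted = 0.6518, 0.5684                     # coefficient before and after adjustment
pct_change = (unadjusted - adjusted) / unadjusted * 100   # about 12.8%
```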
- The interaction effect is significant (p = 0.0386).
- The odds ratio for a 1 lb. increase in weight differs between those without hypertension and those with hypertension.
- The effect of increasing weight on CHD risk is different between those with and without hypertension.
- For those without hypertension, an increase in weight leads to an increase in CHD risk.
- For those with hypertension, the risk of CHD is nearly constant across weight.
Plotting predicted probabilities:
- Fit the model and obtain the estimated coefficients.
- Calculate the predicted probability p̂ at each value of X.

[Figure: predicted probability (0 to 1) plotted against values of X from −10 to 10.]
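Given fitted coefficients (the values below are placeholders for illustration, not the WCGS fit), predicted probabilities are computed by plugging the linear predictor into the inverse logit:

```python
import math

b0_hat, b1_hat = -0.5, 0.7   # placeholder fitted coefficients, for illustration

def predicted_prob(x):
    """p-hat(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    eta = b0_hat + b1_hat * x
    return math.exp(eta) / (1 + math.exp(eta))

curve = [predicted_prob(x) for x in range(-10, 11)]  # S-shaped curve in (0, 1)
```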
Summary:
- Logistic regression models the log of the odds of an outcome.
- It is used when the outcome is binary.
- We interpret odds ratios (exponentiated coefficients) from logistic regression models.
- We can control for confounding factors and assess interactions in logistic regression.
- Many of the concepts that apply to multiple linear regression also apply to logistic regression.