Lecture 18: Review Lecture
Ani Manichaikul
amanicha@jhsph.edu
15 May 2007
Types of Biostatistics
- 1) Descriptive Statistics
  - Exploratory Data Analysis: often not in the literature
  - Summaries: "Table 1" in a paper
  - Goal: visualize relationships, generate hypotheses
Types of Biostatistics
- 2) Inferential Statistics
  - Confirmatory Data Analysis: the Methods section of a paper
  - Goal: quantify relationships, test hypotheses
Approach to Modeling
A general approach for most statistical modeling is to:
- Define the population of interest
- State the scientific questions & underlying theories
- Describe and explore the observed data
- Define the model
  - Probability part (models the randomness / noise)
  - Systematic part (models the expectation / signal)
Approach to Modeling
- Estimate the parameters in the model
- Fit the model to the observed data
- Make inferences about covariates
- Check the validity of the model
  - Verify the model assumptions
  - Re-define, re-fit, and re-check the model if necessary
- Interpret the results of the analysis in terms of the scientific questions of interest
Stem-and-Leaf Plots
- Age in years (10 observations): 25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Age Interval    Observations (leaves)
20-29           5 6 9
30-39           2 5 6 8
40-49           4 9
50-59           1
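The grouping above can be sketched in a few lines of Python (a minimal illustration, not part of the original slides): split each age into a stem (tens digit) and leaf (ones digit).

```python
def stem_and_leaf(values):
    """Group sorted values into stems (tens digit) and leaves (ones digit)."""
    stems = {}
    for v in sorted(values):
        stems.setdefault(v // 10, []).append(v % 10)
    return stems

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
plot = stem_and_leaf(ages)
# plot[2] == [5, 6, 9]; plot[3] == [2, 5, 6, 8]; plot[4] == [4, 9]; plot[5] == [1]
```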
Grouping: Frequency Distribution Tables
- Shows the number of observations for each range of data
- Intervals can be chosen in ways similar to stem-and-leaf displays

Age Interval    Frequency
20-29           3
30-39           4
40-49           2
50-59           1
Histograms
- Pictures of the frequency or relative frequency distribution

[Figure: Histogram of Age — Frequency by Age Category]
Box-and-Whisker Plots
[Figure: Box Plot of Age — Age in Years, roughly 25 to 50]

- IQR = 44 - 29 = 15
- Upper Fence = 44 + 15 * 1.5 = 66.5
- Lower Fence = 29 - 15 * 1.5 = 6.5
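The fence arithmetic above is easy to check with a short sketch (quartile values 29 and 44 taken from the slide):

```python
# Quartiles from the box plot slide: Q1 = 29, Q3 = 44.
q1, q3 = 29, 44
iqr = q3 - q1                      # interquartile range: 15
upper_fence = q3 + 1.5 * iqr       # points above this are outliers: 66.5
lower_fence = q1 - 1.5 * iqr       # points below this are outliers: 6.5
```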
2 Continuous Variables
- Scatterplot
- Scatterplots visually display the relationship between two continuous variables

[Figure: Scatterplot of Age (years, 25-50) by Height (cm, 150-190)]
Why is the power of a test important?
- Power indicates the chance of finding a "significant" difference when there really is one
- Low power: likely to obtain non-significant results even when real differences exist
- High power is desirable!
- Low power is usually caused by small sample size
We’re not always right
Errors in Hypothesis Testing α
- Aim: keep the Type I error small by specifying a small rejection region
- α is set before performing a test, usually at 0.05
Errors in Hypothesis Testing β
- Aim: keep the Type II error small and thus power high

β: Probability of Type II Error

- The value of β is usually unknown, since it depends on a specified alternative value
- β depends on sample size and α
- Before data collection, scientists decide:
  - the test they will perform
  - α
  - the desired β
- They use this information to choose the sample size
P-Values
- Definition: the p-value for a hypothesis test is the probability of obtaining by chance alone, when H0 is true, a value of the test statistic as extreme as or more extreme than (in the appropriate direction) the one actually observed.
Steps of Hypothesis Testing
- Define the null hypothesis, H0
- Define the alternative hypothesis, Ha, where Ha is usually of the form "not H0"
- Define the Type I error, α, usually 0.05
- Calculate the test statistic
- Calculate the p-value
- If the p-value is less than α, reject H0; otherwise, fail to reject H0
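The steps above can be sketched as code. This is a minimal illustration for a two-sided one-sample z test (σ known), with entirely made-up numbers:

```python
import math

def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z test, following the steps above."""
    se = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / se                # test statistic
    p = math.erfc(abs(z) / math.sqrt(2))        # two-sided p-value from N(0,1)
    return z, p, p < alpha                      # reject H0 if p < alpha

# Hypothetical example: H0: mu = 50 vs Ha: mu != 50
z, p, reject = z_test(sample_mean=53.0, mu0=50.0, sigma=10.0, n=64)
# z = 2.4; p is about 0.016, so H0 is rejected at alpha = 0.05
```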
Why use linear regression?
- Linear regression is very powerful. It can be used for many things:
  - Binary X
  - Continuous X
  - Categorical X
  - Adjustment for confounding
  - Interaction
  - Curved relationships between X and Y
SLR: Y = β0 + β1X1

- Linear regression is used for continuous outcome variables
- β0: mean outcome when X = 0 (center!)
- Binary X = "dummy variable" for group
  - β1: mean difference in outcome between groups
- Continuous X
  - β1: mean difference in outcome corresponding to a 1-unit increase in X
  - Center X to give meaning to β0
- Test β1 = 0 in the population
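For concreteness, the least-squares estimates of β0 and β1 can be computed by hand. A short sketch with made-up data (not from the lecture):

```python
def slr_fit(x, y):
    """Least-squares estimates for Y = b0 + b1*X."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx                # slope: covariance over variance of X
    b0 = ybar - b1 * xbar         # intercept: line passes through the means
    return b0, b1

# Made-up data lying exactly on y = 2 + 3x
b0, b1 = slr_fit([0, 1, 2, 3], [2, 5, 8, 11])   # b0 = 2.0, b1 = 3.0
```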
Assumptions of Linear Regression
- L: Linear relationship
- I: Independent observations
- N: Normally distributed around the line
- E: Equal variance across X's
In Simple Linear Regression
- In simple linear regression (SLR):
  - One predictor / covariate / explanatory variable: X
- In multiple linear regression (MLR):
  - Same assumptions as SLR (i.e., L.I.N.E.), but:
  - More than one covariate: X1, X2, X3, ..., Xp
Model:
- Y ~ N(µ, σ²)
- µ = E(Y | X) = β0 + β1X1 + β2X2 + β3X3 + ... + βpXp
Regression Methods
Nested models
- One model is nested within another if the parent model contains one set of variables and the extended model contains all of the original variables plus one or more additional variables.
Difference in assessing variables: “nested models”
- other predictor(s)
  - assess with a t test if a single variable defines the predictor
  - assess with an F test (today) if two or more variables are needed to define the predictor
- potential confounder(s)
  - compare the CI of the primary predictor to see whether the new parameter is significantly different
The F test
F_obs = [(RSS_smaller - RSS_larger) / (# added variables)] / [RSS_larger / (residual df of larger model)]

F_obs = [(69.6 - 49.8) / 2] / [49.8 / 22] ≈ 4.4

What is F_cr?
H0: all new β's = 0 in the population
HA: at least one new β is not 0 in the population
The F test: notes
- The F test can be used to compare any two nested models
- If only one variable is added, it's easier to compare the models using the t test for that variable
  - t² = F if one variable is added
- For any regression, the estimated variance of the residuals is RSS / (residual df)
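The F calculation from the example slide can be verified directly (RSS values 69.6 and 49.8, 2 added variables, 22 residual df, as on the slide):

```python
def f_stat(rss_smaller, rss_larger, n_added, df_resid_larger):
    """F statistic comparing two nested models (larger model has smaller RSS)."""
    return ((rss_smaller - rss_larger) / n_added) / (rss_larger / df_resid_larger)

f = f_stat(69.6, 49.8, n_added=2, df_resid_larger=22)
# rounds to the slide's value of 4.4
```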
Nested Models
- Comparing nested models
  - 1 new variable: use the t test for that variable
  - 2+ new variables: use the F test
- Categorical predictor
  - set one group as the reference
  - create a dummy variable for each other group
  - include/exclude all dummy variables together
  - evaluate the categorical predictor with an F test
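The dummy-variable coding described above can be sketched in a few lines; the `race` variable and its levels here are a hypothetical example, not data from the lecture:

```python
def dummy_code(values, reference):
    """One 0/1 dummy column per non-reference category;
    the reference group is coded 0 in every column."""
    levels = [v for v in sorted(set(values)) if v != reference]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

race = ["white", "black", "other", "white", "black"]
dummies = dummy_code(race, reference="white")
# dummies["black"] == [0, 1, 0, 0, 1]; dummies["other"] == [0, 0, 1, 0, 0]
```

In a regression, all of these columns are included or excluded together, and the categorical predictor as a whole is evaluated with an F test.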
Effect Modification
- In linear regression, effect modification is a way of allowing the association between the primary predictor and the outcome to change with the level of another predictor.
- If the third predictor is binary, this results in a graph in which the two lines (for the two groups) are no longer parallel.
Splines and Quadratic Terms
- Splines are used to allow the regression line to bend
  - the breakpoint is arbitrary and decided graphically or by hypothesis
  - the actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e., the change in slope)
- A quadratic term allows for curvature in the model
Logistic regression
- For binary outcomes
- Model the log odds of the outcome probability, which we also call the logit
- The baseline term is interpreted as a log odds
- The other coefficients are log odds ratios
Logistic regression model
log( odds(Relief | Tx) ) = log[ P(relief | Tx) / P(no relief | Tx) ] = β0 + β1·Tx

where: Tx = 1 if Drug, 0 if Placebo
Then…
- log( odds(Relief|Drug) ) = β0 + β1
- log( odds(Relief|Placebo) ) = β0
- log( odds(R|D) ) - log( odds(R|P) ) = β1

And…

- Thus: log[ odds(R|D) / odds(R|P) ] = β1
- And: OR = exp(β1) = e^β1 !!
- So: exp(β1) = the odds ratio of relief for patients taking the Drug vs. patients taking the Placebo.
Logistic Regression
Logit estimates                           Number of obs = 70
                                          LR chi2(1)    = 2.83
                                          Prob > chi2   = 0.0926
Log likelihood = -46.99169                Pseudo R2     = 0.0292

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
    drug |   .8137752   .4889211    1.66   0.096    -.1444926    1.772043
   _cons |  -.2876821    .341565   -0.84   0.400    -.9571372    .3817731
------------------------------------------------------------------------------

Estimates:
log( odds(relief) ) = β̂0 + β̂1·Drug = -0.288 + 0.814(Drug)

Therefore: OR = exp(0.814) = 2.26 !
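The estimated odds ratio follows directly from the fitted coefficients shown above; a quick check in Python:

```python
import math

# Coefficient estimates from the logistic output above: logit = b0 + b1*drug
b0, b1 = -0.2876821, 0.8137752

odds_placebo = math.exp(b0)        # odds of relief in the placebo group
odds_drug = math.exp(b0 + b1)      # odds of relief in the drug group
OR = odds_drug / odds_placebo      # equals exp(b1), about 2.26
```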
Adding other variables
- What if Pr(relief) = a function of Drug vs. Placebo AND Age?
- We could easily include age in a model such as:
  log( odds(relief) ) = β0 + β1·Drug + β2·Age
Logistic Regression
- As in MLR, we can include many additional covariates.
- For a logistic regression model with p predictors:
  log( odds(Y=1) ) = β0 + β1X1 + ... + βpXp
  where: odds(Y=1) = Pr(Y=1) / [1 - Pr(Y=1)] = Pr(Y=1) / Pr(Y=0)
Types of interpretation
- β0 + β1 = ln(odds) (for X = 1)
- β1 = difference in log odds
- e^(β0+β1) = odds (for X = 1)
- e^β1 = odds ratio
- But we started with P(Y = 1). Can we find that?
More useful math
- odds = probability / (1 - probability)
- probability = odds / (1 + odds)
- so the probability for X = 1 is:
  p = e^(β0+β1) / (1 + e^(β0+β1))
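These conversions between odds and probability are one-liners; a sketch applying them to the drug example's estimates (-0.288 and 0.814, taken from the fitted model earlier):

```python
import math

def logit_to_prob(log_odds):
    """p = e^L / (1 + e^L), where L is the log odds."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

def prob_to_odds(p):
    """odds = p / (1 - p)."""
    return p / (1 - p)

p_drug = logit_to_prob(-0.288 + 0.814)   # estimated P(relief | Drug), about 0.63
```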
Nested models
- Adding a single new variable to the model
  - null model: ln[ p / (1 - p) ] = β0 + β1(Age - 30)
  - full model: ln[ p / (1 - p) ] = β0 + β1(Age - 30) + β2·Multivitamin
Comparing nested models that differ by one variable
- Compare models with a p-value or CI
- What method is this?
  - The Wald test, a test that applies the CLT, like the Z test comparing proportions in a 2x2 table
  - analogous to the t test for linear regression
- H0: the new variable is not needed
  - or H0: βnew = 0 in the population
Conclusion from the Wald test
- The p-value for multivitamin is 0.007 (< 0.05), and the CI for the multivitamin coefficient does not include 0 (the CI for the OR doesn't include 1)
- Reject H0
- Conclude that the larger model is better: after adjusting for age, multivitamin use is still an important predictor of physician visits in the population
Interpretation - log odds
- β0: the log odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
- β1: the log odds ratio of not visiting a physician for a one-year increase in age, controlling for multivitamin use
- β2: the log odds ratio of not visiting a physician for those who take multivitamins compared with those who do not, adjusting for age
Interpretation – odds and odds ratio

- exp{β0}: the odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
Interpretation – odds and odds ratio

- exp{β1}: after adjusting for multivitamin use, the odds ratio of not visiting a physician changes by a factor of exp{β1} = 1.001 for each additional year of age
- additional age is associated with lower frequency of physician visits in these students, but the association is not statistically significant (p > 0.05)
Interpretation – odds and odds ratio

- exp{β2}: the odds ratio of not visiting a physician for those who take multivitamins compared with those who do not is exp{β2} = 0.46, adjusting for age
- taking multivitamins is associated with regular physician visits (p = 0.007)
Interpretation In General
- Also: log[ odds(Y=1 | X1+1, X2) / odds(Y=1 | X1, X2) ] = β1
- And: OR = exp(β1) !!
- exp(β1) is the multiplicative change in odds for a 1-unit increase in X1, provided X2 is held constant.
- The result is similar for X2
CHD by smoking and coffee
- Yi = 1 if CHD case, 0 if control
- COFi = 1 if coffee drinker, 0 if not
- SMKi = 1 if smoker, 0 if not
- pi = Pr(Yi = 1)
- ni = number observed at pattern i of the X's
Logistic Regression Model
- Yi are from a Binomial(ni, pi) distribution
- Yi are independent
- log odds(Yi = 1) (or, logit(Yi = 1)) is a function of:
  - coffee
  - smoking
  - and the coffee x smoking interaction
Logistic Regression Model
- Which implies that Pr(Yi = 1) is the logistic function:

  pi = e^(β0 + β1·COFi + β2·SMKi + β3·COFi·SMKi) / [1 + e^(β0 + β1·COFi + β2·SMKi + β3·COFi·SMKi)]

  log[ pi / (1 - pi) ] = β0 + β1·COFi + β2·SMKi + β3·COFi·SMKi
Interpretations
- exp{β1}: odds ratio of being a CHD case for coffee drinkers vs. non-drinkers among non-smokers
- exp{β1 + β3}: odds ratio of being a CHD case for coffee drinkers vs. non-drinkers among smokers
Interpretations
- exp{β2}: odds ratio of being a CHD case for smokers vs. non-smokers among non-coffee drinkers
- exp{β2 + β3}: odds ratio of being a CHD case for smokers vs. non-smokers among coffee drinkers
Interpretations
- e^β0 / (1 + e^β0): the fraction of cases among non-smoking, non-coffee-drinking individuals in the sample (determined by the sampling plan)
- exp{β3}: ratio of odds ratios
exp{β3} Interpretations

- exp{β3}: the factor by which the odds ratio of being a CHD case for coffee drinkers vs. non-drinkers is multiplied for smokers as compared to non-smokers

or

- exp{β3}: the factor by which the odds ratio of being a CHD case for smokers vs. non-smokers is multiplied for coffee drinkers as compared to non-coffee drinkers
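These relationships can be checked numerically. A sketch using the coefficient estimates reported later for the interaction model (Coffee 0.69, Smoking 1.3, Coffee*Smoking -0.43):

```python
import math

# Estimates for: logit(p) = b0 + b1*COF + b2*SMK + b3*COF*SMK
b1, b2, b3 = 0.69, 1.3, -0.43

or_coffee_nonsmokers = math.exp(b1)      # coffee OR among non-smokers
or_coffee_smokers = math.exp(b1 + b3)    # coffee OR among smokers
ratio_of_ors = math.exp(b3)              # how the coffee OR changes with smoking

# or_coffee_smokers == or_coffee_nonsmokers * ratio_of_ors
```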
Some Special Cases
- Given: log[ Pr(Y=1) / Pr(Y=0) ] = β0 + β1·COF + β2·SMK + β3·COF*SMK
- If β1 = β2 = β3 = 0: neither smoking nor coffee drinking is associated with increased risk of CHD

Some Special Cases

- Given the same model, if β1 = β3 = 0: smoking, but not coffee drinking, is associated with increased risk of CHD
Some Special Cases
- If β3 = 0:
  - Smoking and coffee drinking are both associated with risk of CHD, but the odds ratio of CHD-smoking is the same at all levels of coffee
  - Smoking and coffee drinking are both associated with risk of CHD, but the odds ratio of CHD-coffee is the same at all levels of smoking
Confounding
- In epidemiological terms, Z is a "confounder" of the relationship of Y with X if Z is related to both X and Y, and Z is not in the causal pathway between X and Y
- In statistical terms, Z is a "confounder" of the relationship of Y with X if the X coefficient changes when Z is added to a regression of Y on X
Confounding
- For example, consider the two models:
  Y = α0 + α1X + ε1
  Y = β0 + β1X + β2Z + ε2
- then Z is a confounder of the X, Y relationship if α1 ≠ β1
Look at Confidence Intervals
- Without smoking:
  OR = e^0.79 = 2.2
- 95% CI for log(OR): 0.79 ± 1.96(0.33) = (0.13, 1.44)
- 95% CI for OR: (e^0.13, e^1.44) = (1.14, 4.22)
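The CI construction above — build the interval on the log-OR scale, then exponentiate the endpoints — can be sketched as:

```python
import math

def or_ci(beta, se, z=1.96):
    """Point estimate and 95% CI for an OR, from a log-odds
    coefficient and its standard error."""
    lo, hi = beta - z * se, beta + z * se            # CI on the log-OR scale
    return math.exp(beta), (math.exp(lo), math.exp(hi))

# Coffee coefficient without smoking in the model: 0.79 (SE 0.33)
or_hat, (ci_lo, ci_hi) = or_ci(0.79, 0.33)
```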
Look at Confidence Intervals
- With smoking (adjusting for smoking):
  OR = e^0.53 = 1.7
- 95% CI for log(OR): 0.53 ± 1.96(0.35) = (-0.17, 1.22)
- 95% CI for OR: (e^-0.17, e^1.22) = (0.85, 3.39)
Conclusion
- So, ignoring smoking, the CHD-coffee OR is 2.2 (95% CI: 1.14 - 4.22)
- Adjusting for smoking gives more modest evidence for a coffee effect
- In this case-control study, smoking is a weak-to-moderate confounder of the coffee-CHD association
Interaction Model
Model 3

Variable          Est     se      z
Intercept        -1.0    .30    -3.4
Coffee            .69    .45     1.5
Smoking           1.3    .55     2.4
Coffee*Smoking   -.43    .73    -.59
Testing Interaction Term
- Z = -0.59, p-value = 0.554
- 95% confidence interval for exp{β1 + β3}: (0.42, 3.99)
- Both of the above suggest that there is little evidence that smoking is an effect modifier!
Likelihood Ratio Test
- The Likelihood Ratio Test will help decide whether or not additional term(s) "significantly" improve the model fit
- The Likelihood Ratio Test (LRT) statistic for comparing nested models is:
  - -2 times the difference between the log likelihoods (LLs) for the null vs. extended models
  - the approach is analogous to the F test from an analysis of variance for linear regression models
Likelihood Ratio Test
Deviance is a term used for the difference in -2·log likelihood relative to the best possible value, from a perfectly predicting model. The change in deviance is the same as the change in -2LL.
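The LRT statistic is a one-line computation. A sketch with a hypothetical extended-model log likelihood (the null value -46.99 is taken from the logistic output earlier; -44.50 is made up for illustration):

```python
def lrt_stat(ll_null, ll_extended):
    """LRT statistic: -2 * (LL_null - LL_extended).
    Compare to a chi-squared critical value with df = number of
    added parameters (3.84 for df = 1 at alpha = 0.05)."""
    return -2 * (ll_null - ll_extended)

x2 = lrt_stat(ll_null=-46.99, ll_extended=-44.50)
# 4.98 > 3.84, so the added term significantly improves the fit
```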
LRT Example
Model comparisons using likelihood ratio test
Summary: Unadjusted ORs
- The odds of CHD were estimated to be 3.4 times higher among smokers compared to non-smokers
  - 95% CI: (1.7, 7.9)
- The odds of CHD were estimated to be 2.2 times higher among coffee drinkers compared to non-coffee drinkers
  - 95% CI: (1.1, 4.3)
Summary: Adjusted ORs
- Controlling for the potential confounding of smoking, the coffee odds ratio was estimated to be 1.7, with 95% CI: (0.85, 3.4).
- Hence, the evidence in these data is insufficient to conclude that coffee has an independent effect on CHD beyond that of smoking.
Comparing the models
- Models C and F are both nested in Model A
- Models C and F cannot be directly compared to one another, but we can see which has a smaller p-value when compared to Model A
  - C vs. A: X² = 26.5 with 2 df
  - F vs. A: X² = 21.7 with 3 df
What next?
- Model C improves prediction beyond gender alone (Model A) more than Model F does.
- Model C should be the next parent model, and we should test the new variables in Model F to see if they continue to improve prediction within the context of Model C.
- When a tentative final model is identified, the assumptions of logistic regression should be checked.
Flexibility in linear models
- A spline allows the "slope" for a continuous predictor to change at a given point; the coefficient is for the difference in log odds ratio
- An interaction term allows the odds ratio for one variable to differ by the value of a second variable; the coefficient is for the difference in log odds ratio
Poisson regression model
- Log-linear model for the mean rate:
  log( E[Yi] ) = β0 + β1X1i + ... + βpXpi
  where p is the number of predictors in the model
- Random component: Yi ~ Poisson(λi)
- Here: λi = E[Yi]
Exponentiating Poisson regression models
Interpreting Poisson regression parameters
Modelling rates
- Of key interest in Poisson regression models is to make inference about rates of events
- We are often interested in whether the rate of cancer, or some other disease, varies by population subgroups such as gender, race, or age
Person-years
- In defining rates, it is crucial to state what denominator we have in mind
- For disease, we are usually interested in the disease rate per person, per year
- If the HIV incidence rate is 5 per 1 million person-years, that means we expect to see 5 new cases of HIV per 1 million persons per year
Modelling Danish Cancer cases with an offset
- We observed Danish cancer cases in 6 age groups over a period of 4 years
- The model predicts log rates per 10,000 person-years
Interpretation of coefficients
More about offsets
- The purpose of an offset is to specify the denominator of the predicted rates
- We should always try to use an offset if we suspect the underlying population sizes vary for the observed counts
- Typically, we'll use log(N) as the offset, where N is the sample size or number of person-years generating each count
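A sketch of how the offset enters the prediction; the coefficient values here are hypothetical, chosen only to show the mechanics:

```python
import math

def expected_count(person_years, b0, b1, x):
    """Predicted count from a Poisson rate model with offset log(person-years):
    log(expected count) = log(person_years) + b0 + b1*x."""
    return math.exp(math.log(person_years) + b0 + b1 * x)

# The rate per person-year is exp(b0 + b1*x); counts scale with exposure.
n = expected_count(person_years=10_000, b0=-7.0, b1=0.5, x=1)
```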
Poisson regression for cohort studies
- Log-linear regression can be used to estimate relative risks for cohort studies (but not case-control studies)
- Relative risk is like a relative rate, but we are comparing risks (probability of disease) instead of rates (expected cases per person-year) across groups
- We could also estimate the relative risk by transforming results from logistic regression
Grand summary
- Exploratory analysis includes graphs and tables - good for getting a feel for the data
- Confirmatory analysis is useful for making definitive conclusions
- Linear models provide us with a framework in which to perform confirmatory analysis in many settings
Grand summary: linear models
- Linear regression: for continuous (normal) outcomes
- Logistic regression: for binary outcomes
- Poisson regression: for counts
Grand summary: modelling
- In all generalized linear models, we can use the following tools to make models more flexible:
  - adjust for confounders using additive covariates
  - allow effect modification through interaction terms
  - fit curved and bent lines through polynomials and splines
Grand summary: testing
- We can test the significance of a single predictor using a z test (or t test for linear regression)
- Test the significance of several covariates using a pair of nested models with a likelihood ratio test
- Know how to interpret p-values and confidence intervals