Lecture 18: Review Lecture
Ani Manichaikul
amanicha@jhsph.edu
15 May 2007
Types of Biostatistics
- 1) Descriptive Statistics
  - Exploratory Data Analysis: often not in the literature
  - Summaries: "Table 1" in a paper
  - Goal: visualize relationships, generate hypotheses
Types of Biostatistics
- 2) Inferential Statistics
  - Confirmatory Data Analysis: the Methods section of a paper
  - Goal: quantify relationships, test hypotheses
Approach to Modeling
A general approach for most statistical modeling is to:
- Define the population of interest
- State the scientific questions & underlying theories
- Describe and explore the observed data
- Define the model
  - Probability part (models the randomness / noise)
  - Systematic part (models the expectation / signal)
Approach to Modeling
- Estimate the parameters in the model
- Fit the model to the observed data
- Make inferences about covariates
- Check the validity of the model
  - Verify the model assumptions
  - Re-define, re-fit, and re-check the model if necessary
- Interpret the results of the analysis in terms of the scientific questions of interest
Stem-and-Leaf Plots
- Age in years (10 observations): 25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Age Interval    Observations (leaves)
20-29           5 6 9
30-39           2 5 6 8
40-49           4 9
50-59           1
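The grouping above can be sketched in a few lines of Python (a minimal illustration, not part of the original slides): split each age into a stem (tens digit) and leaf (ones digit).

```python
def stem_and_leaf(values):
    """Group sorted values into stems (tens digit) and leaves (ones digit)."""
    stems = {}
    for v in sorted(values):
        stems.setdefault(v // 10, []).append(v % 10)
    return stems

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
plot = stem_and_leaf(ages)
# plot[2] == [5, 6, 9]; plot[3] == [2, 5, 6, 8]; plot[4] == [4, 9]; plot[5] == [1]
```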
Grouping: Frequency Distribution Tables
- Shows the number of observations for each range of data
- Intervals can be chosen in ways similar to stem-and-leaf displays

Age Interval    Frequency
20-29           3
30-39           4
40-49           2
50-59           1
Histograms
- Pictures of the frequency or relative frequency distribution

[Figure: Histogram of Age — Frequency by Age Category]
Box-and-Whisker Plots
[Figure: Box Plot of Age — Age in Years, roughly 25 to 50]

- IQR = 44 - 29 = 15
- Upper Fence = 44 + 15 * 1.5 = 66.5
- Lower Fence = 29 - 15 * 1.5 = 6.5
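The fence arithmetic above is easy to check with a short sketch (quartile values 29 and 44 taken from the slide):

```python
# Quartiles from the box plot slide: Q1 = 29, Q3 = 44.
q1, q3 = 29, 44
iqr = q3 - q1                      # interquartile range: 15
upper_fence = q3 + 1.5 * iqr       # points above this are outliers: 66.5
lower_fence = q1 - 1.5 * iqr       # points below this are outliers: 6.5
```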
2 Continuous Variables
- Scatterplot
- Scatterplots visually display the relationship between two continuous variables

[Figure: Scatterplot of Age (years, 25-50) by Height (cm, 150-190)]
Why is the power of a test important?
- Power indicates the chance of finding a "significant" difference when there really is one
- Low power: likely to obtain non-significant results even when real differences exist
- High power is desirable!
- Low power is usually caused by small sample size
We’re not always right
Errors in Hypothesis Testing α
- Aim: keep the Type I error small by specifying a small rejection region
- α is set before performing a test, usually at 0.05
Errors in Hypothesis Testing β
- Aim: keep the Type II error small and thus power high

β: Probability of Type II Error

- The value of β is usually unknown, since it depends on a specified alternative value
- β depends on sample size and α
- Before data collection, scientists decide:
  - the test they will perform
  - α
  - the desired β
- They use this information to choose the sample size
P-Values
- Definition: the p-value for a hypothesis test is the probability of obtaining by chance alone, when H0 is true, a value of the test statistic as extreme as or more extreme than (in the appropriate direction) the one actually observed.
Steps of Hypothesis Testing
- Define the null hypothesis, H0
- Define the alternative hypothesis, Ha, where Ha is usually of the form "not H0"
- Define the Type I error, α, usually 0.05
- Calculate the test statistic
- Calculate the p-value
- If the p-value is less than α, reject H0; otherwise, fail to reject H0
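The steps above can be sketched as code. This is a minimal illustration for a two-sided one-sample z test (σ known), with entirely made-up numbers:

```python
import math

def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z test, following the steps above."""
    se = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / se                # test statistic
    p = math.erfc(abs(z) / math.sqrt(2))        # two-sided p-value from N(0,1)
    return z, p, p < alpha                      # reject H0 if p < alpha

# Hypothetical example: H0: mu = 50 vs Ha: mu != 50
z, p, reject = z_test(sample_mean=53.0, mu0=50.0, sigma=10.0, n=64)
# z = 2.4; p is about 0.016, so H0 is rejected at alpha = 0.05
```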
Why use linear regression?
- Linear regression is very powerful. It can be used for many things:
  - Binary X
  - Continuous X
  - Categorical X
  - Adjustment for confounding
  - Interaction
  - Curved relationships between X and Y
SLR: Y = β0 + β1X1

- Linear regression is used for continuous outcome variables
- β0: mean outcome when X = 0 (center!)
- Binary X = "dummy variable" for group
  - β1: mean difference in outcome between groups
- Continuous X
  - β1: mean difference in outcome corresponding to a 1-unit increase in X
  - Center X to give meaning to β0
- Test β1 = 0 in the population
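For concreteness, the least-squares estimates of β0 and β1 can be computed by hand. A short sketch with made-up data (not from the lecture):

```python
def slr_fit(x, y):
    """Least-squares estimates for Y = b0 + b1*X."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx                # slope: covariance over variance of X
    b0 = ybar - b1 * xbar         # intercept: line passes through the means
    return b0, b1

# Made-up data lying exactly on y = 2 + 3x
b0, b1 = slr_fit([0, 1, 2, 3], [2, 5, 8, 11])   # b0 = 2.0, b1 = 3.0
```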
Assumptions of Linear Regression
- L: Linear relationship
- I: Independent observations
- N: Normally distributed around the line
- E: Equal variance across X's
In Simple Linear Regression
- In simple linear regression (SLR):
  - One predictor / covariate / explanatory variable: X
- In multiple linear regression (MLR):
  - Same assumptions as SLR (i.e., L.I.N.E.), but:
  - More than one covariate: X1, X2, X3, ..., Xp
Model:
- Y ~ N(µ, σ²)
- µ = E(Y | X) = β0 + β1X1 + β2X2 + β3X3 + ... + βpXp
Regression Methods
Nested models
- One model is nested within another if the parent model contains one set of variables and the extended model contains all of the original variables plus one or more additional variables.
Difference in assessing variables: “nested models”
- other predictor(s)
  - assess with a t test if a single variable defines the predictor
  - assess with an F test (today) if two or more variables are needed to define the predictor
- potential confounder(s)
  - compare the CI of the primary predictor to see whether the new parameter is significantly different
The F test
F_obs = [(RSS_smaller - RSS_larger) / (# added variables)] / [RSS_larger / (residual df of larger model)]

F_obs = [(69.6 - 49.8) / 2] / [49.8 / 22] ≈ 4.4

What is F_cr?
H0: all new β's = 0 in the population
HA: at least one new β is not 0 in the population
The F test: notes
- The F test can be used to compare any two nested models
- If only one variable is added, it's easier to compare the models using the t test for that variable
  - t² = F if one variable is added
- For any regression, the estimated variance of the residuals is RSS / (residual df)
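The F calculation from the example slide can be verified directly (RSS values 69.6 and 49.8, 2 added variables, 22 residual df, as on the slide):

```python
def f_stat(rss_smaller, rss_larger, n_added, df_resid_larger):
    """F statistic comparing two nested models (larger model has smaller RSS)."""
    return ((rss_smaller - rss_larger) / n_added) / (rss_larger / df_resid_larger)

f = f_stat(69.6, 49.8, n_added=2, df_resid_larger=22)
# rounds to the slide's value of 4.4
```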
Nested Models
- Comparing nested models
  - 1 new variable: use the t test for that variable
  - 2+ new variables: use the F test
- Categorical predictor
  - set one group as the reference
  - create a dummy variable for each other group
  - include/exclude all dummy variables together
  - evaluate the categorical predictor with an F test
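The dummy-variable coding described above can be sketched in a few lines; the `race` variable and its levels here are a hypothetical example, not data from the lecture:

```python
def dummy_code(values, reference):
    """One 0/1 dummy column per non-reference category;
    the reference group is coded 0 in every column."""
    levels = [v for v in sorted(set(values)) if v != reference]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

race = ["white", "black", "other", "white", "black"]
dummies = dummy_code(race, reference="white")
# dummies["black"] == [0, 1, 0, 0, 1]; dummies["other"] == [0, 0, 1, 0, 0]
```

In a regression, all of these columns are included or excluded together, and the categorical predictor as a whole is evaluated with an F test.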
Effect Modification
- In linear regression, effect modification is a way of allowing the association between the primary predictor and the outcome to change with the level of another predictor.
- If the third predictor is binary, this results in a graph in which the two lines (for the two groups) are no longer parallel.
Splines and Quadratic Terms
- Splines are used to allow the regression line to bend
  - the breakpoint is arbitrary and decided graphically or by hypothesis
  - the actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e., the change in slope)
- A quadratic term allows for curvature in the model
Logistic regression
- For binary outcomes
- Model the log odds of the outcome probability, which we also call the logit
- The baseline term is interpreted as a log odds
- The other coefficients are log odds ratios
Logistic regression model
log( odds(Relief | Tx) ) = log[ P(relief | Tx) / P(no relief | Tx) ] = β0 + β1·Tx

where: Tx = 1 if Drug, 0 if Placebo
Then…
- log( odds(Relief|Drug) ) = β0 + β1
- log( odds(Relief|Placebo) ) = β0
- log( odds(R|D) ) - log( odds(R|P) ) = β1

And…

- Thus: log[ odds(R|D) / odds(R|P) ] = β1
- And: OR = exp(β1) = e^β1 !!
- So: exp(β1) = the odds ratio of relief for patients taking the Drug vs. patients taking the Placebo.
Logistic Regression
Logit estimates                           Number of obs = 70
                                          LR chi2(1)    = 2.83
                                          Prob > chi2   = 0.0926
Log likelihood = -46.99169                Pseudo R2     = 0.0292

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
    drug |   .8137752   .4889211    1.66   0.096    -.1444926    1.772043
   _cons |  -.2876821    .341565   -0.84   0.400    -.9571372    .3817731
------------------------------------------------------------------------------

Estimates:
log( odds(relief) ) = β̂0 + β̂1·Drug = -0.288 + 0.814(Drug)

Therefore: OR = exp(0.814) = 2.26 !
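The estimated odds ratio follows directly from the fitted coefficients shown above; a quick check in Python:

```python
import math

# Coefficient estimates from the logistic output above: logit = b0 + b1*drug
b0, b1 = -0.2876821, 0.8137752

odds_placebo = math.exp(b0)        # odds of relief in the placebo group
odds_drug = math.exp(b0 + b1)      # odds of relief in the drug group
OR = odds_drug / odds_placebo      # equals exp(b1), about 2.26
```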
Adding other variables
- What if Pr(relief) = a function of Drug vs. Placebo AND Age?
- We could easily include age in a model such as:
  log( odds(relief) ) = β0 + β1·Drug + β2·Age
Logistic Regression
- As in MLR, we can include many additional covariates.
- For a logistic regression model with p predictors:
  log( odds(Y=1) ) = β0 + β1X1 + ... + βpXp
  where: odds(Y=1) = Pr(Y=1) / [1 - Pr(Y=1)] = Pr(Y=1) / Pr(Y=0)
Types of interpretation
- β0 + β1 = ln(odds) (for X = 1)
- β1 = difference in log odds
- e^(β0+β1) = odds (for X = 1)
- e^β1 = odds ratio
- But we started with P(Y = 1). Can we find that?
More useful math
- odds = probability / (1 - probability)
- probability = odds / (1 + odds)
- so the probability for X = 1 is:
  p = e^(β0+β1) / (1 + e^(β0+β1))
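These conversions between odds and probability are one-liners; a sketch applying them to the drug example's estimates (-0.288 and 0.814, taken from the fitted model earlier):

```python
import math

def logit_to_prob(log_odds):
    """p = e^L / (1 + e^L), where L is the log odds."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

def prob_to_odds(p):
    """odds = p / (1 - p)."""
    return p / (1 - p)

p_drug = logit_to_prob(-0.288 + 0.814)   # estimated P(relief | Drug), about 0.63
```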
Nested models
- Adding a single new variable to the model
  - null model: ln[ p / (1 - p) ] = β0 + β1(Age - 30)
  - full model: ln[ p / (1 - p) ] = β0 + β1(Age - 30) + β2·Multivitamin
Comparing nested models that differ by one variable
- Compare models with a p-value or CI
- What method is this?
  - The Wald test, a test that applies the CLT, like the Z test comparing proportions in a 2x2 table
  - analogous to the t test for linear regression
- H0: the new variable is not needed
  - or H0: βnew = 0 in the population
Conclusion from the Wald test
- The p-value for multivitamin is 0.007 (< 0.05), and the CI for the multivitamin coefficient does not include 0 (the CI for the OR doesn't include 1)
- Reject H0
- Conclude that the larger model is better: after adjusting for age, multivitamin use is still an important predictor of physician visits in the population
Interpretation - log odds
- β0: the log odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
- β1: the log odds ratio of not visiting a physician for a one-year increase in age, controlling for multivitamin use
- β2: the log odds ratio of not visiting a physician for those who take multivitamins compared with those who do not, adjusting for age
Interpretation – odds and odds ratio

- exp{β0}: the odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
Interpretation – odds and odds ratio

- exp{β1}: after adjusting for multivitamin use, the odds ratio of not visiting a physician changes by a factor of exp{β1} = 1.001 for each additional year of age
- additional age is associated with lower frequency of physician visits in these students, but the association is not statistically significant (p > 0.05)
Interpretation – odds and odds ratio

- exp{β2}: the odds ratio of not visiting a physician for those who take multivitamins compared with those who do not is exp{β2} = 0.46, adjusting for age
- taking multivitamins is associated with regular physician visits (p = 0.007)
Interpretation In General
- Also: log[ odds(Y=1 | X1+1, X2) / odds(Y=1 | X1, X2) ] = β1
- And: OR = exp(β1) !!
- exp(β1) is the multiplicative change in odds for a 1-unit increase in X1, provided X2 is held constant.
- The result is similar for X2
CHD by smoking and coffee
- Yi = 1 if CHD case, 0 if control
- COFi = 1 if coffee drinker, 0 if not
- SMKi = 1 if smoker, 0 if not
- pi = Pr(Yi = 1)
- ni = number observed at pattern i of the X's
Logistic Regression Model
- Yi are from a Binomial(ni, pi) distribution
- Yi are independent
- log odds(Yi = 1) (or, logit(Yi = 1)) is a function of:
  - coffee
  - smoking
  - and the coffee x smoking interaction
Logistic Regression Model
- Which implies that Pr(Yi = 1) is the logistic function:

  pi = e^(β0 + β1·COFi + β2·SMKi + β3·COFi·SMKi) / [1 + e^(β0 + β1·COFi + β2·SMKi + β3·COFi·SMKi)]

  log[ pi / (1 - pi) ] = β0 + β1·COFi + β2·SMKi + β3·COFi·SMKi
Interpretations
- exp{β1}: odds ratio of being a CHD case for coffee drinkers vs. non-drinkers among non-smokers
- exp{β1 + β3}: odds ratio of being a CHD case for coffee drinkers vs. non-drinkers among smokers
Interpretations
- exp{β2}: odds ratio of being a CHD case for smokers vs. non-smokers among non-coffee drinkers
- exp{β2 + β3}: odds ratio of being a CHD case for smokers vs. non-smokers among coffee drinkers
Interpretations
- e^β0 / (1 + e^β0): the fraction of cases among non-smoking, non-coffee-drinking individuals in the sample (determined by the sampling plan)
- exp{β3}: ratio of odds ratios
exp{β3} Interpretations

- exp{β3}: the factor by which the odds ratio of being a CHD case for coffee drinkers vs. non-drinkers is multiplied for smokers as compared to non-smokers

or

- exp{β3}: the factor by which the odds ratio of being a CHD case for smokers vs. non-smokers is multiplied for coffee drinkers as compared to non-coffee drinkers
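These relationships can be checked numerically. A sketch using the coefficient estimates reported later for the interaction model (Coffee 0.69, Smoking 1.3, Coffee*Smoking -0.43):

```python
import math

# Estimates for: logit(p) = b0 + b1*COF + b2*SMK + b3*COF*SMK
b1, b2, b3 = 0.69, 1.3, -0.43

or_coffee_nonsmokers = math.exp(b1)      # coffee OR among non-smokers
or_coffee_smokers = math.exp(b1 + b3)    # coffee OR among smokers
ratio_of_ors = math.exp(b3)              # how the coffee OR changes with smoking

# or_coffee_smokers == or_coffee_nonsmokers * ratio_of_ors
```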
Some Special Cases
- Given: log[ Pr(Y=1) / Pr(Y=0) ] = β0 + β1·COF + β2·SMK + β3·COF*SMK
- If β1 = β2 = β3 = 0: neither smoking nor coffee drinking is associated with increased risk of CHD

Some Special Cases

- Given the same model, if β1 = β3 = 0: smoking, but not coffee drinking, is associated with increased risk of CHD
Some Special Cases
- If β3 = 0:
  - Smoking and coffee drinking are both associated with risk of CHD, but the odds ratio of CHD-smoking is the same at all levels of coffee
  - Smoking and coffee drinking are both associated with risk of CHD, but the odds ratio of CHD-coffee is the same at all levels of smoking
Confounding
- In epidemiological terms, Z is a "confounder" of the relationship of Y with X if Z is related to both X and Y, and Z is not in the causal pathway between X and Y
- In statistical terms, Z is a "confounder" of the relationship of Y with X if the X coefficient changes when Z is added to a regression of Y on X
Confounding
- For example, consider the two models:
  Y = α0 + α1X + ε1
  Y = β0 + β1X + β2Z + ε2
- then Z is a confounder of the X, Y relationship if α1 ≠ β1
Look at Confidence Intervals
- Without smoking:
  OR = e^0.79 = 2.2
- 95% CI for log(OR): 0.79 ± 1.96(0.33) = (0.13, 1.44)
- 95% CI for OR: (e^0.13, e^1.44) = (1.14, 4.22)
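The CI construction above — build the interval on the log-OR scale, then exponentiate the endpoints — can be sketched as:

```python
import math

def or_ci(beta, se, z=1.96):
    """Point estimate and 95% CI for an OR, from a log-odds
    coefficient and its standard error."""
    lo, hi = beta - z * se, beta + z * se            # CI on the log-OR scale
    return math.exp(beta), (math.exp(lo), math.exp(hi))

# Coffee coefficient without smoking in the model: 0.79 (SE 0.33)
or_hat, (ci_lo, ci_hi) = or_ci(0.79, 0.33)
```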
Look at Confidence Intervals
- With smoking (adjusting for smoking):
  OR = e^0.53 = 1.7
- 95% CI for log(OR): 0.53 ± 1.96(0.35) = (-0.17, 1.22)
- 95% CI for OR: (e^-0.17, e^1.22) = (0.85, 3.39)
Conclusion
- So, ignoring smoking, the CHD-coffee OR is 2.2 (95% CI: 1.14 - 4.22)
- Adjusting for smoking gives more modest evidence for a coffee effect
- In this case-control study, smoking is a weak-to-moderate confounder of the coffee-CHD association
Interaction Model
Model 3

Variable          Est     se      z
Intercept        -1.0    .30    -3.4
Coffee            .69    .45     1.5
Smoking           1.3    .55     2.4
Coffee*Smoking   -.43    .73    -.59
Testing Interaction Term
- Z = -0.59, p-value = 0.554
- 95% confidence interval for exp{β1 + β3}: (0.42, 3.99)
- Both of the above suggest that there is little evidence that smoking is an effect modifier!
Likelihood Ratio Test
- The Likelihood Ratio Test will help decide whether or not additional term(s) "significantly" improve the model fit
- The Likelihood Ratio Test (LRT) statistic for comparing nested models is:
  - -2 times the difference between the log likelihoods (LLs) for the null vs. extended models
  - the approach is analogous to the F test from an analysis of variance for linear regression models
Likelihood Ratio Test
Deviance is a term used for the difference in -2·log likelihood relative to the best possible value, from a perfectly predicting model. The change in deviance is the same as the change in -2LL.
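The LRT statistic is a one-line computation. A sketch with a hypothetical extended-model log likelihood (the null value -46.99 is taken from the logistic output earlier; -44.50 is made up for illustration):

```python
def lrt_stat(ll_null, ll_extended):
    """LRT statistic: -2 * (LL_null - LL_extended).
    Compare to a chi-squared critical value with df = number of
    added parameters (3.84 for df = 1 at alpha = 0.05)."""
    return -2 * (ll_null - ll_extended)

x2 = lrt_stat(ll_null=-46.99, ll_extended=-44.50)
# 4.98 > 3.84, so the added term significantly improves the fit
```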
LRT Example
Model comparisons using likelihood ratio test
Summary: Unadjusted ORs
- The odds of CHD were estimated to be 3.4 times higher among smokers compared to non-smokers
  - 95% CI: (1.7, 7.9)
- The odds of CHD were estimated to be 2.2 times higher among coffee drinkers compared to non-coffee drinkers
  - 95% CI: (1.1, 4.3)
Summary: Adjusted ORs
- Controlling for the potential confounding of smoking, the coffee odds ratio was estimated to be 1.7, with 95% CI: (0.85, 3.4).
- Hence, the evidence in these data is insufficient to conclude that coffee has an independent effect on CHD beyond that of smoking.
Comparing the models
- Models C and F are both nested in Model A
- Models C and F cannot be directly compared to one another, but we can see which has a smaller p-value when compared to Model A
  - C vs. A: X² = 26.5 with 2 df
  - F vs. A: X² = 21.7 with 3 df
What next?
- Model C improves prediction beyond gender alone (Model A) more than Model F does.
- Model C should be the next parent model, and we should test the new variables in Model F to see if they continue to improve prediction within the context of Model C.
- When a tentative final model is identified, the assumptions of logistic regression should be checked.
Flexibility in linear models
- A spline allows the "slope" for a continuous predictor to change at a given point; the coefficient is for the difference in log odds ratio
- An interaction term allows the odds ratio for one variable to differ by the value of a second variable; the coefficient is for the difference in log odds ratio
Poisson regression model
- Log-linear model for the mean rate:
  log( E[Yi] ) = β0 + β1X1i + ... + βpXpi
  where p is the number of predictors in the model
- Random component: Yi ~ Poisson(λi)
- Here: λi = E[Yi]
Exponentiating Poisson regression models
Interpreting Poisson regression parameters
Modelling rates
- Of key interest in Poisson regression models is to make inference about rates of events
- We are often interested in whether the rate of cancer, or some other disease, varies by population subgroups such as gender, race, or age
Person-years
- In defining rates, it is crucial to state what denominator we have in mind
- For disease, we are usually interested in the disease rate per person, per year
- If the HIV incidence rate is 5 per 1 million person-years, that means we expect to see 5 new cases of HIV per 1 million persons per year
Modelling Danish Cancer cases with an offset
- We observed Danish cancer cases in 6 age groups over a period of 4 years
- The model predicts log rates per 10,000 person-years
Interpretation of coefficients
More about offsets
- The purpose of an offset is to specify the denominator of the predicted rates
- We should always try to use an offset if we suspect the underlying population sizes vary for the observed counts
- Typically, we'll use log(N) as the offset, where N is the sample size or number of person-years generating each count
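A sketch of how the offset enters the prediction; the coefficient values here are hypothetical, chosen only to show the mechanics:

```python
import math

def expected_count(person_years, b0, b1, x):
    """Predicted count from a Poisson rate model with offset log(person-years):
    log(expected count) = log(person_years) + b0 + b1*x."""
    return math.exp(math.log(person_years) + b0 + b1 * x)

# The rate per person-year is exp(b0 + b1*x); counts scale with exposure.
n = expected_count(person_years=10_000, b0=-7.0, b1=0.5, x=1)
```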
Poisson regression for cohort studies
- Log-linear regression can be used to estimate relative risks for cohort studies (but not case-control studies)
- Relative risk is like a relative rate, but we are comparing risks (probability of disease) instead of rates (expected cases per person-year) across groups
- We could also estimate the relative risk by transforming results from logistic regression
Grand summary
- Exploratory analysis includes graphs and tables - good for getting a feel for the data
- Confirmatory analysis is useful for making definitive conclusions
- Linear models provide us with a framework in which to perform confirmatory analysis in many settings
Grand summary: linear models
- Linear regression: for continuous (normal) outcomes
- Logistic regression: for binary outcomes
- Poisson regression: for counts
Grand summary: modelling
- In all generalized linear models, we can use the following tools to make models more flexible:
  - adjust for confounders using additive covariates
  - allow effect modification through interaction terms
  - fit curved and bent lines through polynomials and splines
Grand summary: testing
- We can test the significance of a single predictor using a z test (or t test for linear regression)
- Test the significance of several covariates using a pair of nested models with a likelihood ratio test
- Know how to interpret p-values and confidence intervals