Quantitative response variable modeling
Dongmei Li
Department of Public Health Sciences, Office of Public Health Studies, University of Hawaii at Mānoa
Outline
- T-test
- ANOVA
- Correlation and simple linear regression
One-sample t test
One-Sample t Test: Example
Statement of the problem:
Do Sudden Infant Death Syndrome (SIDS) babies have
lower than average birth weights?
We know from prior research that the mean birth
weight of the non-SIDS babies in the population is 3300 grams.
We study n = 10 SIDS babies, determine their birth
weights, and calculate x-bar = 2975.5 and s = 737.3.
Do these data provide significant evidence that SIDS
babies have different birth weights than the rest of the population?
SIDS baby weights: 2229 2997 2314 3831 1788 2745 4151 2975 3463 3262
One-Sample t Test
- A. Hypotheses. H0: µ = µ0 vs. Ha: µ ≠ µ0 (two-sided) [or Ha: µ < µ0 (left-sided) or Ha: µ > µ0 (right-sided)]
- B. Test statistic.
$$t_{\text{stat}} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \quad \text{with } df = n - 1$$
- C. P-value. Convert tstat to P-value [software]. A small P is strong evidence against H0.
- D. Significance level α. Compare the P-value with α to decide whether to reject the null hypothesis.
One-Sample t Test: Example
- A. Hypotheses. H0: µ = 3300 versus Ha: µ ≠ 3300 (two-sided)
- B. Test statistic.
$$t_{\text{stat}} = \frac{\bar{x} - \mu_0}{SE_{\bar{x}}} = \frac{2975.5 - 3300}{737.3 / \sqrt{10}} = -1.39, \quad df = n - 1 = 10 - 1 = 9$$
- C. P = 0.1974. Weak evidence against H0.
- D. Data are not significant at α = .10. Fail to reject the null hypothesis.
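The SIDS example can be checked with a few lines of Python. This is only a sketch using the standard library: the helper `t_two_sided_p` is a hypothetical name introduced here, and it approximates the two-sided P-value by numerically integrating the t density (Simpson's rule) rather than calling a statistics package.

```python
import math
import statistics

def t_two_sided_p(t, df, steps=2000):
    """Two-sided P-value for a t statistic via Simpson's rule on the t density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = abs(t) / steps                      # steps must be even for Simpson's rule
    total = pdf(0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - 2 * (total * h / 3)          # 1 minus the central area between -|t| and |t|

# SIDS birth weights (grams) from the example slide
weights = [2229, 2997, 2314, 3831, 1788, 2745, 4151, 2975, 3463, 3262]
mu0 = 3300                                  # hypothesized mean under H0

n = len(weights)
xbar = statistics.mean(weights)
s = statistics.stdev(weights)               # sample SD (n - 1 in the denominator)
se = s / math.sqrt(n)                       # standard error of the mean
t_stat = (xbar - mu0) / se
df = n - 1
p_value = t_two_sided_p(t_stat, df)

print(f"xbar = {xbar:.1f}, s = {s:.1f}, SE = {se:.1f}")
print(f"t = {t_stat:.2f}, df = {df}, two-sided P = {p_value:.4f}")
```

The printed values should agree with the slide (t of about −1.39 on 9 df, P of about 0.197) up to rounding.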
Confidence Interval for µ
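- Typical “point estimate ± margin of error” formula:
$$(1 - \alpha)100\%\ \text{CI for } \mu = \bar{x} \pm t_{n-1,\,1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
- $t_{n-1,\,1-\alpha/2}$ is from the t table
- Alternative formula:
$$\bar{x} \pm t_{n-1,\,1-\alpha/2} \cdot SE_{\bar{x}} \quad \text{where } SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$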
Confidence Interval: Example 1
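Let us calculate a 95% confidence interval for μ for the birth weight of SIDS babies.
$$\bar{x} = 2975.5, \quad s = 737.3, \quad n = 10, \quad t_{10-1,\,1-.05/2} = t_{9,\,.975} = 2.262$$
$$95\%\ \text{CI for } \mu = \bar{x} \pm t_{n-1,\,1-\alpha/2} \cdot \frac{s}{\sqrt{n}} = 2975.5 \pm 2.262 \cdot \frac{737.3}{\sqrt{10}} = 2975.5 \pm 527.4 = (2448.1 \text{ to } 3502.9) \text{ grams}$$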
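As a rough check of the interval above, the same arithmetic can be sketched in Python. The critical value 2.262 is taken from the slide (t table, 9 df) rather than computed, and the variable names are illustrative only.

```python
import math
import statistics

# SIDS birth weights (grams) from the example slide
weights = [2229, 2997, 2314, 3831, 1788, 2745, 4151, 2975, 3463, 3262]

n = len(weights)
xbar = statistics.mean(weights)
s = statistics.stdev(weights)          # sample SD
t_crit = 2.262                         # t_{9, .975}, copied from the t table on the slide

margin = t_crit * s / math.sqrt(n)     # margin of error
lower, upper = xbar - margin, xbar + margin

print(f"95% CI for mu: {xbar:.1f} +/- {margin:.1f} = ({lower:.1f} to {upper:.1f}) grams")
```

Because 3300 lies inside this interval, the result is consistent with the two-sided test failing to reject H0 at α = .05.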
How to do it in Excel?
- Open the Presentation3_SIDS_BW.xlsx data set
- Use the AVERAGE function in Excel to calculate the mean
- Use the STDEV function in Excel to calculate the standard deviation
- Use the TDIST function to calculate the P-value
- Use the TINV function to get the critical value
How to do it in JMP?
- Open the Presentation3_SIDS_BW.jmp data set
- Analyze---Distribution---Put SIDS_BW into Y, Column box---click OK
- Click on the red arrow next to SIDS_BW---click on Test Mean---input 3300 for hypothesized mean---click OK
- Click Confidence Interval---95 to get the 95% confidence interval
Paired-sample t test
Paired Samples
- Paired samples: each point in one sample is matched to a unique point in the other sample
- Pairs can be achieved via sequential samples within individuals (e.g., pre-test/post-test), cross-over trials, and matching procedures
- Also called “matched-pairs” and “dependent samples”
Example: Paired Samples
A study uses a cross-over design to address whether oat bran reduces LDL cholesterol.
Subjects “cross over” from a cornflake diet to an oat bran diet.
- Half of the subjects start on CORNFLK, half on OATBRAN
- Two weeks on diet 1
- Measure LDL cholesterol
- Washout period
- Switch diet
- Two weeks on diet 2
- Measure LDL cholesterol
Example, Data
Subject  CORNFLK  OATBRAN
-------  -------  -------
1        4.61     3.84
2        6.42     5.57
3        5.40     5.85
4        4.54     4.80
5        3.98     3.68
6        3.82     2.96
7        5.01     4.41
8        4.34     3.72
9        3.80     3.49
10       4.56     3.84
11       5.35     5.26
12       3.89     3.73
Calculate Difference Variable “DELTA”
Step 1 is to create the difference variable “DELTA”.
- Let DELTA = CORNFLK - OATBRAN
- Order of subtraction does not materially affect results (but does change the sign of the differences)
- Positive values represent lower LDL on oat bran
Here are the first three observations:
ID  CORNFLK  OATBRAN  DELTA
--  -------  -------  -----
1   4.61     3.84      0.77
2   6.42     5.57      0.85
3   5.40     5.85     -0.45
↓   ↓        ↓         ↓
Explore DELTA Values
Here are all twelve paired differences (DELTAs):
0.77, 0.85, −0.45, −0.26, 0.30, 0.86, 0.60, 0.62, 0.31, 0.72, 0.09, 0.16
Stemplot:
|-0|42
|+0|0133
|+0|667788
×1
EDA shows a slight negative skew, a median of about 0.45, with results varying from −0.4 to 0.8.
Descriptive stats for DELTA
Data (DELTAs): 0.77, 0.85, −0.45, −0.26, 0.30, 0.86, 0.60, 0.62, 0.31, 0.72, 0.09, 0.16
$$n = 12, \quad \bar{x}_d = 0.3808, \quad s_d = 0.4335$$
The subscript d denotes statistics for the difference variable DELTA.
95% Confidence Interval for µd
A t procedure directed toward the DELTA variable calculates the confidence interval for the mean difference.
$$(1 - \alpha)100\%\ \text{CI for } \mu_d = \bar{x}_d \pm t_{n-1,\,1-\alpha/2} \cdot \frac{s_d}{\sqrt{n}}$$
“Oat bran” data:
For 95% confidence use $t_{12-1,\,1-.05/2} = t_{11,\,.975} = 2.201$ (from t table).
$$95\%\ \text{CI for } \mu_d = 0.3808 \pm 2.201 \cdot \frac{0.4335}{\sqrt{12}} = 0.3808 \pm 0.2754 = (0.105 \text{ to } 0.656)$$
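The DELTA calculation and its confidence interval can be sketched in Python from the CORNFLK and OATBRAN columns shown earlier. The critical value 2.201 is copied from the slide (t table, 11 df); variable names are illustrative.

```python
import math
import statistics

# LDL cholesterol for the 12 subjects on each diet (from the data slide)
cornflk = [4.61, 6.42, 5.40, 4.54, 3.98, 3.82, 5.01, 4.34, 3.80, 4.56, 5.35, 3.89]
oatbran = [3.84, 5.57, 5.85, 4.80, 3.68, 2.96, 4.41, 3.72, 3.49, 3.84, 5.26, 3.73]

# Step 1: within-subject differences, DELTA = CORNFLK - OATBRAN
delta = [c - o for c, o in zip(cornflk, oatbran)]

n = len(delta)
xbar_d = statistics.mean(delta)
s_d = statistics.stdev(delta)
t_crit = 2.201                          # t_{11, .975}, from the t table on the slide

margin = t_crit * s_d / math.sqrt(n)
lower, upper = xbar_d - margin, xbar_d + margin

print(f"n = {n}, mean DELTA = {xbar_d:.4f}, SD = {s_d:.4f}")
print(f"95% CI for mu_d: ({lower:.3f} to {upper:.3f})")
```

Once the differences are formed, the paired problem is handled exactly like a one-sample problem on the DELTA values.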
Paired t Test
- Similar to one-sample t test
- μ0 is usually set to 0, representing “no mean difference”, i.e., H0: μd = 0
- Test statistic:
$$t_{\text{stat}} = \frac{\bar{x}_d - \mu_0}{s_d / \sqrt{n}} \quad \text{with } df = n - 1$$
Paired t Test: Example
“Oat bran” data
- A. Hypotheses. H0: µd = 0 vs. Ha: µd ≠ 0
- B. Test statistic.
$$t_{\text{stat}} = \frac{\bar{x}_d - \mu_0}{s_d / \sqrt{n}} = \frac{0.38083 - 0}{0.4335 / \sqrt{12}} = 3.043, \quad df = n - 1 = 12 - 1 = 11$$
- C. P-value. P = 0.011 (via computer). The evidence against H0 is statistically significant.
- D. Significance level. The evidence against H0 is significant at α = .05 but is not significant at α = .01.
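A minimal sketch of the paired t test on the twelve DELTA values follows. It uses only the standard library; `t_two_sided_p` is a hypothetical helper defined here that approximates the P-value by Simpson's rule on the t density, standing in for the "via computer" step.

```python
import math
import statistics

def t_two_sided_p(t, df, steps=2000):
    """Two-sided P-value for a t statistic via Simpson's rule on the t density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = abs(t) / steps                      # steps must be even for Simpson's rule
    total = pdf(0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - 2 * (total * h / 3)

# Paired differences (CORNFLK - OATBRAN) listed on the slides
delta = [0.77, 0.85, -0.45, -0.26, 0.30, 0.86, 0.60, 0.62, 0.31, 0.72, 0.09, 0.16]

n = len(delta)
xbar_d = statistics.mean(delta)
s_d = statistics.stdev(delta)
t_stat = (xbar_d - 0) / (s_d / math.sqrt(n))   # mu_0 = 0: "no mean difference"
df = n - 1
p_value = t_two_sided_p(t_stat, df)

print(f"t = {t_stat:.3f}, df = {df}, two-sided P = {p_value:.4f}")
print("significant at .05:", p_value < 0.05, "| significant at .01:", p_value < 0.01)
```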
How to do it in Excel?
- Open the Presentation3_cornflk.xlsx data set
- Paired sample t test in Excel
Results and interpretation
P-value = 0.0112: there is evidence that the oat bran diet significantly lowers LDL cholesterol compared with the cornflake diet.
How to do it in JMP?
P-value = 0.0112: there is evidence that the oat bran diet significantly lowers LDL cholesterol compared with the cornflake diet. Notice that the 95% confidence interval for the difference does not include 0.
Conditions for Inference
t procedures require these conditions:
- SRS (individual observations or DELTAs)
- Valid information (no information bias)
- Normal population or large sample (central limit theorem)
The Normality Condition
The Normality condition applies to the sampling distribution of the mean, not the population. Therefore, it is OK to use t procedures when:
- The population is Normal
- The population is not Normal but is symmetrical and n is at least 5 to 10
- The population is skewed and n is at least 30 to 100 (depending on the extent of the skew)
Can t procedures be used?
- Dataset A is skewed and small: avoid t procedures
- Dataset B has a mild skew and is moderate in size: use t procedures
- Dataset C is highly skewed and is small: avoid t procedures
Two-independent sample t test
Example: Cholesterol and Type A & B Personality
Do fasting cholesterol levels differ in Type A and Type B personality men? Data (mg/dL) are a subset from the Western Collaborative Group Study*
Group 1 (Type A personality): 233, 291, 312, 250, 246, 197, 268, 224, 239, 239, 254, 276, 234, 181, 248, 252, 202, 218, 212, 325
Group 2 (Type B personality): 344, 185, 263, 246, 224, 212, 188, 250, 148, 169, 226, 175, 242, 252, 153, 183, 137, 202, 194, 213
Exploratory & Descriptive Methods
- Start with EDA
- Compare group shapes, locations and spreads
- Examples of applicable techniques: side-by-side stemplots (shown here), side-by-side boxplots (next slide)
Group 1 |  | Group 2
        |1t|3
        |1f|45
        |1s|67
     98 |1.|8889
    110 |2*|011
  33332 |2t|22
  55544 |2f|4455
     76 |2s|6
      9 |2.|
     21 |3*|
        |3t|
        |3f|4
(×100)
Side-by-Side Boxplots
[Figure: side-by-side boxplots of cholesterol (vertical axis 100 to 400) by GROUP (1 and 2); N = 20 in each group]
Interpretation:
- Location: group 1 > group 2
- Spreads: group 1 < group 2
- Shapes: both fairly symmetrical, outside values in each; no major departures from Normality
Summary Statistics
Group  n   mean    std dev
-----  --  ------  -------
1      20  245.05  36.64
2      20  210.30  48.34
Inference About Mean Difference (Notation)
Parameters (population)
- Group 1: N1, µ1, σ1
- Group 2: N2, µ2, σ2
Statistics (sample)
- Group 1: n1, $\bar{x}_1$, s1
- Group 2: n2, $\bar{x}_2$, s2
$\bar{x}_1 - \bar{x}_2$ is the point estimator of $\mu_1 - \mu_2$
Hypothesis Test
- A. Hypotheses. H0: μ1 = μ2 against Ha: μ1 ≠ μ2 (two-sided) [or Ha: μ1 > μ2 (right-sided) or Ha: μ1 < μ2 (left-sided)]
- B. Test statistic.
$$t_{\text{stat}} = \frac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}} \quad \text{where } SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \text{ and } df = df_{\text{Welch}}$$
- C. P-value. Convert the tstat to P-value with t table or software. Interpret.
- D. Significance level (optional). Compare P to the prespecified α level.
Hypothesis Test – Example
- A. Hypotheses. H0: μ1 = μ2 vs. Ha: μ1 ≠ μ2
- B. Test stat. In prior analyses we calculated sample mean difference = 34.75 mg/dL, SE = 13.563 and dfconserv = 19 (the smaller of n1 − 1 and n2 − 1).
$$t_{\text{stat}} = \frac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}} = \frac{34.75}{13.563} = 2.56 \quad \text{with } df = 19$$
- C. P-value. P = 0.019 → good evidence against H0 (“significant difference”).
- D. Significance level (optional). The evidence against H0 is significant at α = 0.05 but not at α = 0.01.
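The unequal-variance calculation can be sketched in Python from the raw cholesterol values. The Welch-Satterthwaite degrees of freedom computed at the end are an addition not shown on the slides (the slides use the conservative df of 19), so treat that line as an illustrative assumption about what software would report.

```python
import math
import statistics

# Fasting cholesterol (mg/dL) from the example slide
type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]

n1, n2 = len(type_a), len(type_b)
xbar1, xbar2 = statistics.mean(type_a), statistics.mean(type_b)
s1, s2 = statistics.stdev(type_a), statistics.stdev(type_b)

# Standard error of the mean difference without assuming equal variances
v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
se = math.sqrt(v1 + v2)
t_stat = (xbar1 - xbar2) / se

df_conserv = min(n1 - 1, n2 - 1)       # conservative df used on the slide
# Welch-Satterthwaite df (not on the slide; shown for comparison)
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(f"mean difference = {xbar1 - xbar2:.2f} mg/dL, SE = {se:.3f}")
print(f"t = {t_stat:.2f}, df_conserv = {df_conserv}, df_Welch = {df_welch:.1f}")
```

The conservative df is never larger than the Welch df, so it gives a P-value that is, if anything, slightly too large.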
Equal Variance t Procedure
- Also called pooled variance t procedure
- Not as robust as prior method, but…
- Historically important
- Calculated by software programs
- Leads to advanced ANOVA techniques
Pooled variance procedure
We start by calculating this pooled estimate of variance:
$$s^2_{\text{pooled}} = \frac{(df_1)(s_1^2) + (df_2)(s_2^2)}{df_1 + df_2} \quad \text{where } s_i^2 \text{ is the variance in group } i \text{ and } df_i = n_i - 1$$
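The pooled variance is used to calculate this standard error estimate:
$$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{s^2_{\text{pooled}} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$$
Confidence interval:
$$(\bar{x}_1 - \bar{x}_2) \pm (t_{df,\,1-\alpha/2})(SE_{\bar{x}_1 - \bar{x}_2})$$
Test statistic:
$$t_{\text{stat}} = \frac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}}$$
All with df = df1 + df2 = (n1−1) + (n2−1)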
Pooled Variance t Confidence Interval
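Data:
Group  ni  si     xbari
-----  --  -----  ------
1      20  36.64  245.05
2      20  48.34  210.30
$$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{1839.623 \left( \frac{1}{20} + \frac{1}{20} \right)} = 13.56, \quad df = (20 - 1) + (20 - 1) = 38$$
$$95\%\ \text{CI for } \mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm (t_{38,\,.975})(SE_{\bar{x}_1 - \bar{x}_2}) = (245.05 - 210.30) \pm (2.02)(13.56) = 34.75 \pm 27.39 = (7.36, 62.14)$$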
Pooled Variance t Test
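Data:
Group  ni  si     xbari
-----  --  -----  ------
1      20  36.64  245.05
2      20  48.34  210.30
$$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{1839.623 \left( \frac{1}{20} + \frac{1}{20} \right)} = 13.56, \quad df = (20 - 1) + (20 - 1) = 38$$
$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_a: \mu_1 \ne \mu_2$$
$$t_{\text{stat}} = \frac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}} = \frac{34.75}{13.56} = 2.56; \quad df = 38; \quad P = .014$$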
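The pooled-variance steps can be traced in Python from the summary statistics alone. This is a sketch under the equal-variance assumption; the critical value 2.02 is copied from the slide (t table, 38 df) rather than computed.

```python
import math

# Summary statistics from the slide: (n, mean, SD) for each group
n1, xbar1, s1 = 20, 245.05, 36.64
n2, xbar2, s2 = 20, 210.30, 48.34

df1, df2 = n1 - 1, n2 - 1
df = df1 + df2

# Pooled variance: df-weighted average of the two group variances
s2_pooled = (df1 * s1 ** 2 + df2 * s2 ** 2) / (df1 + df2)
se = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))

diff = xbar1 - xbar2
t_stat = diff / se

t_crit = 2.02                           # t_{38, .975}, from the t table on the slide
lower, upper = diff - t_crit * se, diff + t_crit * se

print(f"pooled variance = {s2_pooled:.3f}, SE = {se:.2f}, df = {df}")
print(f"t = {t_stat:.2f}; 95% CI for mu1 - mu2: ({lower:.2f}, {upper:.2f})")
```

With equal group sizes, as here, the pooled standard error coincides with the unequal-variance one; only the degrees of freedom differ.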
How to do it in Excel?
- Data set: Presentation3_FCL.xlsx
- First do Levene’s test to see whether the two groups have equal variances
- Levene’s test shows no significant difference in variance between groups
- Next use “t-Test: Two-Sample Assuming Equal Variances” for the test
- Click OK to get results
Excel results
p-value = 0.014: significant difference in fasting cholesterol levels between Type A personality subjects and Type B personality subjects.
How to do it in JMP?
- Data set: Presentation3_FCL.jmp
- Analyze---Fit Y by X
- Select Means/ANOVA/Pooled t for the equal variance t test
- Select t Test for the unequal variance t test
Results from JMP
p-value < 0.05: significant difference in fasting cholesterol levels between Type A personality subjects and Type B personality subjects.
Conditions for Inference
Conditions required for t procedures:
“Validity conditions”
- a. Good information (no information bias)
- b. Good sample (“no selection bias”)
- c. “No confounding”
“Sampling conditions”
- a. Independence
- b. Normal sampling distribution
ANOVA
Illustrative Example: Data
Pets as moderators of a stress response. This chapter follows the analysis of data from a study in which heart rates (bpm) of participants were monitored after being exposed to a psychological stressor. Participants were randomized to one of three groups:
- Group 1: monitored in the presence of a pet dog
- Group 2: monitored in the presence of a human friend
- Group 3: monitored with neither dog nor human friend present
Descriptive Statistics
Data are described and explored before moving to inferential
calculations
Here are summary statistics by group:
Side-by-Side Boxplots
Analysis of Variance
One-way ANalysis Of VAriance (ANOVA)
- Categorical explanatory variable
- Quantitative response variable
- Test group means for a significant difference
Statistical hypotheses
- H0: μ1 = μ2 = … = μk
- Ha: at least one of the μi differs
Method: compare variability between groups to variability within groups (F statistic)
Analysis of Variance, cont.
R. A. Fisher (1890-1962)
The F in the F statistic stands for “Fisher”
Mean Square Between: Graphically
Mean Square Between: Example
Mean Square Within: Graphically
Mean Square Within: Example
The F statistic and ANOVA table
- Data are arranged to form an ANOVA table
- F statistic is the ratio of the MSB to MSW
$$F_{\text{stat}} = \frac{MSB}{MSW} = \frac{1193.843}{84.793} = 14.08$$
- Fstat is a “signal-to-noise” ratio
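To make the between/within decomposition concrete, here is a small Python sketch of a one-way ANOVA. The raw heart-rate data are not reproduced in this text, so `toy_groups` is made-up illustrative data and `one_way_anova` is a hypothetical helper name; only the final ratio uses the MSB and MSW reported on the slide.

```python
import statistics

def one_way_anova(groups):
    """Return (MSB, MSW, F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: group means around the grand mean
    ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean
    ssw = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    df_b, df_w = k - 1, n_total - k
    msb, msw = ssb / df_b, ssw / df_w
    return msb, msw, msb / msw, df_b, df_w

# Hypothetical toy data (NOT the pet study data), three groups of three
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
msb, msw, f_stat, df_b, df_w = one_way_anova(toy_groups)
print(f"toy data: MSB = {msb:.2f}, MSW = {msw:.2f}, F = {f_stat:.2f}, df = ({df_b}, {df_w})")

# F statistic from the mean squares reported on the slide
slide_f = 1193.843 / 84.793
print(f"slide: F = {slide_f:.2f}")
```

A large F means the group means differ by more than the within-group noise would suggest.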
Fstat and P-value
- The Fstat has numerator and denominator degrees of freedom: df1 and df2 respectively (corresponding to dfB and dfW)
- Convert Fstat to P-value with a computer program
- The P-value corresponds to the area in the right tail beyond the Fstat
- For the illustrative data, P < 0.001
How to do one-way ANOVA in EXCEL?
Data set: Presentation3_pet.xlsx
Analysis results from Excel
One-way ANOVA shows a significant difference in mean heart rate among the three groups.
How to do one-way ANOVA in JMP?
- Presentation3_pet.jmp file
- Analyze---Fit Y by X
Analysis results from JMP
Pairwise comparisons from Tukey’s method show significant differences among all three groups.
Correlation and simple linear regression
Data type for correlation and regression
- Quantitative response variable Y (“dependent variable”)
- Quantitative explanatory variable X (“independent variable”)
- Historically important public health data set used to illustrate techniques (Doll, 1955)
- n = 11 countries
- Explanatory variable = per capita cigarette consumption in 1930 (CIG1930)
- Response variable = lung cancer mortality per 100,000 (LUNGCA)
Data, cont.
Scatterplot
Bivariate (xi, yi) points plotted as scatter plot.
Inspect the scatterplot’s:
- Form: Can the relation be described with a straight or some other type of line?
- Direction: Do points tend to trend upward or downward?
- Strength of association: Do points adhere closely to an imaginary trend line?
- Outliers (if any): Are there any striking deviations from the overall pattern?
Judging Correlational Strength
- Correlational strength refers to the degree to which points adhere to a trend line
- The eye is not a good judge of strength.
- The top plot appears to show a weaker correlation than the bottom plot. However, these are plots of the same data set. (The perception of a difference is an artifact of axes scaling.)
Correlation
Correlation coefficient r quantifies linear relationship with a number between −1 and 1.
When all points fall on a line with an upward slope, r = 1. When all data points fall on a line with a downward slope, r = −1
When data points trend upward, r is positive; when data points trend downward, r is negative.
The closer r is to 1 or −1, the stronger the correlation.
Examples of correlations
Calculating r (Pearson Correlation)
Formula:
$$r = \frac{1}{n - 1} \sum z_X z_Y \quad \text{where } z_X = \frac{x_i - \bar{x}}{s_X} \text{ and } z_Y = \frac{y_i - \bar{y}}{s_Y}$$
- Correlation coefficient tracks the degree to which X and Y “go together.”
- Recall that z scores quantify the amount a value lies above or below its mean in standard deviation units.
- When z scores for X and Y track in the same direction, their products are positive and r is positive (and vice versa).
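The z-score view of r can be sketched in a few lines of Python. The country-level data are not reproduced in this text, so the `xs` and `ys` below are made-up illustrative values and `pearson_r` is a hypothetical helper name.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    zx = [(x - mx) / sx for x in xs]    # z scores for X
    zy = [(y - my) / sy for y in ys]    # z scores for Y
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Hypothetical toy data (NOT the cigarette / lung cancer data)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_swapped = pearson_r(ys, xs)                        # roles of X and Y reversed
r_rescaled = pearson_r([100 * x for x in xs], ys)    # X measured in different units

print(f"r = {r:.4f}, swapped = {r_swapped:.4f}, rescaled = {r_rescaled:.4f}")
```

Because z scores are unit-free, swapping X and Y or changing the units of measure leaves r unchanged.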
Calculating r, Example
Scatter plot and r in Excel
- Data set: Presentation3_CIGLungCA.xlsx
- Scatter plot: select the data in columns B and C---Insert---Scatter Plot---Add x and y axis labels
- Correlation: Data analysis---correlation
[Figure: scatter plot “Lung cancer mortality vs. cigarette consumption”; x-axis CIG1930, y-axis lung cancer mortality]
Scatter plot and r in JMP
- Data set: Presentation3_CIGLungCA.jmp
- Scatter plot: Graph---Scatter plot matrix
Interpretation of r
1. Direction. The sign of r indicates the direction of the association: positive (r > 0), negative (r < 0), or no association (r ≈ 0).
2. Strength. The closer r is to 1 or −1, the stronger the association.
3. Coefficient of determination. The square of the correlation coefficient (r²) is called the coefficient of determination. This statistic quantifies the proportion of the variance in Y [mathematically] “explained” by X. For the illustrative data, r = 0.737 and r² = 0.54. Therefore, 54% of the variance in Y is explained by X.
Notes, cont.
4. Reversible relationship. With correlation, it does not matter whether variable X or Y is specified as the explanatory variable; calculations come out the same either way. [This will not be true for regression.]
5. Outliers. Outliers can have a profound effect on r. This figure has an r of 0.82 that is fully accounted for by the single outlier.
Notes, cont.
6. Linear relations only. Correlation applies only to linear relationships. This figure shows a strong non-linear relationship, yet r = 0.00.
7. Correlation does not necessarily mean causation. Beware lurking variables (next slide).
Confounded Correlation
A near perfect negative correlation (r = −.987) was seen between cholera mortality and elevation above sea level during a 19th century epidemic.
We now know that cholera is transmitted by water. The observed relationship between cholera and elevation was confounded by the lurking variable proximity to polluted water.
Hypothesis Test
We conduct the hypothesis test to guard against identifying too many random correlations. Random selection from a random scatter can result in an apparent correlation.
Hypothesis Test
- A. Hypotheses. Let ρ represent the population correlation coefficient. H0: ρ = 0 vs. Ha: ρ ≠ 0 (two-sided) [or Ha: ρ > 0 (right-sided) or Ha: ρ < 0 (left-sided)]
- B. Test statistic.
$$t_{\text{stat}} = \frac{r}{SE_r} \quad \text{where } SE_r = \sqrt{\frac{1 - r^2}{n - 2}} \text{ and } df = n - 2$$
- C. P-value. Convert tstat to P-value with software or t table.
Hypothesis Test – Illustrative Example
- A. Hypotheses. H0: ρ = 0 vs. Ha: ρ ≠ 0 (two-sided)
- B. Test stat.
$$SE_r = \sqrt{\frac{1 - 0.737^2}{11 - 2}} = 0.2253, \quad t_{\text{stat}} = \frac{0.737}{0.2253} = 3.27, \quad df = 11 - 2 = 9$$
- C. .005 < P < .01 by Table C. P = .0097 by computer. The evidence against H0 is highly significant.
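The test of ρ = 0 needs only r and n, so it can be sketched from the reported summary values. As before, `t_two_sided_p` is a hypothetical standard-library helper that approximates the P-value by Simpson's rule on the t density; it stands in for the "by computer" step.

```python
import math

def t_two_sided_p(t, df, steps=2000):
    """Two-sided P-value for a t statistic via Simpson's rule on the t density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = abs(t) / steps                      # steps must be even for Simpson's rule
    total = pdf(0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - 2 * (total * h / 3)

r, n = 0.737, 11                            # values reported for the illustrative data

se_r = math.sqrt((1 - r ** 2) / (n - 2))    # standard error of r under H0
t_stat = r / se_r
df = n - 2
p_value = t_two_sided_p(t_stat, df)

print(f"r^2 = {r ** 2:.2f}, SE_r = {se_r:.4f}")
print(f"t = {t_stat:.2f}, df = {df}, two-sided P = {p_value:.4f}")
```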
Exercise (True/False)
1. Correlation coefficient r quantifies the relationship between quantitative variables X and Y.
2. The closer r is to 1, the stronger the linear relation between X and Y.
3. If r is close to zero, X and Y are unrelated.
4. The value of r changes when the units of measure are changed.
Regression
- Regression describes the relationship in the data with a line that predicts the average change in Y per unit X.
- The best fitting line is found by minimizing the sum of squared residuals, as shown in this figure.
Regression Line, cont.
The regression line equation is:
$$\hat{y} = a + bx$$
where ŷ ≡ predicted value of Y, a ≡ the intercept of the line, and b ≡ the slope of the line
Equations to calculate a and b
SLOPE: $b = r \dfrac{s_Y}{s_X}$
INTERCEPT: $a = \bar{y} - b\bar{x}$
Regression Line, cont.
Slope b is the key statistic produced by the regression
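A least-squares line can be sketched in Python as follows. The country-level data are not reproduced in this text, so `xs` and `ys` are made-up illustrative values; the slope is computed as r·sY/sX, which is one standard way to write the least-squares slope and is assumed here rather than taken from the slides.

```python
import math
import statistics

# Hypothetical toy data (NOT the cigarette / lung cancer data)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

xbar, ybar = statistics.mean(xs), statistics.mean(ys)
sx, sy = statistics.stdev(xs), statistics.stdev(ys)

# Correlation from sums of squares and cross-products
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)
r = sxy / math.sqrt(sxx * syy)

b = r * sy / sx              # slope: average change in Y per unit X
a = ybar - b * xbar          # intercept: the line passes through (xbar, ybar)

predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

print(f"y-hat = {a:.2f} + {b:.2f} x")
print("residuals:", [round(e, 2) for e in residuals])
```

The residuals computed at the end are the quantities inspected later when assessing the conditions for inference.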
Calculate regression line in Excel
Data analysis---regression
Calculate regression line in JMP
Analyze---Fit Y by X
Conditions for Inference
Inference about the regression line requires these conditions:
- Linearity
- Independent observations
- Normality at each level of X
- Equal variance at each level of X
Conditions for Inference
This figure illustrates Normal and equal variation around the regression line at all levels of X
Assessing Conditions
- The scatterplot should be visually inspected for linearity, Normality, and equal variance
- Plotting the residuals from the model can be helpful in this regard.
- The table lists residuals for the illustrative data
Assessing Conditions, cont.
- A stemplot of the residuals shows no major departures from Normality
|-1|6
|-0|2336
| 0|01366
| 1|4
×10
- This residual plot shows more variability at higher X values (but the data are very sparse)
Residual Plots
With a little experience, you can get good at reading residual plots. Here’s an example of linearity with equal variance.
Residual Plots
Example of linearity with unequal variance
Example of Residual Plots
Example of non-linearity with equal variance