Lecture 8: F-Test for Nested Linear Models
Zhenke Wu Department of Biostatistics Johns Hopkins Bloomberg School of Public Health zhwu@jhu.edu http://zhenkewu.com 11 February, 2016
Lecture 8 140.653 Methods in Biostatistics 1
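◮ $Y_i$ independently distributed as $N(\mu_i, \sigma_i^2)$, $i = 1, \dots, n_1 + n_2$
◮ $Z_i = (Y_i - \mu_i)/\sigma_i \overset{iid}{\sim} N(0, 1)$
◮ Define quadratic forms $Q_1 = Z_1^2 + \cdots + Z_{n_1}^2$ and $Q_2 = Z_{n_1+1}^2 + \cdots + Z_{n_1+n_2}^2$
◮ $Q_1 \sim \chi^2_{n_1}$ with mean $n_1$ and variance $2n_1$
◮ $Q_2 \sim \chi^2_{n_2}$ with mean $n_2$ and variance $2n_2$
◮ $Q_1$ is independent of $Q_2$
◮ $F_{n_1,n_2} = \dfrac{Q_1/n_1}{Q_2/n_2} \sim F(n_1, n_2)$ (F-distribution with $n_1$ and $n_2$ degrees of freedom)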
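The construction of the F ratio from two independent chi-square quadratic forms can be checked by simulation. The following is a minimal sketch and not part of the original slides: the function name `simulate_f_ratio`, the choices n1 = 3 and n2 = 20, the seed, and the number of replicates are all illustrative assumptions. It builds Q1 and Q2 from squared standard normal draws and compares the simulated moments with the stated ones.

```python
import random


def simulate_f_ratio(n1, n2, reps=20000, seed=653):
    """Simulate Q1, Q2 and F = (Q1/n1)/(Q2/n2) from squared N(0, 1) draws."""
    rng = random.Random(seed)
    q1s, q2s, fs = [], [], []
    for _ in range(reps):
        # Q1 and Q2 use disjoint sets of independent Z's, so they are independent.
        q1 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n1))
        q2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n2))
        q1s.append(q1)
        q2s.append(q2)
        fs.append((q1 / n1) / (q2 / n2))
    return q1s, q2s, fs


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


# Illustrative degrees of freedom (assumed, not from the slides).
q1s, q2s, fs = simulate_f_ratio(n1=3, n2=20)
print(mean(q1s), variance(q1s))  # should be near n1 = 3 and 2 * n1 = 6
print(mean(q2s), variance(q2s))  # should be near n2 = 20 and 2 * n2 = 40
print(mean(fs))                  # mean of F(n1, n2) is n2 / (n2 - 2) when n2 > 2
```

Because the draws are random, the printed values only approximate the theoretical moments; increasing `reps` tightens the agreement.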
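◮ Data:
  ◮ $n$ observations; $p + s$ covariates
  ◮ continuous outcome $Y_i$, measured with error
  ◮ covariates: $X_i = (X_{i1}, \dots, X_{ip}, X_{i,p+1}, \dots, X_{i,p+s})^\top$, for $i = 1, \dots, n$
◮ Question: In light of data, can we use a simpler linear model nested within a larger one?
◮ Hypothesis testing:
  ◮ Null model: $Y \sim N_n(X_N\beta_N, \sigma^2 I_n)$
    ◮ $X_N$: $n \times (p + 1)$ design matrix obtained by stacking observations $X_i$; first $p$ (transformed) covariates and 1 intercept
    ◮ Regression coefficients: $\beta_N = (\beta_0, \beta_1, \dots, \beta_p)^\top$
    ◮ Standard deviation of measurement errors: $\sigma$
  ◮ Extended model: $Y \sim N_n(X_E\beta_E, \sigma^2 I_n)$
    ◮ $X_E$: design matrix with intercept + $p + s$ covariates
    ◮ $\beta_E = (\beta_N^\top, \beta_{p+1}, \dots, \beta_{p+s})^\top$
  ◮ $H_0: \beta_{p+1} = \cdots = \beta_{p+s} = 0$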
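◮ Rationale of the F-Test
  ◮ If $H_0$ is true, the estimates $\hat\beta_{p+1}, \dots, \hat\beta_{p+s}$ should all be close to 0.
  ◮ Reject $H_0$ if these estimates are sufficiently different from 0s.
  ◮ However, not every $\hat\beta_{p+j}$ should be treated the same: they are estimated with different precisions.
  ◮ Use a quadratic term to measure their joint differences from 0, weighting by precision: $\hat\beta_{[p+]}^\top \{\mathrm{Var}_E[\hat\beta_{[p+]}]\}^{-1} \hat\beta_{[p+]}$, where $\hat\beta_{[p+]} = (\hat\beta_{p+1}, \dots, \hat\beta_{p+s})^\top$
  ◮ $\mathrm{Var}_E[\hat\beta_{[p+]}] = \sigma^2 A (X_E^\top X_E)^{-1} A^\top$, where $A = [0_{s \times (p+1)}, I_{s \times s}]$
  ◮ Estimate $\sigma^2$ by $RSS_E/(n - p - s - 1)$; RSS for "residual sum of squares"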
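The pieces of this rationale can be assembled into a single display. This is a sketch rather than a reproduction of a slide: the symbol $W$ and the notation $\hat\sigma^2$ are introduced here for illustration, and the second line relies on the standard least-squares identity $RSS_N - RSS_E = \hat\beta_{[p+]}^\top \{A (X_E^\top X_E)^{-1} A^\top\}^{-1} \hat\beta_{[p+]}$.

```latex
\begin{aligned}
W &= \hat\beta_{[p+]}^{\top}
     \bigl\{\hat\sigma^{2}\, A (X_E^{\top} X_E)^{-1} A^{\top}\bigr\}^{-1}
     \hat\beta_{[p+]},
  \qquad \hat\sigma^{2} = \frac{RSS_E}{n-p-s-1}, \\[4pt]
\frac{W}{s} &= \frac{(RSS_N - RSS_E)/s}{RSS_E/(n-p-s-1)} .
\end{aligned}
```

Dividing the precision-weighted quadratic term by $s$ therefore gives a ratio of two mean squares, which is the form the F-statistic takes.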
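◮ Under $H_0$: $F = \dfrac{(RSS_N - RSS_E)/s}{RSS_E/(n - p - s - 1)} \sim F(s, n - p - s - 1)$
◮ $F(s, n - p - s - 1)$: F-distribution with $s$ and $n - p - s - 1$ degrees of freedom
◮ $RSS_N = Y'(I - H_N)Y$; $H_N = X_N(X_N'X_N)^{-1}X_N'$; "H" for hat matrix
◮ $RSS_E = Y'(I - H_E)Y$; $H_E = X_E(X_E'X_E)^{-1}X_E'$
◮ Under $H_0$, $(RSS_N - RSS_E)/\sigma^2 \sim \chi^2_s$ and $RSS_E/\sigma^2 \sim \chi^2_{n-p-s-1}$; they are independent
  ◮ Algebraic: The former is a function of $\hat\beta_E$, which is independent of $RSS_E$
  ◮ Geometric: Squared lengths of orthogonal vectors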
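To make the two residual sums of squares concrete, here is a small self-contained sketch that fits a null and an extended model by least squares and forms the F-statistic. Everything in it is an assumption made for illustration: the simulated data (generated so that the null model holds), the sizes n = 200, p = 1, s = 2, and the helper names `solve` and `rss`. The least-squares fit solves the normal equations directly, which is adequate for a toy example but is not how production software does it.

```python
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss(X, y):
    """Residual sum of squares from the least-squares fit of y on the columns of X."""
    k = len(X[0])
    xtx = [[sum(r[u] * r[v] for r in X) for v in range(k)] for u in range(k)]
    xty = [sum(r[u] * yi for r, yi in zip(X, y)) for u in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(X, y))


# Simulated data (assumed): intercept + p + s covariates, only the first p matter.
rng = random.Random(8)
n, p, s = 200, 1, 2
X_E = [[1.0] + [rng.gauss(0, 1) for _ in range(p + s)] for _ in range(n)]
X_N = [row[: p + 1] for row in X_E]          # intercept + first p covariates
y = [1.0 + 0.5 * row[1] + rng.gauss(0, 1) for row in X_E]

rss_n, rss_e = rss(X_N, y), rss(X_E, y)
f_stat = ((rss_n - rss_e) / s) / (rss_e / (n - p - s - 1))
print(rss_n, rss_e, f_stat)
```

Since the extended model contains the null model, `rss_e` can never exceed `rss_n`, so the statistic is non-negative; here it would be compared against the F(2, 196) distribution.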
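◮ Null model: residuals $R_N$; residual sum of squares $R_N'R_N$ on $n - p - 1$ df; residual mean square $\dfrac{R_N'R_N}{n-p-1} = S_N^2$
◮ Extended model: residuals $R_E$; residual sum of squares $R_E'R_E$ on $n - p - s - 1$ df; residual mean square $\dfrac{R_E'R_E}{n-p-s-1} = S_E^2$
◮ Change: sum of squares $R_N'R_N - R_E'R_E$ on $s$ df; mean square $\dfrac{R_N'R_N - R_E'R_E}{s}$
◮ $F_{s,n-p-s-1} = \dfrac{(R_N'R_N - R_E'R_E)/s}{R_E'R_E/(n-p-s-1)}$
◮ Reject $H_0$ if $F > F_{1-\alpha}(s, n - p - s - 1)$, the $1 - \alpha$ quantile of the $F(s, n - p - s - 1)$ distribution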
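The bookkeeping in this analysis-of-variance layout is simple enough to sketch as a function. The name `anova_table` and the input values are hypothetical, chosen only so that the arithmetic is easy to follow; they are not results from any data set in these slides.

```python
def anova_table(rss_n, rss_e, n, p, s):
    """Return (df, sum of squares, mean square) rows and the F-statistic."""
    df_n = n - p - 1          # residual df, null model
    df_e = n - p - s - 1      # residual df, extended model
    rows = {
        "null": (df_n, rss_n, rss_n / df_n),
        "extended": (df_e, rss_e, rss_e / df_e),
        "change": (s, rss_n - rss_e, (rss_n - rss_e) / s),
    }
    # F = mean square for the change / residual mean square of the extended model
    f_stat = rows["change"][2] / rows["extended"][2]
    return rows, f_stat


# Made-up inputs for illustration only.
rows, f_stat = anova_table(rss_n=120.0, rss_e=100.0, n=54, p=1, s=2)
for name, (df, ss, ms) in rows.items():
    print(f"{name:9s} df={df:3d}  SS={ss:8.2f}  MS={ms:6.3f}")
print("F =", f_stat)  # (20 / 2) / (100 / 50) = 5.0 for these made-up inputs
```

With these inputs the statistic would be referred to an F(2, 50) distribution.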
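◮ $n_2 \to \infty$:
  ◮ $Q_2/n_2 \to 1$ in probability
  ◮ For a fixed $n_1$, $F_{n_1,n_2} \to \chi^2_{n_1}/n_1$ in distribution as $n_2$ approaches infinity
  ◮ Or equivalently $n_1 F_{n_1,\infty} \sim \chi^2_{n_1}$
◮ If $s = 1$:
  ◮ The F-statistic equals $\bigl(\hat\beta_{p+1}/\widehat{se}(\hat\beta_{p+1})\bigr)^2$ for testing the null model
  ◮ Under $H_0$, it is distributed as $F(1, n - p - 2)$
  ◮ Approximately distributed as $\chi^2_1/1$ when $n \gg p$ (therefore 3.84 is the critical value at the 0.05 level)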
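The value 3.84 can be verified directly: a chi-square variable with 1 degree of freedom is the square of a standard normal, so its upper tail is a two-sided normal tail. The sketch below uses only the standard library; the function name `chi2_1_sf` is chosen here for illustration.

```python
import math


def chi2_1_sf(x):
    """Pr(chi-square with 1 df > x) = Pr(|Z| > sqrt(x)) for Z ~ N(0, 1)."""
    return math.erfc(math.sqrt(x / 2.0))


print(1.96 ** 2)        # about 3.84: the squared two-sided 5% normal cutoff
print(chi2_1_sf(3.84))  # close to 0.05
```

This is the same correspondence as between a t-test and the F-test with s = 1: squaring the usual cutoff of about 1.96 gives the chi-square cutoff.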
[Figure: grid of plots of the F distribution, with margins labelled df1 and df2; panel labels 1, 2, 3, 5, 6 on one margin and 1, 2, 3, 100, 1000, 2e+08 on the other.]
◮ Data: National Medical Expenditure Survey (NMES)
◮ Objective: To understand the relationship between medical expenditure and age and sex
◮ $Y_i = \log_e(\text{total medical expenditure}_i + 1)$
◮ $X_{i1} = \text{age}_i - 65$ years
◮ $X_{i2}$ = indicator of male (♂)
◮ # of subjects: $n = 4078$
◮ Compare which two models?
◮ Calculate Residual Sum of Squares and Residual Mean Squares.
◮ Calculate the F-statistic; what are the degrees of freedom for its distribution under the null?
◮ Compare it to the critical value at the 0.05 level.
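◮ $H_0$: Within a larger model B, model A is true (or state the scientific hypothesis: average log expenditure is linear in age)
◮ $F = \dfrac{\text{change in RSS} / \text{change in df}}{\text{residual sum of squares} / \text{residual df}} = 5.03$, with the denominator taken from model B
◮ This statistic, under repeated sampling, has an $F(2, 4073)$ distribution under $H_0$; it is approximately $\chi^2_2/2$ distributed.
◮ p-value: $\Pr(\chi^2_2/2 > 5.03) = 0.0065$ by approximation, or exactly from the $F(2, 4073)$ distribution
◮ Reject linearity in age.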
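Both p-values for this test have closed forms, because the numerator has 2 degrees of freedom: $\Pr(\chi^2_2 > x) = e^{-x/2}$ and $\Pr(F(2, d) > x) = (1 + 2x/d)^{-d/2}$. The sketch below evaluates them for the reported statistic; the function names are chosen here for illustration, and only the values 5.03, 2 and 4073 come from the slide.

```python
import math


def chi2_2_sf(x):
    """Pr(chi-square with 2 df > x); closed form exp(-x / 2)."""
    return math.exp(-x / 2.0)


def f_2_sf(x, df2):
    """Pr(F(2, df2) > x); this closed form holds only when df1 = 2."""
    return (1.0 + 2.0 * x / df2) ** (-df2 / 2.0)


f_obs, df1, df2 = 5.03, 2, 4073
p_approx = chi2_2_sf(df1 * f_obs)  # Pr(chi2_2 / 2 > 5.03) = Pr(chi2_2 > 10.06)
p_exact = f_2_sf(f_obs, df2)       # Pr(F(2, 4073) > 5.03)
print(round(p_approx, 4), round(p_exact, 4))
```

The chi-square approximation reproduces the 0.0065 quoted above; the exact F tail is slightly heavier, and with 4073 denominator degrees of freedom the two agree closely.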
◮ Is the non-linear relationship of average log expenditure on age the same for males and females?
◮ Or equivalently, is the difference between average log medical expenditure for males and females the same at all ages?
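◮ $H_0$: Within a larger model C, model B is true (or equivalently state the scientific hypothesis: no interaction between age and gender)
◮ $F = 4.59$
◮ Under repeated sampling, it is $F(3, 4070)$ distributed.
◮ p-value: $\Pr(\chi^2_3/3 > 4.59) = 0.0032$ by approximation, or exactly from the $F(3, 4070)$ distribution
◮ Reject the no-interaction assumption.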
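The approximate p-value for this test can be reproduced with the closed-form upper tail of a chi-square with 3 degrees of freedom, $\Pr(\chi^2_3 > x) = \mathrm{erfc}(\sqrt{x/2}) + \sqrt{2x/\pi}\, e^{-x/2}$. This is a standard-library sketch; the function name `chi2_3_sf` is an illustrative choice, and only the values 4.59 and 3 come from the slide.

```python
import math


def chi2_3_sf(x):
    """Pr(chi-square with 3 df > x), using the closed form for 3 df."""
    tail_normal = math.erfc(math.sqrt(x / 2.0))
    correction = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    return tail_normal + correction


f_obs, s = 4.59, 3
p_approx = chi2_3_sf(s * f_obs)  # Pr(chi2_3 / 3 > 4.59) = Pr(chi2_3 > 13.77)
print(round(p_approx, 4))
```

The result agrees with the 0.0032 quoted above to the reported precision.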
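◮ Ingo's Notes: http://biostat.jhsph.edu/~iruczins/teaching/140.751/
◮ $F = \dfrac{n-p-s-1}{s}\left(\dfrac{RSS_N}{RSS_E} - 1\right) = \dfrac{n-p-s-1}{s}\left(\dfrac{RSS_N/n}{RSS_E/n} - 1\right)$, and $sF \xrightarrow{d} \chi^2_s$ (large df2).
◮ Delta method to calculate the variance of a function of estimates.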