SLIDE 1

Additional Topics on Linear Regression

Ping Yu

School of Economics and Finance The University of Hong Kong

SLIDE 2

1. Tests for Functional Form Misspecification
2. Nonlinear Least Squares
3. Omitted and Irrelevant Variables
4. Model Selection
5. Generalized Least Squares
6. Testing for Heteroskedasticity
7. Regression Intervals and Forecast Intervals
SLIDE 3

Review of Our Assumptions

Assumption OLS.0 (random sampling): $(y_i, x_i)$, $i = 1,\dots,n$, are i.i.d.
Assumption OLS.1 (full rank): $\mathrm{rank}(X) = k$. Assumption OLS.1′: $\mathrm{rank}(E[xx']) = k$.
Assumption OLS.2 (first moment): $E[y\mid x] = x'\beta$. Assumption OLS.2′: $y = x'\beta + u$ with $E[xu] = 0$.
Assumption OLS.3 (second moment): $E[u^2] < \infty$. Assumption OLS.3′ (homoskedasticity): $E[u^2\mid x] = \sigma^2$.
Assumption OLS.4 (normality): $u\mid x \sim N(0,\sigma^2)$.
Assumption OLS.5: $E[u^4] < \infty$ and $E[\|x\|^4] < \infty$.
SLIDE 4

Table 1: Relationship between Different Models ($y = x'\beta + u$)

  • $E[xu] = 0$ (linear projection) ⇒ consistency
  • $E[u\mid x] = 0$ (linear regression) ⇒ unbiasedness
  • $E[u\mid x] = 0$ and $E[u^2\mid x] = \sigma^2$ (homoskedastic linear regression) ⇒ Gauss-Markov Theorem

Each successive model strengthens the assumptions of the previous one.
SLIDE 5

Our Plan of This Chapter

Examine the validity of these assumptions and the cures when they fail, especially OLS.2 and OLS.3′. Assumption OLS.2 has many implications. For example, it implies

  • the conditional mean of y given x is linear in x;
  • all relevant regressors are included in x and are fixed.

Sections 1 and 2 examine the first implication:

  • Section 1 tests whether $E[y\mid x]$ is indeed $x'\beta$.
  • Section 2 provides more flexible specifications of $E[y\mid x]$.

Sections 3 and 4 examine the second implication:

  • Section 3 checks the benefits and costs of including irrelevant variables or omitting relevant variables.
  • Section 4 provides some model selection procedures based on information criteria.

Sections 5 and 6 examine Assumption OLS.3′:

  • Section 5 shows that there are more efficient estimators of β when this assumption fails.
  • Section 6 checks whether this assumption fails.

Section 7 examines the external validity of the model.
SLIDE 6

Tests for Functional Form Misspecification

SLIDE 7

Tests for Functional Form Misspecification

Ramsey’s REgression Specification Error Test (RESET)

Misspecification of $E[y\mid x]$ may be due to omitted variables or misspecified functional forms. We only examine the second source of misspecification here.

A straightforward test: add nonlinear functions of the regressors to the regression, and test their significance using a Wald test.

  • Fit $y_i = x_i'\tilde\beta + z_i'\tilde\gamma + \tilde u_i$ by OLS, and form a Wald statistic for $\gamma = 0$, where $z_i = h(x_i)$ denotes functions of $x_i$ which are not linear functions of $x_i$ (perhaps squares of non-binary regressors).

RESET: The null model is $y_i = x_i'\beta + u_i$. Let
$$z_i = \begin{pmatrix} \hat y_i^2 \\ \vdots \\ \hat y_i^m \end{pmatrix}$$
be an $(m-1)$-vector of powers of $\hat y_i = x_i'\hat\beta$. Then run the auxiliary regression
$$y_i = x_i'\tilde\beta + z_i'\tilde\gamma + \tilde u_i \qquad (1)$$
by OLS, and form the Wald statistic $W_n$ for $\gamma = 0$.
SLIDE 8

Tests for Functional Form Misspecification

continue...

Under $H_0$, $W_n \xrightarrow{d} \chi^2_{m-1}$. Thus the null is rejected at the α level if $W_n$ exceeds the upper-α-tail critical value of the $\chi^2_{m-1}$ distribution.

To implement the test, m must be selected in advance. Typically, small values such as m = 2, 3, or 4 seem to work best.

The RESET test is particularly powerful in detecting the single-index model, $y_i = G(x_i'\beta) + u_i$, where $G(\cdot)$ is a smooth "link" function. Why? Note that (1) may be written as
$$y_i = x_i'\tilde\beta + (x_i'\hat\beta)^2\tilde\gamma_1 + \cdots + (x_i'\hat\beta)^m\tilde\gamma_{m-1} + \tilde u_i,$$
which has essentially approximated $G(\cdot)$ by an mth-order polynomial.

Other tests: Hausman test, Conditional Moment (CM) test, Zheng's score test.
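To make the mechanics concrete, the following is a minimal Python sketch of the RESET test with m = 3, using a homoskedasticity-based Wald statistic; the simulated data, the variable names, and the choice m = 3 are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of Ramsey's RESET test with m = 3 (simulated data; names are illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])                  # regressors including intercept
y = 1 + 0.5 * x + 0.3 * x ** 2 + rng.normal(size=n)   # nonlinear truth, so RESET should reject

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

m = 3
Z = np.column_stack([y_hat ** p for p in range(2, m + 1)])   # (m-1) powers of fitted values
XZ = np.column_stack([X, Z])
coef = np.linalg.lstsq(XZ, y, rcond=None)[0]
u_tilde = y - XZ @ coef

# Homoskedastic Wald-type statistic for gamma = 0: W = n*(RSS0 - RSS1)/RSS1 ~ chi2(m-1)
rss0 = np.sum((y - y_hat) ** 2)
rss1 = np.sum(u_tilde ** 2)
W = n * (rss0 - rss1) / rss1
p_value = 1 - stats.chi2.cdf(W, df=m - 1)
print(W, p_value)
```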
SLIDE 9

Nonlinear Least Squares

SLIDE 10

Nonlinear Least Squares

Nonlinear Least Squares

If the specification test rejects the linear specification, we may consider using a nonlinear setup for $E[y\mid x]$. Suppose $E[y_i\mid x_i = x] = m(x\mid\theta)$.

Nonlinear regression means that $m(x\mid\theta)$ is a nonlinear function of θ (rather than of x). The functional form of $m(x\mid\theta)$ can be suggested by an economic model, or, as with the LSE, it can be treated as a nonlinear approximation to a general conditional mean function.

$m(x\mid\theta) = \exp(x'\theta)$: Exponential Link Regression

  • The exponential link function is strictly positive, so this choice can be useful when it is desired to constrain the mean to be strictly positive.

$m(x\mid\theta) = \theta_1 + \theta_2 x^{\theta_3}$, $x > 0$: Power Transformed Regressors

  • A generalized version of the power transformation is the famous Box-Cox (1964) transformation, where the regressor $x^{\theta_3}$ is generalized to $x^{(\theta_3)}$ with
$$x^{(\lambda)} = \begin{cases} \dfrac{x^\lambda - 1}{\lambda}, & \text{if } \lambda > 0,\\[4pt] \log x, & \text{if } \lambda = 0.\end{cases}$$
  • The function $x^{(\lambda)}$ nests the linear (λ = 1) and logarithmic (λ = 0) transformations continuously (see the sketch below).
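A minimal sketch of the Box-Cox transform as defined above; the grid of x and λ values is an illustrative assumption.

```python
# Minimal sketch of the Box-Cox transformation x^(lambda); all curves pass through (1, 0).
import numpy as np

def box_cox(x, lam):
    """(x**lam - 1)/lam for lam > 0, log(x) for lam = 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

x = np.linspace(0.2, 2.0, 50)
for lam in (0.0, 0.5, 1.0):
    print(lam, box_cox(1.0, lam))   # check: the transform of x = 1 is 0 for every lambda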
SLIDE 11

Nonlinear Least Squares

[Figure: Box-Cox Transformation for Different λ Values; all curves pass through (1, 0)]
SLIDE 12

Nonlinear Least Squares

continue...

$m(x\mid\theta) = \theta_1 + \theta_2\exp(\theta_3 x)$: Exponentially Transformed Regressors

$m(x\mid\theta) = G(x'\theta)$, $G$ known

  • When $G(\cdot) = (1+\exp(-\cdot))^{-1}$, the regression with $m(x\mid\theta) = G(x'\theta)$ is called the logistic link regression.
  • When $G(\cdot) = \Phi(\cdot)$ with $\Phi(\cdot)$ being the cdf of the standard normal, it is called the probit link regression.

$m(x\mid\theta) = \theta_1'x_1 + \theta_2'x_1\,G\!\left(\dfrac{x_2 - \theta_3}{\theta_4}\right)$: Smooth Transition

$m(x\mid\theta) = \theta_1 + \theta_2 x + \theta_3(x - \theta_4)1(x > \theta_4)$: Continuous Threshold Regression

$m(x\mid\theta) = (\theta_1'x_1)1(x_2 \le \theta_3) + (\theta_2'x_1)1(x_2 > \theta_3)$: Threshold Regression

  • When $\theta_4 = 0$, the smooth transition model (STM) reduces to the threshold regression (TR) model.
  • When $\theta_4 = \infty$, the STM reduces to linear regression.
  • For CTR and TR, $m(x\mid\theta)$ is not smooth in θ.
SLIDE 13

Nonlinear Least Squares

[Figure: Difference Between STM, CTR and TR]
SLIDE 14

Nonlinear Least Squares

The NLLS Estimator

The nonlinear least squares (NLLS) estimator is
$$\hat\theta = \arg\min_\theta \sum_{i=1}^n\left(y_i - m(x_i\mid\theta)\right)^2.$$
The FOCs for minimization are $0 = \sum_{i=1}^n m_\theta(x_i\mid\hat\theta)\hat u_i$ with $m_\theta(x\mid\theta) = \frac{\partial}{\partial\theta}m(x\mid\theta)$.

Theorem: If the model is identified and $m(x\mid\theta)$ is differentiable with respect to θ,
$$\sqrt n\,(\hat\theta - \theta_0) \xrightarrow{d} N(0, V),$$
where $V = E[m_{\theta i}m_{\theta i}']^{-1}\,E[m_{\theta i}m_{\theta i}'u_i^2]\,E[m_{\theta i}m_{\theta i}']^{-1}$ with $m_{\theta i} = m_\theta(x_i\mid\theta_0)$.

V can be estimated by
$$\hat V = \left(\frac1n\sum_{i=1}^n \hat m_{\theta i}\hat m_{\theta i}'\right)^{-1}\left(\frac1n\sum_{i=1}^n \hat m_{\theta i}\hat m_{\theta i}'\hat u_i^2\right)\left(\frac1n\sum_{i=1}^n \hat m_{\theta i}\hat m_{\theta i}'\right)^{-1},$$
where $\hat m_{\theta i} = m_\theta(x_i\mid\hat\theta)$ and $\hat u_i = y_i - m(x_i\mid\hat\theta)$.
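A minimal NLLS sketch for the exponential link specification $m(x\mid\theta) = \exp(x'\theta)$, with the sandwich variance estimate from the theorem; the data-generating process, starting values, and variable names are illustrative assumptions.

```python
# Minimal NLLS sketch: fit m(x|theta) = exp(x'theta) and compute the sandwich variance.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta0 = np.array([0.5, 0.8])
y = np.exp(X @ theta0) + rng.normal(scale=0.5, size=n)

resid = lambda th: y - np.exp(X @ th)           # y_i - m(x_i|theta)
fit = least_squares(resid, x0=np.zeros(2))      # minimizes the sum of squared residuals
theta_hat = fit.x

# m_theta(x|theta) = exp(x'theta) * x ; sandwich estimate of the asymptotic variance
u_hat = resid(theta_hat)
m_theta = np.exp(X @ theta_hat)[:, None] * X
Q = m_theta.T @ m_theta / n
Omega = (m_theta * u_hat[:, None] ** 2).T @ m_theta / n
V_hat = np.linalg.inv(Q) @ Omega @ np.linalg.inv(Q)
se = np.sqrt(np.diag(V_hat) / n)
print(theta_hat, se)
```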
SLIDE 15

Omitted and Irrelevant Variables

SLIDE 16

Omitted and Irrelevant Variables

Omitted Variables

The true model is a long regression
$$y_i = x_{1i}'\beta_1 + x_{2i}'\beta_2 + u_i,\quad E[x_iu_i] = 0, \qquad (2)$$
but we estimate a short regression
$$y_i = x_{1i}'\gamma_1 + v_i,\quad E[x_{1i}v_i] = 0. \qquad (3)$$
In general $\beta_1 \ne \gamma_1$. Why?
$$\gamma_1 = E[x_{1i}x_{1i}']^{-1}E[x_{1i}y_i] = E[x_{1i}x_{1i}']^{-1}E\!\left[x_{1i}\left(x_{1i}'\beta_1 + x_{2i}'\beta_2 + u_i\right)\right] = \beta_1 + E[x_{1i}x_{1i}']^{-1}E[x_{1i}x_{2i}']\beta_2 = \beta_1 + \Gamma\beta_2,$$
where $\Gamma = E[x_{1i}x_{1i}']^{-1}E[x_{1i}x_{2i}']$ is the coefficient matrix from a regression of $x_{2i}$ on $x_{1i}$.
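A minimal simulation sketch checking $\gamma_1 = \beta_1 + \Gamma\beta_2$ numerically; all coefficient values and names are illustrative assumptions.

```python
# Minimal simulation of omitted variable bias: the short-regression coefficient
# converges to beta1 + Gamma*beta2. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)            # x2 correlated with x1 (Gamma = 0.6)
beta1, beta2 = 1.0, 2.0
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

gamma1 = np.sum(x1 * y) / np.sum(x1 ** 2)     # short regression of y on x1 only
Gamma = np.sum(x1 * x2) / np.sum(x1 ** 2)     # regression of x2 on x1
print(gamma1, beta1 + Gamma * beta2)          # both are close to 1 + 0.6*2 = 2.2
```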
SLIDE 17

Omitted and Irrelevant Variables

Direct and Indirect Effects of x1 on y

$\gamma_1 \ne \beta_1$ unless $\Gamma = 0$ or $\beta_2 = 0$. The former means that the regression of $x_{2i}$ on $x_{1i}$ yields a set of zero coefficients (they are uncorrelated), and the latter means that the coefficient on $x_{2i}$ in (2) is zero.

The difference $\Gamma\beta_2$ is known as omitted variable bias. $\gamma_1$ includes both the direct effect of $x_1$ on $y$ ($\beta_1$) and the indirect effect ($\Gamma\beta_2$) through $x_2$. [Figure here]

To avoid omitted variable bias, the standard advice is to include potentially relevant variables in the estimated model. But many desired variables are not available in a given dataset. In this case, the possibility of omitted variable bias should be acknowledged and discussed in the course of an empirical investigation.
SLIDE 18

Omitted and Irrelevant Variables

Figure: Direct and Indirect Effects of x1 on y

SLIDE 19

Omitted and Irrelevant Variables

Irrelevant Variables

When $\beta_2 = 0$ and $\beta_1$ is the parameter of interest, $x_{2i}$ is "irrelevant". In this case, the estimator of $\beta_1$ from the short regression, $\bar\beta_1 = (X_1'X_1)^{-1}X_1'y$, is consistent by the analysis above.

Efficiency comparison under homoskedasticity:
$$n\,\mathrm{AVar}(\bar\beta_1) = E[x_{1i}x_{1i}']^{-1}\sigma^2 \equiv Q_{11}^{-1}\sigma^2,$$
and
$$n\,\mathrm{AVar}(\hat\beta_1) = Q_{11.2}^{-1}\sigma^2 \equiv \left(Q_{11} - Q_{12}Q_{22}^{-1}Q_{21}\right)^{-1}\sigma^2.$$

If $Q_{12} = E[x_{1i}x_{2i}'] = 0$ (so the variables are orthogonal), then the two estimators have equal asymptotic efficiency. Otherwise, since $Q_{12}Q_{22}^{-1}Q_{21} > 0$, $Q_{11} > Q_{11.2}$, and consequently
$$Q_{11}^{-1}\sigma^2 < Q_{11.2}^{-1}\sigma^2.$$
This means that $\bar\beta_1$ has a lower asymptotic variance matrix than $\hat\beta_1$ if the irrelevant variables are correlated with the relevant variables.
SLIDE 20

Omitted and Irrelevant Variables

Intuition

The irrelevant variable does not provide information about $y_i$, but introduces multicollinearity into the system, so it decreases the denominator of $\mathrm{AVar}(\hat\beta_1)$ from $Q_{11}$ to $Q_{11.2}$ without decreasing its numerator $\sigma^2$.

Take the model $y_i = \beta_0 + \beta_1 x_i + u_i$ and suppose that $\beta_0 = 0$. Let $\hat\beta_1$ be the estimate of $\beta_1$ from the unconstrained model, and $\bar\beta_1$ be the estimate under the constraint $\beta_0 = 0$ (the least squares estimate with the intercept omitted).

Then we can show that under homoskedasticity,
$$n\,\mathrm{AVar}(\bar\beta_1) = \frac{\sigma^2}{\sigma_x^2 + \mu^2},\quad\text{while}\quad n\,\mathrm{AVar}(\hat\beta_1) = \frac{\sigma^2}{\sigma_x^2},$$
where $E[x_i] = \mu$ and $\mathrm{Var}(x_i) = \sigma_x^2$.

When $\mu = E[1\cdot x_i] \ne 0$, $\bar\beta_1$ has a lower asymptotic variance.
SLIDE 21

Omitted and Irrelevant Variables

When $\beta_2 \ne 0$ and $Q_{12} = 0$

If $x_2$ is relevant ($\beta_2 \ne 0$) and uncorrelated with $x_1$, then it decreases the numerator without affecting the denominator of $\mathrm{AVar}(\hat\beta_1)$, so it should be included in the regression. For example, including individual characteristics in a regression of beer consumption on beer prices leads to more precise estimates of the price elasticity, because individual characteristics are believed to be uncorrelated with beer prices but affect beer consumption.

Table 2: Consistency and Efficiency with Omitted and Irrelevant Variables

  • $Q_{12} = 0$, $\beta_2 = 0$: $\bar\beta_1$ consistent; same efficiency.
  • $Q_{12} = 0$, $\beta_2 \ne 0$: $\bar\beta_1$ consistent; long regression more efficient.
  • $Q_{12} \ne 0$, $\beta_2 = 0$: $\bar\beta_1$ consistent; short regression more efficient.
  • $Q_{12} \ne 0$, $\beta_2 \ne 0$: consistency of $\bar\beta_1$ depends on $Q_{12}\beta_2$; efficiency comparison undetermined.
SLIDE 22

Omitted and Irrelevant Variables

The Heteroskedastic Case

From the last chapter, we know that when the model is heteroskedastic, it is possible that $\hat\beta_1$ is more efficient than $\bar\beta_1$ ($= \hat\beta_{1R}$) even if $x_2$ is irrelevant; that is, adding irrelevant variables can actually decrease the estimation variance.

This result seems to contradict our initial motivation for pursuing restricted estimation (or the short regression): to improve estimation efficiency. It turns out that a more refined answer is appropriate. Constrained estimation is desirable, but not RLS estimation: while least squares is asymptotically efficient for estimating the unconstrained projection model, it is not an efficient estimator of the constrained projection model; the efficient minimum distance estimator is the right choice.
SLIDE 23

Model Selection

SLIDE 24

Model Selection

Introduction

We discussed the costs and benefits of inclusion/exclusion of variables. How does a researcher go about selecting an econometric specification when economic theory does not provide complete guidance?

In practice, a large number of variables are usually introduced at the initial stage of modeling to attenuate possible modeling biases. On the other hand, to enhance predictability and to select significant variables, econometricians usually use stepwise deletion and subset selection.

It is important that the model selection question be well-posed. For example, the question "What is the right model for y?" is not well-posed, because it does not make clear the conditioning set. In contrast, the question "Which subset of $(x_1,\dots,x_K)$ enters the regression function $E[y_i\mid x_{1i} = x_1,\dots,x_{Ki} = x_K]$?" is well-posed.
SLIDE 25

Model Selection

Setup

In many cases the problem of model selection can be reduced to the comparison of two nested models, as the larger problem can be written as a sequence of such comparisons.

We thus consider the question of the inclusion of $X_2$ in the linear regression
$$y = X_1\beta_1 + X_2\beta_2 + u,$$
where $X_1$ is $n\times k_1$ and $X_2$ is $n\times k_2$. This is equivalent to the comparison of the two models
$$\mathcal{M}_1:\; y = X_1\beta_1 + u,\quad E[u\mid X_1,X_2] = 0,$$
$$\mathcal{M}_2:\; y = X_1\beta_1 + X_2\beta_2 + u,\quad E[u\mid X_1,X_2] = 0.$$
Note that $\mathcal{M}_1 \subset \mathcal{M}_2$. To be concrete, we say that $\mathcal{M}_2$ is true if $\beta_2 \ne 0$.

Notation: OLS residual vectors $\hat u_1$ and $\hat u_2$, estimated variances $\hat\sigma_1^2$ and $\hat\sigma_2^2$, etc., for Models 1 and 2. For simplicity, use the homoskedasticity assumption $E[u_i^2\mid x_{1i},x_{2i}] = \sigma^2$.
SLIDE 26

Model Selection

Consistent Model Selection Procedure

A model selection procedure is a data-dependent rule which selects one of the two models. We can write this as $\widehat{\mathcal{M}}$. A model selection procedure is consistent if
$$P\!\left(\widehat{\mathcal{M}} = \mathcal{M}_1\mid\mathcal{M}_1\right) \to 1,\qquad P\!\left(\widehat{\mathcal{M}} = \mathcal{M}_2\mid\mathcal{M}_2\right) \to 1.$$

Selection based on the Wald statistic $W_n$: for some significance level α, let $c_\alpha$ satisfy $P(\chi^2_{k_2} > c_\alpha) = \alpha$. Then select $\mathcal{M}_1$ if $W_n \le c_\alpha$, else select $\mathcal{M}_2$.

Inconsistency: if α is held fixed, then $P(\widehat{\mathcal{M}} = \mathcal{M}_1\mid\mathcal{M}_1) \to 1-\alpha < 1$, although $P(\widehat{\mathcal{M}} = \mathcal{M}_2\mid\mathcal{M}_2) \to 1$.
SLIDE 27

Model Selection

Information Criterion

Intuition: there is a tension between model fit, as measured by the maximized log-likelihood value, and the principle of parsimony, which favors a simple model. The fit of the model can be improved by increasing model complexity, but parameters are only added if the resulting improvement in fit sufficiently compensates for the loss of parsimony.

The most popular information criteria (IC) include the Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC). The former is inconsistent while the latter is consistent.

The AIC under normality for model m is
$$\mathrm{AIC}_m = \log\!\left(\hat\sigma^2_m\right) + \frac{2k_m}{n}, \qquad (4)$$
where $\hat\sigma^2_m$ is the variance estimate for model m, $\log(\hat\sigma^2_m)$ is roughly $-2\bar\ell_n$ (neglecting the constant term) with $\bar\ell_n$ the average log-likelihood, and $k_m$ is the number of coefficients in the model. The rule is to select $\mathcal{M}_1$ if $\mathrm{AIC}_1 < \mathrm{AIC}_2$, else select $\mathcal{M}_2$.
SLIDE 28

Model Selection

Inconsistency of the AIC

Inconsistency: the AIC tends to overfit. Why? Under $\mathcal{M}_1$,
$$LR = n\left(\log\!\left(\hat\sigma^2_1\right) - \log\!\left(\hat\sigma^2_2\right)\right) \xrightarrow{d} \chi^2_{k_2}, \qquad (5)$$
so
$$P\!\left(\widehat{\mathcal{M}} = \mathcal{M}_1\mid\mathcal{M}_1\right) = P\!\left(\mathrm{AIC}_1 < \mathrm{AIC}_2\mid\mathcal{M}_1\right) = P\!\left(\log\!\left(\hat\sigma^2_1\right) + \frac{2k_1}{n} < \log\!\left(\hat\sigma^2_2\right) + 2\,\frac{k_1+k_2}{n}\,\Big|\,\mathcal{M}_1\right) = P\!\left(LR < 2k_2\mid\mathcal{M}_1\right) \to P\!\left(\chi^2_{k_2} < 2k_2\right) < 1.$$
SLIDE 29

Model Selection

Consistency of the BIC

The BIC is based on
$$\mathrm{BIC}_m = \log\!\left(\hat\sigma^2_m\right) + \frac{\log(n)\,k_m}{n}. \qquad (6)$$
Since $\log(n) > 2$ (if $n > 8$), the BIC places a larger penalty than the AIC on the number of estimated parameters and is more parsimonious.

Consistency: because (5) holds under $\mathcal{M}_1$,
$$\frac{LR}{\log(n)} \xrightarrow{p} 0,$$
so
$$P\!\left(\widehat{\mathcal{M}} = \mathcal{M}_1\mid\mathcal{M}_1\right) = P\!\left(\mathrm{BIC}_1 < \mathrm{BIC}_2\mid\mathcal{M}_1\right) = P\!\left(LR < \log(n)\,k_2\mid\mathcal{M}_1\right) = P\!\left(\frac{LR}{\log(n)} < k_2\,\Big|\,\mathcal{M}_1\right) \to P(0 < k_2) = 1.$$
Also, under $\mathcal{M}_2$ one can show that
$$\frac{LR}{\log(n)} \xrightarrow{p} \infty,$$
thus
$$P\!\left(\widehat{\mathcal{M}} = \mathcal{M}_2\mid\mathcal{M}_2\right) = P\!\left(\frac{LR}{\log(n)} > k_2\,\Big|\,\mathcal{M}_2\right) \to 1.$$
SLIDE 30

Model Selection

Difference between the AIC and BIC

Essentially, to consistently select $\mathcal{M}_1$, we must let the significance level of the LR test approach zero to asymptotically avoid choosing a model that is too large, so the critical value (or the penalty) must diverge to infinity. On the other hand, to consistently select $\mathcal{M}_2$, the penalty must be $o(n)$ so that LR divided by the penalty converges in probability to infinity under $\mathcal{M}_2$.

Compared with a fixed-penalty scheme such as the AIC, the consistent selection procedures sacrifice some power in exchange for an asymptotically zero type I error.

Although the AIC tends to overfit, this need not be a defect of the AIC. The AIC is derived as an estimate of the Kullback-Leibler information distance
$$\mathrm{KLIC}(\mathcal{M}) = E\!\left[\log f(y\mid X) - \log f(y\mid X,\mathcal{M})\right]$$
between the true density and the model density. In other words, the AIC attempts to select a good approximating model for inference. In contrast, the BIC and other IC attempt to estimate the "true" model.
SLIDE 31

Model Selection

Figure: Difference Between AIC and Other IC

SLIDE 32

Model Selection

Ordered and Unordered Regressors

The general problem is the model
$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + u_i,\quad E[u_i\mid x_i] = 0,$$
and the question is which subset of the coefficients is non-zero (equivalently, which regressors enter the regression).

Ordered regressors:
$$\mathcal{M}_1:\ \beta_1\ne 0,\ \beta_2 = \beta_3 = \cdots = \beta_K = 0,$$
$$\mathcal{M}_2:\ \beta_1\ne 0,\ \beta_2\ne 0,\ \beta_3 = \cdots = \beta_K = 0,$$
$$\vdots$$
$$\mathcal{M}_K:\ \beta_1\ne 0,\ \beta_2\ne 0,\ \dots,\ \beta_K\ne 0,$$
which are nested. The AIC procedure estimates the K models by OLS, stores the residual variance $\hat\sigma^2$ for each model, and then selects the model with the lowest AIC; similarly for the BIC (a minimal sketch follows below).

Unordered regressors: a model consists of any possible subset of the regressors $\{x_{1i},\dots,x_{Ki}\}$, but there are $2^K$ such models, which can be a very large number; e.g., $2^{10} = 1024$ and $2^{20} = 1{,}048{,}576$. So this case seems computationally prohibitive.
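A minimal sketch of AIC/BIC selection over ordered nested models, applying definitions (4) and (6); the simulated design, in which only the first two regressors matter, is an illustrative assumption.

```python
# Minimal sketch of AIC/BIC selection over K ordered nested models, using
# AIC_m = log(sigma2_m) + 2*k_m/n and BIC_m = log(sigma2_m) + log(n)*k_m/n.
import numpy as np

rng = np.random.default_rng(3)
n, K = 400, 5
X = rng.normal(size=(n, K))
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # only the first two regressors matter

aic, bic = [], []
for k in range(1, K + 1):                      # model m uses the first k regressors
    Xk = X[:, :k]
    b = np.linalg.lstsq(Xk, y, rcond=None)[0]
    sigma2 = np.mean((y - Xk @ b) ** 2)        # residual variance estimate
    aic.append(np.log(sigma2) + 2 * k / n)
    bic.append(np.log(sigma2) + np.log(n) * k / n)

print("AIC picks", np.argmin(aic) + 1, "regressors; BIC picks", np.argmin(bic) + 1)
```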
SLIDE 33

Model Selection

Recent Developments

There is no clear answer as to which IC, if any, should be preferred. From a decision-theoretic viewpoint, the choice of a model from a set of models should depend on the intended use of the model. For example, the purpose of the model may be to summarize the main features of a complex reality, or to predict some outcome, or to test some important hypothesis.

Claeskens and Hjort (2003) propose the focused information criterion (FIC), which focuses on the parameter singled out for interest. This criterion seems close to the target-based principle.

Shrinkage: continuously shrink the coefficients rather than discretely select the variables, e.g., LASSO, SCAD and MCP.

Model averaging.
SLIDE 34

Generalized Least Squares

SLIDE 35

Generalized Least Squares

The Weighted Least Squares (WLS) Estimator

Recall that when the only information is $E[x_iu_i] = 0$, the LSE is semi-parametrically efficient. When the extra information $E[u_i\mid x_i] = 0$ is available, the LSE is generally inefficient, while the WLS estimator is efficient.

The WLS estimator:
$$\bar\beta = \left(X'D^{-1}X\right)^{-1}X'D^{-1}y, \qquad (7)$$
where $D = \mathrm{diag}\{\sigma^2_1,\dots,\sigma^2_n\}$ and $\sigma^2_i = \sigma^2(x_i) = E[u_i^2\mid x_i]$.

Intuition:
$$\bar\beta = \arg\min_\beta \sum_{i=1}^n\left(y_i - x_i'\beta\right)^2\frac{1}{\sigma^2_i} = \arg\min_\beta \sum_{i=1}^n\left(\frac{y_i}{\sigma_i} - \frac{x_i'}{\sigma_i}\beta\right)^2,$$
where the objective function takes the form of a weighted sum of squared residuals, and the transformed model $\frac{y_i}{\sigma_i} = \frac{x_i'}{\sigma_i}\beta + \frac{u_i}{\sigma_i}$ is homoskedastic (why?). Under homoskedasticity, the Gauss-Markov theorem implies the efficiency of the LSE in the transformed model, which is the WLS estimator in the original model.
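A minimal sketch of the (infeasible) WLS estimator in (7), assuming the conditional variances $\sigma^2_i$ are known; the data-generating process and the variance function are illustrative assumptions.

```python
# Minimal sketch of the infeasible WLS estimator (7) with known sigma_i^2.
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(1, 3, size=n)
X = np.column_stack([np.ones(n), x])
sigma2 = 0.5 * x ** 2                          # sigma_i^2 = E[u_i^2 | x_i], assumed known here
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma2) * rng.normal(size=n)

w = 1.0 / sigma2                               # weight each observation by 1/sigma_i^2
XtWX = X.T @ (w[:, None] * X)
XtWy = X.T @ (w * y)
beta_wls = np.linalg.solve(XtWX, XtWy)         # (X'D^{-1}X)^{-1} X'D^{-1}y
print(beta_wls)
```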
SLIDE 36

Generalized Least Squares

The Feasible GLS Estimator

A feasible GLS (FGLS) estimator replaces the unknown $D$ with an estimate $\hat D = \mathrm{diag}\{\hat\sigma^2_1,\dots,\hat\sigma^2_n\}$.

Model the conditional variance using the parametric form
$$\sigma^2_i = \alpha_0 + z_{1i}'\alpha_1 = \alpha'z_i,$$
where $z_{1i}$ is some $q\times 1$ function of $x_i$. Typically, $z_{1i}$ consists of squares (and perhaps levels) of some (or all) elements of $x_i$. Often the functional form is kept simple for parsimony.

Let $\eta_i = u_i^2$. Then
$$E[\eta_i\mid x_i] = \alpha_0 + z_{1i}'\alpha_1,$$
and we have the regression equation
$$\eta_i = \alpha_0 + z_{1i}'\alpha_1 + \xi_i,\quad E[\xi_i\mid x_i] = 0. \qquad (8)$$
This regression error $\xi_i$ is generally heteroskedastic and has conditional variance
$$\mathrm{Var}(\xi_i\mid x_i) = \mathrm{Var}\!\left(u_i^2\mid x_i\right) = E\!\left[u_i^4\mid x_i\right] - \left(E\!\left[u_i^2\mid x_i\right]\right)^2.$$
SLIDE 37

Generalized Least Squares

Skedastic Regression

Suppose $u_i$ (and thus $\eta_i$) were observed. Then we could estimate α by OLS:
$$\bar\alpha = (Z'Z)^{-1}Z'\eta,\qquad \sqrt n\,(\bar\alpha - \alpha) \xrightarrow{d} N(0, V_\alpha),$$
where $V_\alpha = E[z_iz_i']^{-1}E[z_iz_i'\xi_i^2]\,E[z_iz_i']^{-1}$.

While $u_i$ is not observed, we have the OLS residual $\hat u_i = y_i - x_i'\hat\beta = u_i - x_i'(\hat\beta - \beta)$. Thus
$$\varphi_i \equiv \hat\eta_i - \eta_i = \hat u_i^2 - u_i^2 = -2u_ix_i'(\hat\beta - \beta) + (\hat\beta - \beta)'x_ix_i'(\hat\beta - \beta),$$
and
$$\frac{1}{\sqrt n}\sum_{i=1}^n z_i\varphi_i = -\frac{2}{n}\sum_{i=1}^n z_iu_ix_i'\,\sqrt n(\hat\beta - \beta) + \frac{1}{n}\sum_{i=1}^n z_i(\hat\beta - \beta)'x_ix_i'\,\sqrt n(\hat\beta - \beta) \xrightarrow{p} 0.$$

Let
$$\tilde\alpha = (Z'Z)^{-1}Z'\hat\eta \qquad (9)$$
be from the OLS regression of $\hat\eta_i$ on $z_i$. Then
$$\sqrt n\,(\tilde\alpha - \alpha) = \sqrt n\,(\bar\alpha - \alpha) + \left(n^{-1}Z'Z\right)^{-1}n^{-1/2}Z'\varphi \xrightarrow{d} N(0, V_\alpha). \qquad (10)$$
Thus the fact that $\eta_i$ is replaced with $\hat\eta_i$ is asymptotically irrelevant. We call (9) the skedastic regression, as it is estimating $\sigma^2(\cdot)$.
SLIDE 38

Generalized Least Squares

Practical Consideration

Estimate $\sigma^2_i = z_i'\alpha$ by
$$\tilde\sigma^2_i = \tilde\alpha'z_i. \qquad (11)$$
Suppose that $\tilde\sigma^2_i > 0$ for all i. Then set $\tilde D = \mathrm{diag}\{\tilde\sigma^2_1,\dots,\tilde\sigma^2_n\}$ and compute the FGLS estimator
$$\tilde\beta = \left(X'\tilde D^{-1}X\right)^{-1}X'\tilde D^{-1}y.$$
We can iterate between $\tilde D$ and $\tilde\beta$ until convergence.

If $\tilde\sigma^2_i < 0$ or $\tilde\sigma^2_i \approx 0$ for some i, use a trimming rule
$$\bar\sigma^2_i = \max\{\tilde\sigma^2_i,\ \underline\sigma^2\}$$
for some $\underline\sigma^2 > 0$.
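A minimal FGLS sketch following (8), (9) and (11): OLS residuals, the skedastic regression of $\hat u_i^2$ on $z_i$, trimming, and then weighted least squares; the data-generating process and the choice $z_i = (1, x_i^2)'$ are illustrative assumptions.

```python
# Minimal FGLS sketch: OLS -> skedastic regression of squared residuals on z_i
# (eqs. (8)-(9)) -> trimmed variance fits (11) -> weighted least squares.
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
sigma2_true = 0.2 + 0.8 * x ** 2
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma2_true) * rng.normal(size=n)

# Step 1: OLS and residuals
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
u_hat = y - X @ beta_ols

# Step 2: skedastic regression of eta_i = u_hat_i^2 on z_i = (1, x_i^2)
Z = np.column_stack([np.ones(n), x ** 2])
alpha_tilde = np.linalg.lstsq(Z, u_hat ** 2, rcond=None)[0]
sigma2_tilde = Z @ alpha_tilde

# Step 3: trim non-positive fits, then FGLS = WLS with weights 1/sigma2
sigma2_bar = np.maximum(sigma2_tilde, 1e-3)
w = 1.0 / sigma2_bar
beta_fgls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print(beta_ols, beta_fgls)
```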
SLIDE 39

Generalized Least Squares

Asymptotics for the FGLS Estimator

Theorem: If the skedastic regression is correctly specified, then
$$\sqrt n\,(\bar\beta - \tilde\beta) \xrightarrow{p} 0,$$
and thus
$$\sqrt n\,(\tilde\beta - \beta) \xrightarrow{d} N(0, V),$$
where $V = E\!\left[\sigma_i^{-2}x_ix_i'\right]^{-1}$.

The natural estimator of the asymptotic variance of $\tilde\beta$ is
$$\tilde V^0 = \left(\frac1n\sum_{i=1}^n \tilde\sigma_i^{-2}x_ix_i'\right)^{-1} = \left(\frac1n X'\tilde D^{-1}X\right)^{-1},$$
which is consistent for V as $n\to\infty$.
SLIDE 40

Generalized Least Squares

Asymptotics for the Quasi-FGLS Estimator

If $\alpha'z_i$ is only an approximation to the true conditional variance $\sigma^2_i$, we have shown in the last chapter that
$$V = E\!\left[(\alpha'z_i)^{-1}x_ix_i'\right]^{-1} E\!\left[(\alpha'z_i)^{-2}\sigma^2_i x_ix_i'\right] E\!\left[(\alpha'z_i)^{-1}x_ix_i'\right]^{-1}.$$
V takes a sandwich form similar to the covariance matrix of the OLS estimator. Unless $\sigma^2_i = \alpha'z_i$, $\tilde V^0$ is inconsistent for V.

An appropriate solution is to use a White-type estimator in place of $\tilde V^0$:
$$\tilde V = \left(\frac1n\sum_{i=1}^n \tilde\sigma_i^{-2}x_ix_i'\right)^{-1}\left(\frac1n\sum_{i=1}^n \tilde\sigma_i^{-4}\hat u_i^2x_ix_i'\right)\left(\frac1n\sum_{i=1}^n \tilde\sigma_i^{-2}x_ix_i'\right)^{-1} = n\left(X'\tilde D^{-1}X\right)^{-1}\left(X'\tilde D^{-1}\hat D\tilde D^{-1}X\right)\left(X'\tilde D^{-1}X\right)^{-1},$$
where $\hat D = \mathrm{diag}\{\hat u_1^2,\dots,\hat u_n^2\}$. This estimator is robust to misspecification of the conditional variance.
SLIDE 41

Generalized Least Squares

Choice Between the FGLS and OLS

FGLS is asymptotically superior to OLS, but we do not exclusively estimate regression models by FGLS.

First, FGLS estimation depends on the specification and estimation of the skedastic regression. Since the form of the skedastic regression is unknown, and it may be estimated with considerable error, the estimated conditional variances may contain more noise than information about the true conditional variances. In this case, FGLS performs worse than OLS in practice.

Second, individual estimated conditional variances may be negative, which requires trimming. This introduces an element of arbitrariness for empirical researchers.

Third, OLS is a more robust estimator of the parameter vector. It is consistent not only in the regression model ($E[u\mid x] = 0$), but also under the assumptions of linear projection ($E[xu] = 0$). The GLS and FGLS estimators, on the other hand, require the assumption of a correct conditional mean. The point is that the efficiency gains from FGLS are built on the stronger assumption of a correct conditional mean, and the cost is a reduction in robustness to misspecification.
SLIDE 42

Testing for Heteroskedasticity

SLIDE 43

Testing for Heteroskedasticity

General Idea

If heteroskedasticity is present, more efficient estimation is possible, so we need to test for heteroskedasticity.

Heteroskedasticity may come from many sources, e.g., random coefficients, misspecification, stratified sampling, etc. So rejection of the null may be an indication of other deviations from our basic assumptions.

The hypothesis of homoskedasticity is that $E[u^2\mid x] = \sigma^2$, or equivalently that $H_0: \alpha_1 = 0$ in the skedastic regression (8). We may test this hypothesis by the Wald test.

We usually impose the stronger hypothesis and test that $u_i$ is independent of $x_i$, in which case $\xi_i$ is independent of $x_i$ and the asymptotic variance for $\tilde\alpha$ simplifies to
$$V_\alpha = E[z_iz_i']^{-1}E[\xi_i^2].$$

Theorem: Under $H_0$ and $u_i$ independent of $x_i$, the Wald test of $H_0$ is asymptotically $\chi^2_q$.

Most tests for heteroskedasticity take this basic form. The main difference between popular "tests" lies in which transformation of $x_i$ enters $z_i$.
SLIDE 44

Testing for Heteroskedasticity

The Breusch-Pagan Test

Breusch and Pagan (1979) assume $u_i$ follows $N(0,\alpha'z_i)$ and use the Lagrange Multiplier (LM) test to check whether $\alpha_1 = 0$:
$$LM = \frac{1}{2\hat\sigma^4}\left(\sum_{i=1}^n z_if_i\right)'\left(\sum_{i=1}^n z_iz_i'\right)^{-1}\left(\sum_{i=1}^n z_if_i\right),$$
where $f_i = \hat u_i^2 - \hat\sigma^2$.

This LM test statistic is similar to the Wald test statistic; a key difference is that $\hat E[\xi_i^2]$ is replaced by $2\hat\sigma^4$, which is a consistent estimator of $E[\xi_i^2]$ under $H_0$ when $u_i$ follows $N(0,\sigma^2)$.

Koenker (1981) shows that the asymptotic size and power of the Breusch-Pagan test are extremely sensitive to the kurtosis of the distribution of $u_i$, and suggests renormalizing LM by $n^{-1}\sum_{i=1}^n f_i^2$ rather than $2\hat\sigma^4$ to achieve the correct size.
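A minimal sketch of the Breusch-Pagan LM statistic and Koenker's renormalized version with $z_i = (1, x_i)'$; the heteroskedastic data-generating process and all names are illustrative assumptions.

```python
# Minimal sketch of the Breusch-Pagan LM statistic and Koenker's renormalization.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = np.sqrt(0.5 + 0.5 * x ** 2) * rng.normal(size=n)   # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + u

beta = np.linalg.lstsq(X, y, rcond=None)[0]
u_hat = y - X @ beta
sigma2_hat = np.mean(u_hat ** 2)
f = u_hat ** 2 - sigma2_hat                             # f_i = u_hat_i^2 - sigma2_hat

Z = X                                                   # z_i = (1, x_i), so q = 1
Zf = Z.T @ f
quad = Zf @ np.linalg.solve(Z.T @ Z, Zf)                # (sum z f)'(sum z z')^{-1}(sum z f)
LM_bp = quad / (2 * sigma2_hat ** 2)                    # Breusch-Pagan
LM_koenker = quad / np.mean(f ** 2)                     # Koenker's studentized version
print(LM_bp, LM_koenker, stats.chi2.ppf(0.95, 1))       # compare with chi2(q) critical value
```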
SLIDE 45

Testing for Heteroskedasticity

The White Test

White (1980) observes that when $H_0$ holds, $\hat\Omega = n^{-1}\sum_{i=1}^n x_ix_i'\hat u_i^2$ and $\hat\sigma^2\hat Q = \left(n^{-1}\sum_{i=1}^n\hat u_i^2\right)\left(n^{-1}\sum_{i=1}^n x_ix_i'\right)$ should have the same probability limit, so their difference should converge to zero.

Collecting the non-redundant elements of $x_ix_i'$, denoted $z_i = (1, z_{1i}')'$ as above, we are testing whether
$$D_n = n^{-1}\sum_{i=1}^n z_{1i}\left(\hat u_i^2 - \hat\sigma^2\right)$$
is close to zero. Under the auxiliary assumption that $u_i$ is independent of $x_i$,
$$nD_n'\hat B_n^{-1}D_n \xrightarrow{d} \chi^2_q$$
under $H_0$, where
$$\hat B_n = \frac1n\sum_{i=1}^n\left(\hat u_i^2 - \hat\sigma^2\right)^2(z_{1i}-\bar z_1)(z_{1i}-\bar z_1)'$$
is an estimator of the asymptotic variance matrix of $\sqrt n\,D_n$.
SLIDE 46

Testing for Heteroskedasticity

Simplification

Given that $u_i$ is independent of $x_i$, $\hat B_n$ can be replaced by
$$\tilde B_n = \left(\frac1n\sum_{i=1}^n\left(\hat u_i^2 - \hat\sigma^2\right)^2\right)\left(\frac1n\sum_{i=1}^n(z_{1i}-\bar z_1)(z_{1i}-\bar z_1)'\right).$$
From Exercise 7, Koenker's and White's test statistics can be expressed in the form of $nR^2$ in some regression.

The Breusch-Pagan and White tests have degrees of freedom that depend on the number of regressors in $E[y\mid x]$. Sometimes we want to conserve degrees of freedom. A test that combines features of the Breusch-Pagan and White tests but has only two degrees of freedom takes $z_{1i} = (\hat y_i, \hat y_i^2)'$, where $\hat y_i$ are the OLS fitted values. This reduced White test has some similarity to the RESET test: $nR^2$ from the regression of $\hat u_i^2$ on $(1, \hat y_i, \hat y_i^2)$ has a limiting $\chi^2_2$ distribution under $H_0$ (a minimal sketch follows).
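A minimal sketch of the reduced White test, computing $nR^2$ from the regression of $\hat u_i^2$ on $(1, \hat y_i, \hat y_i^2)$ and comparing it with the $\chi^2_2$ critical value; the data and names are illustrative assumptions.

```python
# Minimal sketch of the reduced White test: nR^2 from regressing u_hat^2 on (1, y_hat, y_hat^2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 1000
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 1.5, -0.5]) + np.exp(0.5 * x[:, 0]) * rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta
u2 = (y - y_hat) ** 2

W = np.column_stack([np.ones(n), y_hat, y_hat ** 2])   # auxiliary regressors
g = np.linalg.lstsq(W, u2, rcond=None)[0]
fitted = W @ g
r2 = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
print(n * r2, stats.chi2.ppf(0.95, 2))                 # reject H0 if nR^2 exceeds the critical value
```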
SLIDE 47

Regression Intervals and Forecast Intervals

SLIDE 48

Regression Intervals and Forecast Intervals

Regression Intervals

All previous sections consider internal validity. We now consider external validity, i.e., prediction.

Suppose we want to estimate $m(x) = E[y_i\mid x_i = x] = x'\beta$ at a particular point x (which may or may not be the same as some $x_i$), and note that this is a (linear) function of β. Letting $r(\beta) = x'\beta$ and $\theta = r(\beta)$, we see that $\hat m(x) = \hat\theta = x'\hat\beta$ and $R = x$, so $s(\hat\theta) = \sqrt{n^{-1}x'\hat Vx}$. Thus an asymptotic 95% confidence interval for $m(x)$ is
$$\left[x'\hat\beta \pm 2\sqrt{n^{-1}x'\hat Vx}\right].$$
Viewed as a function of x, the width of the confidence set depends on x. In Figure 5 (next slide), the confidence bands take a hyperbolic shape, which means that the regression line is less precisely estimated for very large and very small values of x.

For a given value of $x_i = x$, we may also want to forecast (guess) $y_i$ out-of-sample.¹ A reasonable forecast is still $\hat m(x) = x'\hat\beta$, since $m(x)$ is the mean-square-minimizing forecast of $y_i$.

¹ Here x cannot be the same as any $x_i$ observed. Why?
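A minimal sketch of the regression interval $x'\hat\beta \pm 2\sqrt{n^{-1}x'\hat Vx}$, using a heteroskedasticity-robust $\hat V$; the data and the evaluation point are illustrative assumptions.

```python
# Minimal sketch of a regression interval for m(x) = x'beta at one evaluation point.
import numpy as np

rng = np.random.default_rng(8)
n = 500
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
u_hat = y - X @ beta_hat
Qinv = np.linalg.inv(X.T @ X / n)
Omega = (X * u_hat[:, None] ** 2).T @ X / n
V_hat = Qinv @ Omega @ Qinv                      # robust asymptotic variance of sqrt(n)(beta_hat - beta)

x0 = np.array([1.0, 1.5])                        # evaluation point (1, x) with x = 1.5
m_hat = x0 @ beta_hat
se = np.sqrt(x0 @ V_hat @ x0 / n)
print(m_hat - 2 * se, m_hat + 2 * se)            # asymptotic 95% interval for m(x)
```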
SLIDE 49

Regression Intervals and Forecast Intervals

Figure: Typical Regression Intervals

SLIDE 50

Regression Intervals and Forecast Intervals

Forecast Intervals

The forecast error is $\hat u_i = y_i - \hat m(x) = u_i - x'(\hat\beta - \beta)$. As the out-of-sample error $u_i$ is independent of the in-sample estimate $\hat\beta$, this has variance
$$E[\hat u_i^2] = E[u_i^2\mid x_i = x] + x'E\!\left[(\hat\beta-\beta)(\hat\beta-\beta)'\right]x = \sigma^2(x) + n^{-1}x'Vx.$$
Assuming $E[u_i^2\mid x_i] = \sigma^2$, the natural estimate of this variance is $\hat\sigma^2 + n^{-1}x'\hat Vx$, so a standard error for the forecast is
$$\hat s(x) = \sqrt{\hat\sigma^2 + n^{-1}x'\hat Vx}.$$
This is different from the standard error for the conditional mean. If we have an estimate of the conditional variance function, e.g., $\tilde\sigma^2(x) = \tilde\alpha'z$, then the forecast standard error is
$$\hat s(x) = \sqrt{\tilde\sigma^2(x) + n^{-1}x'\hat Vx}.$$

A natural asymptotic 95% forecast interval for $y_i$ is $\left[x'\hat\beta \pm 2\hat s(x)\right]$, but its validity is based on the asymptotic normality of the studentized ratio, i.e., the asymptotic normality of
$$\frac{u_i - x'(\hat\beta - \beta)}{\hat s(x)}.$$
But no such asymptotic approximation can be made unless $u_i \sim N(0,\sigma^2)$, which is generally invalid. To get an accurate forecast interval, we need to estimate the conditional distribution of $u_i$ given $x_i = x$, which is hard. Usually, people focus on the simple approximate interval $\left[x'\hat\beta \pm 2\hat s(x)\right]$.
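A minimal sketch of the approximate forecast interval $x'\hat\beta \pm 2\hat s(x)$ in the homoskedastic case; the data and the out-of-sample point are illustrative assumptions.

```python
# Minimal sketch of a forecast interval with s_hat(x)^2 = sigma2_hat + n^{-1} x'V_hat x.
import numpy as np

rng = np.random.default_rng(9)
n = 500
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
u_hat = y - X @ beta_hat
sigma2_hat = np.mean(u_hat ** 2)
V_hat = sigma2_hat * np.linalg.inv(X.T @ X / n)    # homoskedastic variance of sqrt(n)(beta_hat - beta)

x0 = np.array([1.0, 1.5])                          # out-of-sample point (1, x)
s_hat = np.sqrt(sigma2_hat + x0 @ V_hat @ x0 / n)  # forecast standard error
forecast = x0 @ beta_hat
print(forecast - 2 * s_hat, forecast + 2 * s_hat)  # approximate 95% forecast interval
```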