
Econometrics 2

Generalized Method of Moments (GMM) Estimation

Heino Bohn Nielsen


Outline

(1) Introduction and motivation
(2) Moment Conditions and Identification
(3) A Model Class: Instrumental Variables (IV) Estimation
(4) Method of Moments (MM) Estimation

Examples: Mean, OLS and Linear IV

(5) Generalized Method of Moments (GMM) Estimation

Properties: Consistency and Asymptotic Distribution

(6) Efficient GMM

Examples: Two-Stage Least Squares

(7) Comparison with Maximum Likelihood

Pseudo-ML Estimation

(8) Empirical Example: C-CAPM Model


Introduction

Generalized method of moments (GMM) is a general estimation principle. Estimators are derived from so-called moment conditions. Three main motivations:

(1) Many estimators can be seen as special cases of GMM.

Unifying framework for comparison.

(2) Maximum likelihood estimators have the smallest variance in the class of consistent and asymptotically normal estimators. But: we need a full description of the DGP and a correct specification. GMM is an alternative based on minimal assumptions.

(3) GMM estimation is often possible where a likelihood analysis is extremely difficult.

We only need a partial specification of the model. Models for rational expectations.


Moment Conditions and Identification

  • A moment condition is a statement involving the data and the parameters:

g(θ0) = E[f(wt, zt, θ0)] = 0,     (∗)

where θ is a K × 1 vector of parameters with true value θ0; f(·) is an R × 1 vector of (possibly non-linear) functions; wt contains model variables; and zt contains instruments.
  • If we knew the expectation then we could solve the equations in (∗) to find θ0.
  • If we knew the expectation then we could solve the equations in (∗) to find θ0.
  • If there is a unique solution, so that

E[f(wt, zt, θ)] = 0   if and only if   θ = θ0,

then we say that the system is identified.

  • Identification is essential for doing econometrics. Two ideas:

(1) Is the model constructed so that θ0 is unique (identification)?
(2) Are the data informative enough to determine θ0 (empirical identification)?


Instrumental Variables Estimation

  • In many applications, the moment condition has the specific form:

f(wt, zt, θ) = u(wt, θ) · zt,

where the (1 × 1) disturbance term, u(wt, θ), is multiplied by the (R × 1) vector of instruments, zt.

  • You can think of u(wt, θ) as the equivalent of an error term.

The moment condition becomes

g(θ0) = E[u(wt, θ0) · zt] = 0,

stating that the instruments are uncorrelated with the error term of the model.

  • This class of estimators is referred to as instrumental variables estimators.

The function u(wt, θ) may be linear or non-linear in θ.


Example: Moment Condition From RE

  • Consider a monetary policy rule, where the interest rate depends on expected future inflation:

rt = β · E[πt+1 | It] + εt.

Noting that any variable can be decomposed as

xt+1 = E[xt+1 | It] + vt,

where vt is the expectation error, we can write the model (with xt+1 = πt+1) as

rt = β · E[πt+1 | It] + εt = β · xt+1 + (εt − β · vt) = β · xt+1 + ut.

Note that xt+1 and ut are correlated, so OLS is inconsistent.

  • Under rational expectations, the expectation error, vt, should be orthogonal to the information set, It, and for zt ∈ It we have the moment condition

E[ut · zt] = E[(rt − β · xt+1) · zt] = 0.

This is enough to identify β.


Method of Moments (MM) Estimator

  • For a given sample, wt and zt (t = 1, 2, ..., T), we cannot calculate the expectation. We replace it with sample averages to obtain the analogous sample moments:

gT(θ) = (1/T) Σ_{t=1}^T f(wt, zt, θ).

We can derive an estimator, θ̂MM, as the solution to gT(θ̂MM) = 0.

  • To find an estimator, we need at least as many equations as we have parameters. The order condition for identification is R ≥ K.

— R = K is called exact identification. The estimator is denoted the method of moments (MM) estimator, θ̂MM.
— R > K is called over-identification. The estimator is denoted the generalized method of moments (GMM) estimator, θ̂GMM.


Example: MM Estimator of the Mean

  • Assume that yt is a random variable drawn from a population with expectation µ0.

We have a single moment condition:

g(µ0) = E[f(yt, µ0)] = E[yt − µ0] = 0,

where f(yt, µ0) = yt − µ0.

  • For a sample, y1, y2, ..., yT, we state the corresponding sample moment condition:

gT(µ̂) = (1/T) Σ_{t=1}^T (yt − µ̂) = 0.

The MM estimator of the mean µ0 is the solution, i.e.

µ̂MM = (1/T) Σ_{t=1}^T yt,

which is the sample average.
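As a quick numerical illustration (not part of the original slides, and using simulated data), the sample moment condition can be solved with a generic root finder; the solution coincides with the sample average:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=500)  # simulated sample with mu_0 = 2

def g_T(mu):
    """Sample moment condition g_T(mu) = (1/T) * sum(y_t - mu)."""
    return np.mean(y - mu)

# Solve g_T(mu) = 0 numerically; the solution equals the sample average.
mu_mm = brentq(g_T, y.min(), y.max())
assert np.isclose(mu_mm, y.mean())
```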


Example: OLS as a MM Estimator

  • Consider the linear regression model of yt on xt (K × 1):

yt = xt′β0 + εt.     (∗∗)

Assume that (∗∗) represents the conditional expectation:

E[yt | xt] = xt′β0   so that   E[εt | xt] = 0.

  • That implies the K unconditional moment conditions

g(β0) = E[xt · εt] = E[xt (yt − xt′β0)] = 0,

which we recognize as the minimal assumption for consistency of the OLS estimator.


  • We define the corresponding sample moment conditions as

gT(β̂) = (1/T) Σ_{t=1}^T xt (yt − xt′β̂) = (1/T) Σ_{t=1}^T xt yt − ((1/T) Σ_{t=1}^T xt xt′) β̂ = 0.

And the MM estimator is derived as the unique solution:

β̂MM = (Σ_{t=1}^T xt xt′)⁻¹ Σ_{t=1}^T xt yt,

provided that Σ_{t=1}^T xt xt′ is non-singular.

  • Method of moments is one way to motivate the OLS estimator.

Highlights the minimal (or identifying) assumptions for OLS.
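A small simulated check (hypothetical data, not from the slides) that solving the K sample moment conditions reproduces OLS:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 400, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta0 = np.array([1.0, 0.5, -2.0])
y = X @ beta0 + rng.normal(size=T)

# Solve the sample moment conditions sum_t x_t (y_t - x_t' beta) = 0:
beta_mm = np.linalg.solve(X.T @ X, X.T @ y)

# Identical to OLS computed by least squares:
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_mm, beta_ols)
```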


Example: Under-Identification

  • Consider again a regression model

yt = xt′β0 + εt = x1t′γ0 + x2t′δ0 + εt.

  • Assume that the K1 variables in x1t are predetermined, while the K2 = K − K1 variables in x2t are endogenous. That implies

E[x1t · εt] = 0   (K1 × 1)     (†)
E[x2t · εt] ≠ 0   (K2 × 1).    (††)

  • We have K parameters in β0 = (γ0′, δ0′)′, but only K1 < K moment conditions (i.e. K1 equations to determine K unknowns). The parameters are not identified and cannot be estimated consistently.


Example: Simple IV Estimator

  • Assume K2 new variables, z2t, that are correlated with x2t but uncorrelated with εt:

E[z2t · εt] = 0.     (†††)

The K2 moment conditions in (†††) can replace (††). To simplify notation, we define the (K × 1) vectors

xt = (x1t′, x2t′)′   and   zt = (x1t′, z2t′)′.

xt are model variables, z2t are new instruments, and zt are instruments. We say that x1t are instruments for themselves.

  • Using (†) and (†††) we have K moment conditions:

g(β0) = [E[x1t · εt]; E[z2t · εt]] = E[zt · εt] = E[zt (yt − xt′β0)] = 0,

which are sufficient to identify the K parameters in β.

  • The corresponding sample moment conditions are given by

gT(β̂) = (1/T) Σ_{t=1}^T zt (yt − xt′β̂) = 0.

  • The method of moments estimator is the unique solution:

β̂MM = (Σ_{t=1}^T zt xt′)⁻¹ Σ_{t=1}^T zt yt,

provided that Σ_{t=1}^T zt xt′ is non-singular.

  • Note the following:

(1) We need the instruments to identify the parameters.
(2) The MM estimator coincides with the simple IV estimator (see the sketch below).
(3) The procedure only works with K2 new instruments (i.e. R = K).
(4) Non-singularity of Σ_{t=1}^T zt xt′ requires relevant instruments.
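A minimal simulated sketch of the simple IV estimator, assuming one endogenous regressor and one new instrument (all data and names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
z2 = rng.normal(size=T)                        # new instrument
e = rng.normal(size=T)                         # structural error
x2 = 0.8 * z2 + 0.5 * e + rng.normal(size=T)   # endogenous: correlated with e
x1 = np.ones(T)                                # predetermined (constant)
y = 1.0 + 2.0 * x2 + e                         # beta0 = (1, 2)'

X = np.column_stack([x1, x2])                  # model variables
Z = np.column_stack([x1, z2])                  # instruments (x1 instruments itself)

# MM / simple IV estimator: (sum z_t x_t')^{-1} sum z_t y_t
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # inconsistent, for comparison
print(beta_iv, beta_ols)                       # IV near (1, 2); OLS slope biased
```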


Generalized Method of Moments Estimation

  • The case R > K is called over-identification.

More equations than parameters and no solution to gT(θ) = 0 in general.

  • Instead we minimize the distance from gT(θ) to zero.

The distance is measured by the quadratic form

QT(θ) = gT(θ)′ WT gT(θ),

where WT is an R × R symmetric and positive definite weight matrix.

  • The GMM estimator depends on the weight matrix:

θ̂GMM(WT) = argmin_θ {gT(θ)′ WT gT(θ)}.
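In code, the criterion and its numerical minimization might be sketched as follows (a generic sketch, not from the slides; `g_T` is any user-supplied sample-moment function):

```python
import numpy as np
from scipy.optimize import minimize

def gmm_objective(theta, g_T, W_T):
    """Quadratic-form criterion Q_T(theta) = g_T(theta)' W_T g_T(theta)."""
    g = g_T(theta)
    return g @ W_T @ g

def gmm_estimate(g_T, W_T, theta_init):
    """Minimize Q_T numerically; Nelder-Mead avoids the need for derivatives."""
    res = minimize(gmm_objective, theta_init, args=(g_T, W_T), method="Nelder-Mead")
    return res.x

# Usage with the mean example: one moment, g_T(mu) = mean(y) - mu.
y = np.random.default_rng(0).normal(2.0, 1.0, size=200)
g_T = lambda th: np.array([y.mean() - th[0]])
print(gmm_estimate(g_T, np.eye(1), np.zeros(1)))  # ~ sample average of y
```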


Distances and Weight Matrices

  • Consider a simple example with 2 moment conditions,

gT(θ) = (ga, gb)′,

where the dependence on T and θ is suppressed.

  • First consider a simple weight matrix, WT = I2:

QT(θ) = gT(θ)′ WT gT(θ) = ga² + gb²,

which is the square of the simple distance from gT(θ) to zero. Here the two coordinates are equally important.

  • Alternatively, look at a different weight matrix, WT = diag(2, 1):

QT(θ) = gT(θ)′ WT gT(θ) = 2·ga² + gb²,

which attaches more weight to the first coordinate in the distance.


Consistency: Why Does it Work?

  • Assume that a law of large numbers (LLN) applies to f(wt, zt, θ), i.e.

(1/T) Σ_{t=1}^T f(wt, zt, θ) → E[f(wt, zt, θ)]   for   T → ∞.

That requires IID or stationarity and weak dependence.

  • If the moment conditions are correct, g(θ0) = 0, then GMM is consistent,

θ̂GMM(WT) → θ0   as   T → ∞,

for any positive definite weight matrix WT.

  • Intuition: if a LLN applies, then gT(θ) converges to g(θ). Since θ̂GMM(WT) minimizes the distance from gT(θ) to zero, it will be a consistent estimator of θ0, the solution to g(θ) = 0.

  • The weight matrix, WT, has to be positive definite, so that we put a positive and non-zero weight on all moment conditions.


Asymptotic Distribution

  • Assume that a central limit theorem applies to f(wt, zt, θ), i.e.

√T · gT(θ0) = (1/√T) Σ_{t=1}^T f(wt, zt, θ0) → N(0, S),

where S is the asymptotic variance.

  • Then, for any positive definite weight matrix, W, the asymptotic distribution of the GMM estimator is given by

√T (θ̂GMM − θ0) → N(0, V).

The asymptotic variance is given by

V = (D′WD)⁻¹ D′WSWD (D′WD)⁻¹,

where

D = E[∂f(wt, zt, θ)/∂θ′]

is the expected value of the R × K matrix of first derivatives of the moments.
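A small helper for the sandwich formula above (a sketch; D, W and S are assumed to be already-estimated arrays):

```python
import numpy as np

def gmm_asymptotic_variance(D, W, S):
    """V = (D'WD)^{-1} D'W S W D (D'WD)^{-1} for an (R, K) Jacobian D,
    an (R, R) weight matrix W, and the (R, R) moment variance S."""
    bread = np.linalg.inv(D.T @ W @ D)   # (K, K) outer factor
    meat = D.T @ W @ S @ W @ D           # (K, K) inner term
    return bread @ meat @ bread
```

With W = S⁻¹ this collapses to (D′S⁻¹D)⁻¹, the efficient-GMM variance discussed next.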


Efficient GMM Estimation

  • The variance of θ̂GMM depends on the weight matrix, WT. The efficient GMM estimator has the smallest possible (asymptotic) variance.

  • Intuition: a moment with small variance is informative and should have a large weight. It can be shown that the optimal weight matrix, WT^opt, has the property that

plim WT^opt = S⁻¹.

With the optimal weight matrix, W = S⁻¹, the asymptotic variance simplifies to

V = (D′S⁻¹D)⁻¹ D′S⁻¹SS⁻¹D (D′S⁻¹D)⁻¹ = (D′S⁻¹D)⁻¹.

  • The best moment conditions have small S and large D.

— A small S means that the sample variation of the moment (the noise) is small.
— A large D means that the moment condition is strongly violated if θ ≠ θ0, so the moment is very informative on the true value, θ0. This is related to the curvature of the criterion function, as in ML.

  • Hypothesis testing can be based on the asymptotic distribution:

θ̂GMM ∼ᵃ N(θ0, T⁻¹V̂).

  • An estimator of the asymptotic variance is given by

V̂ = (DT′ ST⁻¹ DT)⁻¹,

where the (R × K) matrix

DT = ∂gT(θ)/∂θ′ = (1/T) Σ_{t=1}^T ∂f(wt, zt, θ)/∂θ′

is the sample average of the first derivatives, and ST is an estimator of S = T · V[gT(θ)]. If the observations are independent, a consistent estimator is

ST = (1/T) Σ_{t=1}^T f(wt, zt, θ) f(wt, zt, θ)′.

Estimation of the weight matrix is typically the trickiest part of GMM.


Computational Issues

  • The estimator is defined by minimizing QT(θ). Minimization can be done by solving the K first order conditions

∂QT(θ)/∂θ = ∂(gT(θ)′ WT gT(θ))/∂θ = 0   (K × 1),

sometimes analytically but often by numerical optimization.

  • We need an optimal weight matrix, WT^opt, but that depends on the parameters! Two-step efficient GMM (a code sketch follows the list):

(1) Choose an initial weight matrix, e.g. W[1] = IR, and find a consistent but inefficient first-step GMM estimator

θ̂[1] = argmin_θ gT(θ)′ W[1] gT(θ).

(2) Find the optimal weight matrix, W[2]^opt, based on θ̂[1], and find the efficient estimator

θ̂[2] = argmin_θ gT(θ)′ W[2]^opt gT(θ).

The estimator is not unique as it depends on the initial weight matrix W[1].
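A compact sketch of the two-step procedure for a generic moment function f returning the T × R matrix of moment contributions (a hypothetical helper; step 2 assumes independent observations when estimating S):

```python
import numpy as np
from scipy.optimize import minimize

def two_step_gmm(f, theta_init):
    """Two-step efficient GMM for a moment function f: theta -> (T, R) array.

    Step 1: W = I_R gives a consistent first-step estimate.
    Step 2: W = S_T^{-1} with S_T = (1/T) sum_t f_t f_t' at the step-1 estimate.
    """
    def Q(theta, W):
        g = f(theta).mean(axis=0)            # g_T(theta), shape (R,)
        return g @ W @ g

    R = f(np.asarray(theta_init)).shape[1]
    theta1 = minimize(Q, theta_init, args=(np.eye(R),), method="Nelder-Mead").x

    F = f(theta1)                            # (T, R) moment contributions
    W_opt = np.linalg.inv(F.T @ F / F.shape[0])
    theta2 = minimize(Q, theta_init, args=(W_opt,), method="Nelder-Mead").x
    return theta2, W_opt
```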


Iterated GMM estimator:

  • From the estimator θ̂[2] it is natural to update the weights to W[3]^opt and update θ̂[3]. We can switch between estimating W[·]^opt and θ̂[·] until convergence. Iterated GMM does not depend on the initial weight matrix. The two approaches are asymptotically equivalent.

Continuously updated GMM estimator:

  • A third approach is to recognize from the outset that the weight matrix depends on the parameters, and minimize

QT(θ) = gT(θ)′ WT(θ) gT(θ).

This is typically not possible to solve analytically.


Test of Overidentifying Moment Conditions

  • Recall that K moment conditions are sufficient to estimate the K parameters in θ.

If R > K, we can test the validity of the R−K overidentifying moment conditions.

  • By MM estimation we can set K moment conditions exactly equal to zero. If all R conditions are valid, then the remaining R − K moments should also be close to zero.

  • From the CLT we have

gT(θ0) ∼ᵃ N(0, T⁻¹S).

If we use the optimal weights, WT^opt → S⁻¹, then

ξJ = T · gT(θ̂GMM)′ WT^opt gT(θ̂GMM) = T · QT(θ̂GMM) → χ²(R − K).

  • This is the J-test or Hansen test for overidentifying restrictions. In linear models it is often referred to as the Sargan test.

ξJ is not a test of the validity of the model or the underlying economic theory. ξJ considers whether the R − K overidentifying moments are in line with the K identifying moments.
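A sketch of the computation, assuming the (T × R) matrix of moment contributions evaluated at the efficient GMM estimate is available:

```python
import numpy as np
from scipy.stats import chi2

def j_test(F_hat, W_opt, K):
    """Hansen J-test of the R - K overidentifying conditions.

    F_hat: (T, R) moment contributions at the efficient GMM estimate;
    W_opt: optimal weight matrix (consistent for S^{-1}); K: #parameters.
    """
    T, R = F_hat.shape
    g = F_hat.mean(axis=0)
    xi_J = T * (g @ W_opt @ g)             # xi_J = T * Q_T(theta_hat)
    return xi_J, chi2.sf(xi_J, df=R - K)   # statistic and p-value
```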


Example: The C-CAPM Model

  • Consider the consumption based capital asset pricing (C-CAPM) model of Hansen and Singleton (1982).

  • A representative agent maximizes the discounted value of lifetime utility subject to a budget constraint:

max Σ_{s=1}^∞ E[δ^s · u(ct+s) | It]   subject to   At+1 = (1 + rt+1) At + yt+1 − ct+1,

where At is financial wealth, yt is income, 0 ≤ δ ≤ 1 is a discount factor, and It is the information set at time t.

  • The first order condition is given by the Euler equation:

u′(ct) = E[δ · u′(ct+1) · Rt+1 | It],

where u′(·) is the derivative, and Rt+1 = 1 + rt+1 is the return factor.


  • Now assume a constant relative risk aversion (CRRA) utility function:

u(ct) = ct^(1−γ) / (1 − γ),   γ < 1,

so that u′(ct) = ct^(−γ). That gives the explicit Euler equation:

ct^(−γ) − E[δ · ct+1^(−γ) · Rt+1 | It] = 0.

  • To ensure stationarity, we reformulate:

E[δ · (ct+1/ct)^(−γ) · Rt+1 − 1 | It] = 0,

which is a conditional moment condition.

  • That implies the unconditional moment conditions

E[f(ct+1, ct, Rt+1; zt; δ, γ)] = E[(δ · (ct+1/ct)^(−γ) · Rt+1 − 1) · zt] = 0,

for all variables zt ∈ It included in the information set.

  • To estimate the parameters, θ = (δ, γ)′, we need at least R = 2 instruments in zt. We try with R = 3 instruments:

zt = (1, ct/ct−1, Rt)′.

  • That produces the moment conditions

E[δ · (ct+1/ct)^(−γ) · Rt+1 − 1] = 0
E[(δ · (ct+1/ct)^(−γ) · Rt+1 − 1) · (ct/ct−1)] = 0
E[(δ · (ct+1/ct)^(−γ) · Rt+1 − 1) · Rt] = 0,

for t = 1, 2, ..., T.

  • The model is formally identified, but γ is poorly determined. Weak instruments, little variation in the data, or a wrong model!
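A sketch of the corresponding moment-contribution function in code (hypothetical array names c and R_gross for consumption levels and gross returns; the dating follows the conditions above):

```python
import numpy as np

def ccapm_moments(theta, c, R_gross):
    """Moment contributions for the C-CAPM conditions above.

    theta = (delta, gamma); c[t] is consumption at t; R_gross[t] is R_t.
    For each usable t: u = delta * (c_{t+1}/c_t)^(-gamma) * R_{t+1} - 1,
    interacted with z_t = (1, c_t/c_{t-1}, R_t)'.
    """
    delta, gamma = theta
    growth = c[1:] / c[:-1]                       # c_{t+1}/c_t
    u = delta * growth[1:] ** (-gamma) * R_gross[2:] - 1.0
    Z = np.column_stack([np.ones_like(u),         # constant instrument
                         growth[:-1],             # lagged consumption growth
                         R_gross[1:-1]])          # lagged return
    return u[:, None] * Z                         # (T - 2, 3) array of f_t
```

This can be plugged into the two-step sketch above, e.g. two_step_gmm(lambda th: ccapm_moments(th, c, R_gross), np.array([0.99, 1.0])).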


Results for US data, 1959:3 − 1978:12 (standard errors in parentheses):

Method    Weights  Lags  δ (s.e.)          γ (s.e.)           T    ξJ     DF  p-val
2-Step    HC       1     0.9987 (0.0086)   0.8770 (3.6792)    237  0.434  1   0.510
Iterated  HC       1     0.9982 (0.0044)   1.0249 (1.8614)    237  1.068  1   0.301
CU        HC       1     0.9981 (0.0044)   0.9549 (1.8629)    237  1.067  1   0.302
2-Step    HAC      1     0.9987 (0.0092)   0.8876 (4.0228)    237  0.429  1   0.513
Iterated  HAC      1     0.9980 (0.0045)   0.8472 (1.8757)    237  1.091  1   0.296
CU        HAC      1     0.9977 (0.0045)   0.7093 (1.8815)    237  1.086  1   0.297
2-Step    HC       2     0.9975 (0.0066)   0.0149 (2.6415)    236  1.597  3   0.660
Iterated  HC       2     0.9968 (0.0045)   −0.0210 (1.7925)   236  3.579  3   0.311
CU        HC       2     0.9958 (0.0046)   −0.5526 (1.8267)   236  3.501  3   0.321
2-Step    HAC      2     0.9970 (0.0068)   −0.1872 (2.7476)   236  1.672  3   0.643
Iterated  HAC      2     0.9965 (0.0047)   −0.2443 (1.8571)   236  3.685  3   0.298
CU        HAC      2     0.9952 (0.0048)   −0.9094 (1.9108)   236  3.591  3   0.309


Weight Matrix Estimation (Univariate Case)

  • The optimal weight matrix is ST⁻¹, where ST is a consistent estimator of

S = V[√T · gT(θ)] = V[(1/√T) Σ_{t=1}^T ft] = (1/T) · V[Σ_{t=1}^T ft].

  • If ft and fs are independent, then the variance of the sum is the sum of the variances:

S = (1/T) · V[Σ_{t=1}^T ft] = (1/T) Σ_{t=1}^T V[ft] = (1/T) Σ_{t=1}^T E[ft²].

A natural estimator is

ST = (1/T) Σ_{t=1}^T ft².

  • This is robust to heteroskedasticity by construction and is often referred to as the heteroskedasticity consistent (HC) covariance estimator.


  • If ft and fs are correlated, the variance includes the covariances:

S = (1/T) · V[Σ_{t=1}^T ft] = V(ft) + 2 · Cov(ft, ft−1) + 2 · Cov(ft, ft−2) + ... .

  • The heteroskedasticity and autocorrelation consistent (HAC) variance estimator is

ST = V̂(ft) + Σ_{j=1}^{T−1} 2 · Ĉov(ft, ft−j),

where

Ĉov(ft, ft−j) = (1/T) Σ_{t=j+1}^T ft ft−j.

  • Problems:

(1) We cannot estimate as many covariances as observations.
(2) The simple HAC estimator is not necessarily positive definite.

  • We use a weight wj on covariance j, and let wj go to zero as j increases. This class of so-called kernel estimators can be written as

ST = V̂(ft) + Σ_{j=1}^{T−1} wj · 2 · Ĉov(ft, ft−j),   where   wj = k(j/B).

Here k(·) is a kernel function and B is the bandwidth parameter.

  • Example: the Bartlett kernel (the Newey-West estimator).

[Figure: weights in the Bartlett kernel for B = 6; the weights decline linearly towards zero as the lag length j increases.]
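A sketch of the Bartlett-kernel (Newey-West) estimator of S; note that the code uses the common Newey-West weight wj = 1 − j/(B + 1), a slight variation on the generic k(j/B) above:

```python
import numpy as np

def newey_west_S(F, B):
    """HAC estimate of S with Bartlett weights w_j = 1 - j/(B+1) (Newey-West).

    F: (T, R) array of moment contributions f_t, evaluated at theta_hat
       (so that their sample mean is approximately zero); B: bandwidth.
    """
    T = F.shape[0]
    S = F.T @ F / T                          # j = 0 term: (1/T) sum f_t f_t'
    for j in range(1, B + 1):
        w = 1.0 - j / (B + 1.0)              # Bartlett kernel weight
        Gamma_j = F[j:].T @ F[:-j] / T       # (1/T) sum_{t=j+1}^T f_t f_{t-j}'
        S += w * (Gamma_j + Gamma_j.T)       # covariance plus its transpose
    return S
```

This weighting keeps the estimate positive semi-definite, which the unweighted HAC sum does not guarantee.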


Example: 2SLS

  • Consider again a regression model

yt = xt′β0 + εt = x1t′γ0 + x2t′δ0 + εt,

where E[x1t · εt] = 0 and E[x2t · εt] ≠ 0. Assume that you have R > K valid instruments in zt, so that

g(β0) = E[zt · εt] = E[zt (yt − xt′β0)] = 0.

  • The corresponding (R × 1) sample moments are given by

gT(β) = (1/T) Σ_{t=1}^T zt (yt − xt′β) = (1/T) Z′(Y − Xβ),

where Y (T × 1), X (T × K), and Z (T × R) are the stacked data matrices.

  • In this case we cannot solve gT(β) = 0 directly; Z′X is R × K and not invertible.

  • Instead, we derive the GMM estimator by minimizing the criterion function

QT(β) = gT(β)′ WT gT(β)
      = (T⁻¹ Z′(Y − Xβ))′ WT (T⁻¹ Z′(Y − Xβ))
      = T⁻² (Y′Z WT Z′Y − 2β′X′Z WT Z′Y + β′X′Z WT Z′Xβ).

  • We take the first derivative, and the GMM estimator is the solution to

∂QT(β)/∂β = −2T⁻² X′Z WT Z′Y + 2T⁻² X′Z WT Z′Xβ = 0.

We find β̂GMM(WT) = (X′Z WT Z′X)⁻¹ X′Z WT Z′Y, depending on WT.

  • To estimate the optimal weight matrix, WT^opt = ST⁻¹, we use the estimator

ST = (1/T) Σ_{t=1}^T f(wt, zt, θ) f(wt, zt, θ)′ = (1/T) Σ_{t=1}^T ε̂t² zt zt′,

which allows for general heteroskedasticity of the disturbance term.


  • For the asymptotic distribution, we recall that

β̂GMM ∼ᵃ N(β0, T⁻¹ (D′S⁻¹D)⁻¹).

The derivative is given by the (R × K) matrix

DT = ∂gT(β)/∂β′ = ∂[(1/T) Σ_{t=1}^T zt (yt − xt′β)]/∂β′ = −(1/T) Σ_{t=1}^T zt xt′,

so the variance of the estimator becomes

V[β̂GMM] = T⁻¹ (DT′ WT^opt DT)⁻¹
         = T⁻¹ [(−(1/T) Σ_{t=1}^T xt zt′) ((1/T) Σ_{t=1}^T ε̂t² zt zt′)⁻¹ (−(1/T) Σ_{t=1}^T zt xt′)]⁻¹
         = (Σ_{t=1}^T xt zt′)⁻¹ (Σ_{t=1}^T ε̂t² zt zt′) (Σ_{t=1}^T zt xt′)⁻¹.

  • Note that this is the heteroskedasticity consistent (HC) variance estimator of White. GMM with allowance for heteroskedastic errors automatically produces heteroskedasticity consistent standard errors!

  • If we assume that the error terms are IID, the optimal weight matrix simplifies to

ST = (σ̂²/T) Σ_{t=1}^T zt zt′ = T⁻¹ σ̂² Z′Z,

where σ̂² is a consistent estimator of σ².

  • In this case the efficient GMM estimator becomes

β̂GMM = (X′Z ST⁻¹ Z′X)⁻¹ X′Z ST⁻¹ Z′Y
      = (X′Z (T⁻¹σ̂² Z′Z)⁻¹ Z′X)⁻¹ X′Z (T⁻¹σ̂² Z′Z)⁻¹ Z′Y
      = (X′Z (Z′Z)⁻¹ Z′X)⁻¹ X′Z (Z′Z)⁻¹ Z′Y,

which is identical to the two stage least squares (2SLS) estimator.
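A simulated check (hypothetical data, not from the slides) that GMM with WT ∝ (Z′Z)⁻¹ reproduces the 2SLS projection formula:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 500
Z = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])   # R = 3 instruments
e = rng.normal(size=T)
x2 = Z[:, 1] + 0.5 * Z[:, 2] + 0.5 * e + rng.normal(size=T)  # endogenous regressor
X = np.column_stack([np.ones(T), x2])                        # K = 2 < R
y = X @ np.array([1.0, 2.0]) + e

# Efficient GMM under IID errors: W proportional to (Z'Z)^{-1}.
W = np.linalg.inv(Z.T @ Z)
A = X.T @ Z @ W
beta_gmm = np.linalg.solve(A @ Z.T @ X, A @ Z.T @ y)

# Classical 2SLS: regress y on the fitted values from regressing X on Z.
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)                # first-stage fit
beta_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
assert np.allclose(beta_gmm, beta_2sls)
```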

  • The variance of the estimator is

V[β̂GMM] = T⁻¹ (DT′ ST⁻¹ DT)⁻¹ = σ̂² (X′Z (Z′Z)⁻¹ Z′X)⁻¹,

which again coincides with the 2SLS variance.


Pseudo-ML (PML) Estimation

  • The first order conditions for ML estimation can be seen as a sample counterpart to a moment condition:

(1/T) s(θ) = (1/T) Σ_{t=1}^T st(θ) = 0   corresponds to   E[st(θ)] = 0,

and ML becomes a special case of GMM.

  • θ̂ML is consistent under weaker assumptions than those maintained by ML. The FOC for a normal regression model corresponds to

E[xt (yt − xt′β)] = 0,

which is weaker than the assumption that the entire distribution is correctly specified. OLS is consistent even if εt is not normal.

  • An ML estimation that maximizes a likelihood function different from the true model's likelihood is referred to as a pseudo-ML or a quasi-ML estimator. Note that the variance matrix is then no longer the inverse information.


(My Unfair) Comparison of ML and GMM

                     Maximum Likelihood                      Generalized Method of Moments
Assumptions:         Full specification.                     Partial specification/weak assumptions.
                     Know density(θ0) apart from θ0.         Moment conditions: E[f(data; θ0)] = 0.
                                                             Strong economic assumptions.
Efficiency:          Cramér-Rao lower bound                  Efficient based on moment conditions.
                     (smallest possible variance).           Larger than Cramér-Rao.
Typical approach:    Statistical description of the data.    Estimate deep parameters of economic model.
                     Misspecification testing.               Restrictions recover economics.
Robustness:          First order conditions should hold!     Moment conditions should hold!
                     PML is a GMM interpretation of ML.      Weights and variances can be made robust.
                     Use larger PML variance.
