Econometrics 1: IV, GMM and MLE

James A. Duffy¹
Oxford, Michaelmas 2016 (revised: 28/12/16)

¹ I thank N. Geesing, L. Freund, K. Kuske, and E. Munro for comments. The manuscript was prepared with LyX 2.2.2.


Contents

1 Instrumental variables
  1.1 Introduction
  1.2 Identification
    1.2.1 Rank and order conditions
    1.2.2 A restatement of the rank condition
  1.3 Estimation
    1.3.1 From identification to estimation
    1.3.2 Asymptotics
  1.4 The 'exclusion restriction'
  1.5 Another way of computing the 2SLS estimator
  1.6 Testing exogeneity of the endogenous regressors (Hausman test)
  1.7 Testing the identifying conditions
    1.7.1 Tests of overidentifying restrictions (Sargan test)
    1.7.2 Testing the rank condition
  1.8 Weak instruments
    1.8.1 The problem
    1.8.2 Dealing with weak instruments
    1.8.3 The Anderson–Rubin (AR) test
  1.A Suggested (optional) further reading
2 Generalised method of moments
  2.1 Introduction
    2.1.1 Motivating examples
    2.1.2 A general framework
  2.2 Asymptotics
    2.2.1 Consistency
    2.2.2 Asymptotic normality
    2.2.3 Local identification and weak identification
  2.3 Asymptotic efficiency
    2.3.1 The choice of weight matrix
    2.3.2 The implied (efficient) choice of moments
    2.3.3 Efficiency in the linear IV model
  2.4 Tests of over-identifying restrictions
  2.5 Hypothesis testing
    2.5.1 Tests of nonlinear restrictions and the delta method
    2.5.2 GMM criterion-based tests (QLR tests)


  2.A Suggested (optional) further reading
3 Maximum likelihood
  3.1 Introduction
    3.1.1 Parametric and semiparametric estimation
    3.1.2 The likelihood function: the general case
    3.1.3 The likelihood function: with i.i.d. data
  3.2 Univariate examples
    3.2.1 Continuous random variables
    3.2.2 Discrete random variables
    3.2.3 Mixed continuous/discrete random variables
  3.3 Models with covariates
  3.4 Consistency and identification
    3.4.1 Consistency
    3.4.2 Identification via Kullback–Leibler minimisation
  3.5 Asymptotic distribution of the MLE
    3.5.1 Asymptotic normality
    3.5.2 Efficiency properties
  3.6 Hypothesis testing
  3.A Suggested (optional) further reading
4 References
A Mathematical appendix
  A.1 Notation
  A.2 Matrices
  A.3 Asymptotics
    A.3.1 Modes of stochastic convergence
    A.3.2 Key results
  A.4 Suggested (optional) further reading


1 Instrumental variables

  • Throughout these notes, all variables are i.i.d. unless otherwise stated.

1.1 Introduction

• We would like to estimate the parameters $\beta_0 \in \mathbb{R}^{d_x}$ in
$$y_i = x_i^T\beta_0 + u_i \tag{1.1}$$
(here, as throughout, a '0' subscript denotes the true value of a parameter).
  – For various reasons – due to omitted variables, measurement error, unobservable heterogeneity and the like – the usual identifying orthogonality condition $Ex_iu_i = 0$ may not be considered plausible.
  – Those elements of $x_i$ for which this condition fails are said to be endogenous.

• Instead, identification will be achieved by means of the instruments $z_i$, which are assumed to fulfil this same orthogonality condition,
IV-ORTH  $Ez_iu_i = 0$.
(Note that $x_i$ and $z_i$ may share some common elements, but they cannot overlap entirely, for obvious reasons.)

1.2 Identification

1.2.1 Rank and order conditions

• A parameter is identified if it is uniquely determined from the joint distribution of the data, $w_i = (y_i, x_i, z_i)$. (At least in i.i.d. settings, in which the joint distribution can always be consistently estimated, this is equivalent to asking whether the parameter can be consistently estimated.)

• Identification of $\beta_0$ will follow if the equation
$$0 = E(y_i - x_i^T\beta)z_i = Ez_iy_i - Ez_ix_i^T\beta \tag{1.2}$$
has a unique solution at $\beta = \beta_0$. The r.h.s. depends only on $\beta$ and the distribution of $w_i$, through the moments $Ey_iz_i$ and $Ez_ix_i^T$.

• To show that $\beta_0$ indeed solves (1.2), rewrite it as
$$0 \stackrel{(1)}{=} E(x_i^T\beta_0 + u_i - x_i^T\beta)z_i \stackrel{(2)}{=} Ez_ix_i^T(\beta_0 - \beta), \tag{1.3}$$
where $\stackrel{(1)}{=}$ follows by (1.1), and $\stackrel{(2)}{=}$ by IV-ORTH.

• Is $\beta = \beta_0$ the only solution to (1.3)? Since $Ez_ix_i^T$ is a $d_z \times d_x$ matrix, the equation
$$[Ez_ix_i^T]\delta = 0$$
admits a solution at some $\delta \neq 0$ if and only if $\operatorname{rk} Ez_ix_i^T < d_x$ (see Appendix A.2). In this case, there will be other $\beta$'s, distinct from $\beta_0$, for which (1.3) holds.

• A necessary and sufficient condition for identification is thus
IV-RANK  $\operatorname{rk} Ez_ix_i^T = d_x$,
termed the rank condition (or, somewhat more informally, the relevance condition).
  – A necessary, but not sufficient, condition for the rank condition is that $d_z \ge d_x$, termed the order condition. In other words, there must be at least as many instruments as there are regressors.
  – The model is said to be exactly identified when $d_z = d_x$ – i.e. when we have just enough instruments to identify $\beta_0$ – and overidentified when $d_z > d_x$; in the latter case, we can test for some violations of IV-ORTH.

• In the overidentified case, we have strictly more instruments than are needed to identify $\beta_0$; in consequence, the number of such instruments may be reduced, down to $d_x$, without prejudicing identification.

• More formally, for some $d_x \times d_z$ matrix $L$, consider the $d_x$ new 'instruments' $z_{L,i} := Lz_i$, formed by taking $d_x$ linear combinations of the original instruments.
  – $z_{L,i}$ clearly satisfies the required orthogonality condition, since by IV-ORTH, $0 = LEz_iu_i = Ez_{L,i}u_i$.
  – Similarly, premultiplying (1.2) by $L$ yields the identifying condition
$$0 = Ez_{L,i}y_i - Ez_{L,i}x_i^T\beta \iff 0 = Ez_{L,i}x_i^T(\beta_0 - \beta),$$
where the equivalence follows via exactly the same reasoning as led from (1.2) to (1.3) above.
  – Thus the instruments $z_{L,i}$ are sufficient to identify $\beta_0$ if and only if the square matrix $Ez_{L,i}x_i^T = LEz_ix_i^T$ has full rank (i.e. rank $d_x$).

1.2.2 A restatement of the rank condition

  • There is another way of stating the rank condition that is often more convenient.
• First, note that for each $k \in \{1, \dots, d_x\}$, there exists a $\pi_{0,k} \in \mathbb{R}^{d_z}$ such that
$$x_{k,i} = z_i^T\pi_{0,k} + v_{k,i}, \tag{1.4}$$
with $Ez_iv_{k,i} = 0$; this corresponds to a population regression of $x_{k,i}$ on $z_i$, i.e. $\pi_{0,k} = (Ez_iz_i^T)^{-1}Ez_ix_{k,i}$.

• Stacking the $d_x$ equations (1.4) yields $x_i^T = z_i^T\Pi_0 + v_i^T$, or rather
$$x_i = \Pi_0^Tz_i + v_i,$$
where $\Pi_0$ is a $d_z \times d_x$ matrix whose $k$th column is given by $\pi_{0,k}$; these equations are termed the reduced form or first stage equations for $x_i$.

• By construction,
$$Ez_ix_i^T = Ez_i(z_i^T\Pi_0 + v_i^T) = (Ez_iz_i^T)\Pi_0,$$
whence
$$\operatorname{rk} Ez_ix_i^T = \operatorname{rk}(Ez_iz_i^T)\Pi_0 \stackrel{(2)}{=} \operatorname{rk}\Pi_0,$$
where $\stackrel{(2)}{=}$ holds if $Ez_iz_i^T$ is full rank – i.e. so long as the instruments are not themselves collinear (see Appendix A.2).

• IV-RANK is therefore often restated as
IV-RANK′  $\operatorname{rk}\Pi_0 = d_x$ and $\operatorname{rk} Ez_iz_i^T = d_z$.

1.3 Estimation

1.3.1 From identification to estimation

• How can we go from such an identifying condition as
$$0 = Ez_iy_i - Ez_ix_i^T\beta \tag{1.5}$$
to a (consistent) estimator for $\beta_0$?

• If $d_z = d_x$, then we can solve (1.5) directly for $\beta_0$, to obtain
$$\beta_0 = (Ez_ix_i^T)^{-1}Ez_iy_i, \tag{1.6}$$
which has the sample analogue
$$\hat\beta_n := \Big(\frac1n\sum_{i=1}^n z_ix_i^T\Big)^{-1}\frac1n\sum_{i=1}^n z_iy_i = \Big(\sum_{i=1}^n z_ix_i^T\Big)^{-1}\sum_{i=1}^n z_iy_i.$$
Consistency is then immediate from the LLN and Slutsky's theorem (for a restatement of these results, and the CLT, see Section A.3.2). There is a unique instrumental variables estimator in this case.
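To fix ideas, a minimal numerical sketch of this sample-analogue estimator on simulated data; the design (one endogenous regressor, one instrument) and all names are illustrative, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta0 = 5_000, 2.0

# Single endogenous regressor, single instrument (d_x = d_z = 1).
z = rng.normal(size=n)
v = rng.normal(size=n)               # first-stage error, correlated with u
x = 0.8 * z + v                      # first stage: E[z_i u_i] = 0, E[x_i u_i] != 0
u = 0.5 * v + rng.normal(size=n)     # structural error
y = beta0 * x + u

# Sample analogue of (1.6): beta_hat = (sum z_i x_i)^{-1} sum z_i y_i.
beta_iv = (z @ y) / (z @ x)
beta_ols = (x @ y) / (x @ x)         # inconsistent here, for comparison

print(f"IV estimate:  {beta_iv:.3f}")   # close to 2.0
print(f"OLS estimate: {beta_ols:.3f}")  # biased upwards, since cov(x, u) > 0
```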

• If $d_z > d_x$, then we cannot proceed in this manner; the matrix $Ez_ix_i^T$ in (1.5) is no longer square, and so cannot be inverted to yield such an expression as (1.6) for $\beta_0$.
  – A possible way to proceed here is to 'reduce' the problem to the exactly identified case, by taking a suitable linear combination of the original instruments.
  – Indeed, if $z_{L,i} := L^Tz_i$ (for $L$ a $d_z \times d_x$ matrix) is such a set of $d_x$ instruments for which $Ez_{L,i}x_i^T$ has full rank, then
$$0 = Ez_{L,i}y_i - Ez_{L,i}x_i^T\beta$$
may be inverted to obtain an expression for $\beta_0$ analogous to (1.6) above,
$$\beta_0 = (Ez_{L,i}x_i^T)^{-1}Ez_{L,i}y_i. \tag{1.7}$$
  – This suggests the estimator
$$\hat\beta_n(L) := \Big(\sum_{i=1}^n z_{L,i}x_i^T\Big)^{-1}\sum_{i=1}^n z_{L,i}y_i. \tag{1.8}$$
(Again, consistency is immediate.)

• In other words: we have replaced the $d_z$ identifying orthogonality conditions
$$0 = Ez_iy_i - Ez_ix_i^T\beta$$
by a smaller system of $d_x$ identifying conditions,
$$0 = L^T[Ez_iy_i - Ez_ix_i^T\beta] = Ez_{L,i}y_i - Ez_{L,i}x_i^T\beta.$$


  – Whereas the sample analogue of the overidentified system
$$0 = \frac1n\sum_{i=1}^n z_iy_i - \frac1n\sum_{i=1}^n z_ix_i^T\beta \tag{1.9}$$
does not, in general, admit an exact solution, the sample analogue of the exactly identified system
$$0 = \frac1n\sum_{i=1}^n z_{L,i}y_i - \frac1n\sum_{i=1}^n z_{L,i}x_i^T\beta \tag{1.10}$$
does – allowing us to estimate $\beta_0$ as the exact solution to (1.10).

• This raises the question of how to choose $L$. First, note that:
  – Although each choice of $L$ gives distinct matrices $Ez_{L,i}x_i^T$ and $Ez_{L,i}y_i$, the r.h.s. of (1.7) always equals $\beta_0$: the expression on the r.h.s. is invariant to $L$. On the other hand, the estimator $\hat\beta_n(L)$ in (1.8) does depend on the choice of $L$.
  – This difference between (1.7) and (1.8) arises because, while the orthogonality condition IV-ORTH holds exactly in the population, it does not hold exactly in sample: that is, in general, $\sum_{i=1}^n z_iu_i \neq 0$.
  – What is true, however, is that
$$\frac1n\sum_{i=1}^n z_iu_i \overset{p}{\to} Ez_iu_i = 0$$
by the LLN, which accounts for why $\hat\beta_n(L) \overset{p}{\to} \beta_0$, regardless of $L$.

• Thus, so far as consistency is concerned, the choice of $L$ is irrelevant – it will, however, matter for efficiency: different choices of $L$ will yield estimators with different asymptotic variances. As we shall see, under certain conditions, setting $L = \Pi_0$ delivers an estimator which is optimal in this sense (see Section 2.3.3 below).

• $\hat\beta_n(\Pi_0)$ is infeasible to compute, since $\Pi_0$ is unknown; but since $\Pi_0$ can be consistently estimated, by regressing each element of $x_i$ on (the whole of) $z_i$, it has a feasible counterpart
$$\hat\beta_n^{2SLS} := \hat\beta_n(\hat\Pi_n) = \Big(\sum_{i=1}^n \hat x_ix_i^T\Big)^{-1}\sum_{i=1}^n \hat x_iy_i \stackrel{(3)}{=} \Big(\sum_{i=1}^n \hat x_i\hat x_i^T\Big)^{-1}\sum_{i=1}^n \hat x_iy_i = (\hat X^T\hat X)^{-1}\hat X^Ty,$$
where, in keeping with the usual notation, $\hat x_i := \hat\Pi_n^Tz_i$.
  – $\stackrel{(3)}{=}$ follows by noting that the first-stage OLS residuals $\hat v_i$ are orthogonal to $z_i$, in the sense that $\sum_{i=1}^n z_i\hat v_i^T = 0$, by construction, whence
$$\sum_{i=1}^n \hat x_ix_i^T = \sum_{i=1}^n \hat x_i(\hat x_i + \hat v_i)^T = \sum_{i=1}^n \hat x_i\hat x_i^T + \hat\Pi_n^T\sum_{i=1}^n z_i\hat v_i^T = \sum_{i=1}^n \hat x_i\hat x_i^T.$$

  – $\hat\beta_n^{2SLS}$ is termed the two-stage least squares (2SLS) estimator, because it can be computed by: first regressing each element of $x_i$ on $z_i$, to obtain the fitted values $\hat x_i$; and then regressing $y_i$ on $\hat x_i$.
  – Since it will be useful in what follows, we note here that
$$\hat\Pi_n = \Big(\sum_{i=1}^n z_iz_i^T\Big)^{-1}\sum_{i=1}^n z_ix_i^T = (Z^TZ)^{-1}Z^TX,$$
which follows from the fact that the $k$th column of $\hat\Pi_n$ corresponds to the coefficients obtained from an OLS regression of $x_{k,i}$ upon $z_i$, and thus to
$$\hat\pi_{k,n} = \Big(\sum_{i=1}^n z_iz_i^T\Big)^{-1}\sum_{i=1}^n z_ix_{k,i}.$$
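A sketch of the two computations just described – the 'two regressions' route, which is what the closed form $(\hat X^T\hat X)^{-1}\hat X^Ty$ amounts to – on a simulated overidentified design ($d_z = 2$, $d_x = 1$); names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0 = 5_000, 2.0

Z = rng.normal(size=(n, 2))                   # d_z = 2 instruments
v = rng.normal(size=n)
x = Z @ np.array([0.6, 0.3]) + v              # first stage, Pi_0 = (0.6, 0.3)'
u = 0.5 * v + rng.normal(size=n)
y = beta0 * x + u
X = x[:, None]                                # n x d_x, with d_x = 1

# First stage: Pi_hat = (Z'Z)^{-1} Z'X, fitted values X_hat = Z Pi_hat.
Pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
X_hat = Z @ Pi_hat

# Second stage: regress y on X_hat.
beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
print(beta_2sls)                              # close to [2.0]
```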
1.3.2 Asymptotics

• We shall now derive the limiting distribution of the 2SLS estimator. Let $\tilde x_i := \Pi_0^Tz_i$ denote the infeasible 'ideal' instruments, and recall that $\hat x_i := \hat\Pi_n^Tz_i$ denotes their feasible counterparts.

• Application of the LLN and CLT shall require the existence of moments of a certain order; the following is sufficient for our purposes here:
IV-MOM  $E\|x_i\|^2 < \infty$, $E\|z_i\|^4 < \infty$, and $E|u_i|^4 < \infty$
(for the definition of the max-element norm $\|\cdot\|$, see Appendix A.1).

• As with OLS, certain simplifications – and certain optimality properties – are available when the error $u_i$ in the structural equation is homoskedastic, in the sense that
IV-HMSK  $E[u_i^2 \mid z_i] = Eu_i^2 =: \sigma_u^2$.
By the law of iterated expectations (LIE), this entails
$$Eu_i^2z_iz_i^T = E[\,E[u_i^2z_iz_i^T \mid z_i]\,] = E[\,E[u_i^2 \mid z_i]\,z_iz_i^T] = \sigma_u^2Ez_iz_i^T,$$
which is in fact the only implication of IV-HMSK that will be needed here.

Theorem 1.1. Suppose that IV-ORTH, IV-RANK′ and IV-MOM hold. Then $n^{1/2}(\hat\beta_n - \beta_0) \overset{d}{\to} N[0, V]$, where
$$V = (E\tilde x_i\tilde x_i^T)^{-1}(Eu_i^2\tilde x_i\tilde x_i^T)(E\tilde x_i\tilde x_i^T)^{-1}.$$
If additionally IV-HMSK holds, then $V = \sigma_u^2(E\tilde x_i\tilde x_i^T)^{-1}$.

Proof.

• Substituting the model $y_i = x_i^T\beta_0 + u_i$ into the formula for the 2SLS estimator gives
$$\hat\beta_n = \Big(\sum_{i=1}^n \hat x_ix_i^T\Big)^{-1}\sum_{i=1}^n \hat x_i(x_i^T\beta_0 + u_i) = \beta_0 + \Big(\sum_{i=1}^n \hat x_ix_i^T\Big)^{-1}\sum_{i=1}^n \hat x_iu_i.$$
Rearranging, and recalling $\sum_{i=1}^n \hat x_ix_i^T = \sum_{i=1}^n \hat x_i\hat x_i^T$, yields
$$n^{1/2}(\hat\beta_n - \beta_0) = \Big(\frac1n\sum_{i=1}^n \hat x_i\hat x_i^T\Big)^{-1}\frac{1}{n^{1/2}}\sum_{i=1}^n \hat x_iu_i =: M_n^{-1}s_n. \tag{1.11}$$

• Since $\hat x_i = \hat\Pi_n^Tz_i$, and $\hat\Pi_n$ depends on the entire sample, the $\hat x_i$'s are not i.i.d. This prevents the LLN and CLT from being directly applied to deduce the limiting behaviour of $M_n$ and $s_n$, which slightly complicates the proof.

• Before proceeding further, we first note that
$$\hat\Pi_n = \Big(\frac1n\sum_{i=1}^n z_iz_i^T\Big)^{-1}\frac1n\sum_{i=1}^n z_ix_i^T \stackrel{p}{\to}_{(2)} (Ez_iz_i^T)^{-1}Ez_ix_i^T = \Pi_0, \tag{1.12}$$
where $\stackrel{p}{\to}_{(2)}$ is a consequence of Slutsky's theorem and the LLN; the latter may be applied here since
$$\|Ex_iz_i^T\| \le_{(1)} E\|x_i\|\|z_i\| \le_{(2)} (E\|x_i\|^2)^{1/2}(E\|z_i\|^2)^{1/2} <_{(3)} \infty,$$
where $\le_{(1)}$ is a property of the max-element norm; $\le_{(2)}$ follows by the Cauchy–Schwarz inequality; and $<_{(3)}$ by IV-MOM.

• Turning now to $M_n$, we have
$$M_n = \hat\Pi_n^T\Big(\frac1n\sum_{i=1}^n z_iz_i^T\Big)\hat\Pi_n \stackrel{p}{\to}_{(2)} \Pi_0^TEz_iz_i^T\Pi_0 = E\tilde x_i\tilde x_i^T, \tag{1.13}$$
where $\stackrel{p}{\to}_{(2)}$ follows from the LLN, Slutsky's theorem and (1.12).

• By IV-RANK′, the matrix $E\tilde x_i\tilde x_i^T$ is invertible (see the problem set).

• Turning next to $s_n$, we first note that
$$\frac{1}{n^{1/2}}\sum_{i=1}^n z_iu_i \overset{d}{\to} N[0, Eu_i^2z_iz_i^T] \tag{1.14}$$
by the CLT, since $Ez_iu_i = 0$ by IV-ORTH, while
$$E\|z_iu_i\|^2 = E|u_i|^2\|z_i\|^2 \le (E|u_i|^4)^{1/2}(E\|z_i\|^4)^{1/2} < \infty$$
by the Cauchy–Schwarz inequality and IV-MOM.

• Together, (1.12), (1.14) and an application of Slutsky's theorem yield
$$s_n = \hat\Pi_n^T\frac{1}{n^{1/2}}\sum_{i=1}^n z_iu_i \overset{d}{\to} \Pi_0^T \cdot N[0, Eu_i^2z_iz_i^T] \sim N[0, Eu_i^2\tilde x_i\tilde x_i^T]. \tag{1.15}$$

• Finally, (1.11), (1.13) and (1.15), and a further appeal to Slutsky's theorem, yield the result. □

• The limiting variance
$$V = (E\tilde x_i\tilde x_i^T)^{-1}(Eu_i^2\tilde x_i\tilde x_i^T)(E\tilde x_i\tilde x_i^T)^{-1}$$
is exactly what we would have obtained if we had computed $\hat\beta_n$ using the infeasible instruments $\tilde x_i$, rather than their feasible counterparts. The estimation of $\Pi_0$ thus has no first-order effect on the limiting behaviour of $\hat\beta_n$.

• $V$ can be consistently estimated by
$$\hat V_n := \Big(\frac1n\sum_{i=1}^n \hat x_i\hat x_i^T\Big)^{-1}\Big(\frac1n\sum_{i=1}^n \hat u_i^2\hat x_i\hat x_i^T\Big)\Big(\frac1n\sum_{i=1}^n \hat x_i\hat x_i^T\Big)^{-1},$$
where, importantly, $\hat u_i := y_i - x_i^T\hat\beta_n$ is the residual from the structural equation.
  – A common mistake is to compute $\hat u_i$ as $y_i - \hat x_i^T\hat\beta_n$, the residual from the second-stage regression of $y_i$ on $\hat x_i$. (Using the OLS standard errors associated with the second-stage regression would amount to committing this same error.)
  – The proof that $\hat V_n \overset{p}{\to} V$ is somewhat tedious, but not conceptually difficult.
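Continuing the simulated example from above, a sketch of $\hat V_n$ computed with the structural residuals $\hat u_i = y_i - x_i^T\hat\beta_n$; the (incorrect) second-stage residuals are also formed, to make the distinction concrete. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0 = 5_000, 2.0
Z = rng.normal(size=(n, 2))
v = rng.normal(size=n)
X = (Z @ np.array([0.6, 0.3]) + v)[:, None]   # one endogenous regressor
u = 0.5 * v + rng.normal(size=n)
y = X[:, 0] * beta0 + u

X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)            # first-stage fitted values
beta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)     # 2SLS

M = X_hat.T @ X_hat / n                                  # estimates E[xt xt']
Minv = np.linalg.inv(M)
u_hat = y - X @ beta             # structural residuals: the correct choice
u_wrong = y - X_hat @ beta       # second-stage residuals: the common mistake
for label, resid in (("correct ", u_hat), ("mistaken", u_wrong)):
    Om = (X_hat * (resid ** 2)[:, None]).T @ X_hat / n   # estimates E[u^2 xt xt']
    V_hat = Minv @ Om @ Minv                             # sandwich form of V_hat
    print(label, beta, np.sqrt(np.diag(V_hat) / n))      # estimate and std. error
```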

• Theorem 1.1 provides a basis for Wald tests of linear restrictions of the form
$$H_0 : R\beta = \rho \qquad\text{against}\qquad H_1 : R\beta \neq \rho,$$
where $R$ is a $d_r \times d_x$ matrix having rank $d_r$ (so we are testing $d_r$ linear restrictions).
  – By the theorem, under $H_0$,
$$n^{1/2}[R\hat\beta_n - \rho] = R[n^{1/2}(\hat\beta_n - \beta_0)] \overset{d}{\to} R \cdot N[0, V] \sim N[0, RVR^T].$$
  – Thus, under $H_0$, the Wald statistic satisfies
$$W_n := n(R\hat\beta_n - \rho)^T(R\hat V_nR^T)^{-1}(R\hat\beta_n - \rho) \overset{d}{\to} \chi^2[d_r].$$
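Given $\hat\beta_n$ and $\hat V_n$ as computed in the previous sketch, the Wald statistic is a few lines; `wald_test` below is an illustrative helper, not a library routine.

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta, V_hat, n, R, rho):
    """Wald statistic W_n = n (R b - rho)' (R V R')^{-1} (R b - rho),
    asymptotically chi^2 with d_r = rank(R) degrees of freedom under H0."""
    diff = R @ beta - rho
    W = n * diff @ np.linalg.solve(R @ V_hat @ R.T, diff)
    return W, chi2.sf(W, R.shape[0])          # statistic and asymptotic p-value

# e.g. with d_x = 1, testing H0: beta = 2 amounts to R = [[1.]], rho = [2.]:
# W, p = wald_test(beta, V_hat, n, np.array([[1.0]]), np.array([2.0]))
```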

1.4 The ‘exclusion restriction’

• Returning to the structural equation:
$$y_i = x_i^T\beta_0 + u_i = x_{1i}^T\beta_{0,1} + x_{2i}^T\beta_{0,2} + u_i, \tag{1.16}$$
where we have partitioned $x_i = (x_{1i}^T, x_{2i}^T)^T$:
  – $x_{1i}$ ($d_{x_1} \times 1$): exogenous, $Ex_{1i}u_i = 0$;
  – $x_{2i}$ ($d_{x_2} \times 1$): potentially endogenous, $Ex_{2i}u_i \neq 0$ (?);
  – the $x_{1i}$ are valid instruments, but are not sufficient in number to identify $\beta_0$.

• Partition the instruments:
$$z_i = \begin{pmatrix} z_{1i}\\ z_{2i}\end{pmatrix} = \begin{pmatrix} x_{1i}\\ z_{2i}\end{pmatrix},$$
where $z_{1i} = x_{1i}$ are the $d_{x_1}$ exogenous regressors, and $z_{2i}$ are the $d_{z_2}$ excluded instruments.

• $z_{2i}$ are the excluded instruments (or, more loosely, 'the' instruments): so called because they do not appear on the r.h.s. of (1.16). This feature of the model is also termed the exclusion restriction.
  – What does this mean? $z_{2i}$ can only affect $y_i$ through $x_{2i}$: it cannot have a direct effect on $y_i$.
  – In economic terms, this is a very demanding requirement: which is why it is so difficult to find convincing instruments in practice.

• Order condition: here equivalent to $d_{z_2} \ge d_{x_2}$; we need at least as many excluded instruments as endogenous regressors.

• Rank condition? Rewrite the first stage as
$$\begin{pmatrix} x_{1i}\\ x_{2i}\end{pmatrix} = \Pi_0^T\begin{pmatrix} z_{1i}\\ z_{2i}\end{pmatrix} + \begin{pmatrix} v_{1i}\\ v_{2i}\end{pmatrix} = \begin{pmatrix} I_{d_{x_1}} & 0\\ \Pi_{0,1}^T & \Pi_{0,2}^T\end{pmatrix}\begin{pmatrix} x_{1i}\\ z_{2i}\end{pmatrix} + \begin{pmatrix} 0\\ v_{2i}\end{pmatrix}. \tag{1.17}$$
Then IV-RANK′ holds iff
$$d_{x_1} + d_{x_2} \le \operatorname{rk}\Pi_0 = d_{x_1} + \operatorname{rk}\Pi_{0,2},$$
i.e. iff $\operatorname{rk}\Pi_{0,2} \ge d_{x_2}$.
  – (1.17) also indicates that the first $d_{x_1}$ first-stage equations hold without error: so $\hat x_{1i} = x_{1i} = \tilde x_{1i}$.

• This framework allows us to highlight a common mistake in computing 2SLS: when computing $\hat x_{2i}$, $x_{2i}$ must be regressed on both $x_{1i}$ and $z_{2i}$. Regression of $x_{2i}$ on $z_{2i}$ alone yields an inconsistent estimator (see the problem set).

1.5 Another way of computing the 2SLS estimator

• Here again are the structural equation and the first stage:
$$y_i = x_{1i}^T\beta_{0,1} + x_{2i}^T\beta_{0,2} + u_i \tag{1.18}$$
$$\begin{pmatrix} x_{1i}\\ x_{2i}\end{pmatrix} = \begin{pmatrix} I_{d_{x_1}} & 0\\ \Pi_{0,1}^T & \Pi_{0,2}^T\end{pmatrix}\begin{pmatrix} x_{1i}\\ z_{2i}\end{pmatrix} + \begin{pmatrix} 0\\ v_{2i}\end{pmatrix} = \Pi_0^Tz_i + v_i,$$
with $Ez_i(u_i, v_{2i}^T) = 0$ (by IV-ORTH and the definition of $\Pi_0$ as a matrix of population regression coefficients).

• Because $Ex_{2i}u_i \neq 0$, OLS is inconsistent. In a certain sense, this can always be regarded as a species of 'omitted variable' problem.
  – Why? $x_{2i} = \tilde x_{2i} + v_{2i}$ has two components: $E\tilde x_{2i}u_i = 0$ (by IV-ORTH), but $Ev_{2i}u_i \neq 0$ (possibly).
  – If $v_{2i}$ could be included in the structural equation, the correlation between $x_{2i}$ and the error term would disappear!

• More formally, orthogonal decomposition (population regression) of $u_i$ yields
$$u_i = v_{2i}^T\rho_0 + \epsilon_i, \qquad Ev_{2i}\epsilon_i = 0, \tag{1.19}$$
and substituting (1.19) into (1.18) gives
$$y_i = x_i^T\beta_0 + v_{2i}^T\rho_0 + \epsilon_i. \tag{1.20}$$

• Is (1.20) a valid regression equation? Yes, because $Ev_{2i}\epsilon_i = 0$, and
$$Ex_i\epsilon_i = E(\Pi_0^Tz_i + v_i)\epsilon_i = \Pi_0^TEz_i(u_i - v_{2i}^T\rho_0) \stackrel{(3)}{=} 0,$$
where $\stackrel{(3)}{=}$ follows by $Ez_i(u_i, v_{2i}^T) = 0$ (and $Ev_i\epsilon_i = 0$, since $v_i = (0^T, v_{2i}^T)^T$).

• Can we therefore estimate $\beta_0$ by OLS, using the augmented structural equation (1.20)?
  – Infeasible, because $v_{2i}$ is unobserved.
  – Feasible version: regress $y_i$ on $x_i$ and $\hat v_{2i} = x_{2i} - \hat x_{2i}$.

• This yields a consistent estimator of $\beta_0$ (and of $\rho_0$), sometimes termed the control function estimator: the idea being that the endogeneity in $x_{2i}$ is 'controlled for' by the inclusion of $v_{2i}$ (or rather, $\hat v_{2i}$).
  – Actually, this procedure identically reproduces the 2SLS estimator (see the problem set). But in more general (e.g. nonlinear) settings, these approaches yield different estimators (and, indeed, different identifying conditions).
  – The 'usual' OLS standard errors associated with the feasible version of (1.20) are not suitable for inference, since these ignore the estimation of $\hat v_{2i}$.
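A sketch of the feasible control function procedure, on simulated data (one exogenous regressor, one endogenous regressor, one excluded instrument); as remarked above, the coefficients on $(x_{1i}, x_{2i})$ reproduce 2SLS exactly in this linear model. The design and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x1 = rng.normal(size=n)                       # exogenous regressor
z2 = rng.normal(size=n)                       # excluded instrument
v2 = rng.normal(size=n)
x2 = 0.5 * x1 + 0.8 * z2 + v2                 # first stage
u = 0.6 * v2 + rng.normal(size=n)             # rho_0 = 0.6: x2 is endogenous
y = 1.0 * x1 + 2.0 * x2 + u

ones = np.ones(n)
Z = np.column_stack([ones, x1, z2])           # instruments incl. exogenous regressors
v2_hat = x2 - Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]   # first-stage residuals

# Regress y on (1, x1, x2, v2_hat): coefficients on (1, x1, x2) equal 2SLS.
W = np.column_stack([ones, x1, x2, v2_hat])
coef = np.linalg.lstsq(W, y, rcond=None)[0]
print(coef)     # approx [0, 1, 2, 0.6]; last entry estimates rho_0
```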

1.6 Testing exogeneity of the endogenous regressors (Hausman test)

• The preceding framework is useful, because $\rho_0$ carries information about the extent of the endogeneity in $x_{2i}$. Note that
$$Ex_{2i}u_i = E(\tilde x_{2i} + v_{2i})u_i \stackrel{(1)}{=} Ev_{2i}(v_{2i}^T\rho_0 + \epsilon_i) \stackrel{(2)}{=} Ev_{2i}v_{2i}^T\rho_0,$$
where $\stackrel{(1)}{=}$ follows by IV-ORTH, and $\stackrel{(2)}{=}$ by (1.19). Thus, supposing $Ev_{2i}v_{2i}^T$ is full rank,
$$Ex_{2i}u_i = 0 \iff \rho_0 = 0.$$

• Having a consistent (and asymptotically normal) estimate of $\rho_0$ – as a by-product of computing the control function estimator – thus allows us to test
$$H_0 : Ex_{2i}u_i = 0 \qquad\text{against}\qquad H_1 : Ex_{2i}u_i \neq 0$$

by rephrasing this as a test of
$$H_0' : \rho_0 = 0 \qquad\text{against}\qquad H_1' : \rho_0 \neq 0.$$
  – Note that $Ex_{1i}u_i = 0$ (and more generally, IV-ORTH) is maintained under both the null and the alternative – we shall discuss a test of this condition subsequently.
  – A test of $H_0$ – which may be accomplished in various other ways – is often referred to as a 'Hausman test'.

• How? Regress $y_i$ on $x_i$ and $\hat v_{2i}$, to obtain the control function estimate $\hat\rho_n$; let $\hat V_{n,\rho}$ denote the usual (hetero-robust) OLS asymptotic variance estimator associated with these parameters.
  – Under $H_0'$ (and only under $H_0'$), $\hat V_{n,\rho}$ delivers a valid estimate of the asymptotic variance of $\hat\rho_n$ (see the problem set).
  – Thus the appropriate Wald statistic has the usual limiting distribution,
$$W_n := n\hat\rho_n^T\hat V_{n,\rho}^{-1}\hat\rho_n \overset{d}{\to} \chi^2[d_{x_2}].$$
  – I.e. for testing this specific null, no adjustment needs to be made for the fact that $\hat v_{2i}$ was estimated. This would not be the case if $\rho_0 = \rho^*$, for some $\rho^* \neq 0$, happened to be of interest. (For more general 'exceptional' cases of this kind, see Wooldridge, 2002, Sec. 6.1.)

• Why might we want to do this?
  – Under the null, all regressors are exogenous: $\beta_0$ can be estimated by OLS. Since OLS may deliver much more precise estimates than 2SLS, we'd prefer to use the OLS estimates, if they were valid. Testing $H_0$ can be viewed as testing the validity of OLS.
  – It is common empirical practice to report both OLS and 2SLS estimates, together with the result (expressed as, say, a p-value) of a test of $H_0$.
  – One caveat here is that a failure to reject $H_0$ could simply be due to the low power of the test: and the test will be less powerful the weaker the instruments.

1.7 Testing the identifying conditions

• In the linear IV model,
$$y_i = x_i^T\beta_0 + u_i,$$
with instruments $z_i$, recall the two conditions that ensure $\beta_0$ is identified:
IV-ORTH  $Ez_iu_i = 0$;
IV-RANK′  $\operatorname{rk}\Pi_0 = d_x$ and $\operatorname{rk} Ez_iz_i^T = d_z$.
We now turn to the question of how each of these conditions might be tested.

1.7.1 Tests of overidentifying restrictions (Sargan test)

• We would like to test IV-ORTH, the empirical counterpart of which is
$$\frac1n\sum_{i=1}^n z_i\hat u_i, \qquad\text{where } \hat u_i = y_i - x_i^T\hat\beta_n.$$

• As we shall see, our ability to detect violations of IV-ORTH is somewhat limited.

• By construction, the 2SLS estimator is the unique solution to the following sample orthogonality condition:
$$0 = \hat\Pi_n^T\sum_{i=1}^n z_i(y_i - x_i^T\beta).$$

• If $d_z = d_x$, this reduces to
$$0 = \sum_{i=1}^n z_i(y_i - x_i^T\hat\beta_n) = \sum_{i=1}^n z_i\hat u_i,$$
and no test of IV-ORTH is possible. (In exactly the same way, it is impossible to test the exogeneity of any r.h.s. variables in a regression model.)

• If $d_z > d_x$, then some (limited) progress can be made. For then
$$0 = \sum_{i=1}^n z_i(y_i - x_i^T\beta)$$
cannot (generally) be solved by any $\beta$: it is a system of $d_z$ equations in $d_x$ unknowns.
  – Evaluating the r.h.s. at the 2SLS estimator gives us a means of detecting some departures from IV-ORTH.

• Because $\beta_0$ must be estimated, there will be a $d_x$-dimensional linear subspace $\Phi \subseteq \mathbb{R}^{d_z}$ such that a test of
$$H_0 : Ez_iu_i = 0 \qquad\text{against}\qquad H_1 : Ez_iu_i \neq 0$$
will have no power against an alternative of the form $Ez_iu_i = \phi \in \Phi$.
  – (In other words, we are really testing $H_0 : Ez_iu_i \in \Phi$ against $H_1 : Ez_iu_i \notin \Phi$.)

• Regarding the test statistic, note that under IV-HMSK,
$$\xi_n := \Big(\frac{\hat\sigma_u^2}{n}\sum_{i=1}^n z_iz_i^T\Big)^{-1/2}\frac{1}{n^{1/2}}\sum_{i=1}^n z_iu_i \overset{d}{\to} N[0, I_{d_z}], \tag{1.21}$$
where $A^{-1/2}$ denotes the positive definite square root of $A^{-1}$ (see Appendix A.2); whence $\xi_n^T\xi_n \overset{d}{\to} \chi^2[d_z]$.

• When $u_i$ is replaced by $\hat u_i$ – yielding $\hat\xi_n$ – (1.21) no longer holds; we obtain instead that
$$\hat\xi_n \overset{d}{\to} P \cdot N[0, I_{d_z}], \tag{1.22}$$
where $P$ is an (orthogonal) projection matrix (i.e. a symmetric and idempotent matrix), with $\operatorname{rk}P = d_z - d_x$.
  – In consequence of (1.22) and Lemma 1.1 below,
$$\hat\xi_n^T\hat\xi_n = \Big(\sum_{i=1}^n z_i\hat u_i\Big)^T\Big(\hat\sigma_u^2\sum_{i=1}^n z_iz_i^T\Big)^{-1}\Big(\sum_{i=1}^n z_i\hat u_i\Big) \overset{d}{\to} \chi^2[d_z - d_x]. \tag{1.23}$$
  – In the usual parlance, the estimation of $\beta_0$ 'absorbs' $d_x$ degrees of freedom.

• For (1.23) to hold: if IV-HMSK holds, then $\hat u_i$ must be computed using the 2SLS residuals; no other IV estimator (i.e. none corresponding to a different choice of $L$) will yield this limiting distribution. (In general, the limiting distribution will be much more complicated.)

• If IV-HMSK fails, then some other estimator of $\beta_0$ must be used; this estimator will be discussed later, in the context of GMM.

• This test is sometimes referred to as a Sargan test.
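A sketch of the Sargan statistic (1.23) under IV-HMSK, computed – as just required – from the 2SLS residuals; `sargan_test` is an illustrative helper, not a library routine.

```python
import numpy as np
from scipy.stats import chi2

def sargan_test(y, X, Z):
    """Sargan statistic (sum z u)' [sigma^2 sum z z']^{-1} (sum z u),
    asymptotically chi^2[d_z - d_x] under IV-ORTH and IV-HMSK."""
    n, dx = X.shape
    dz = Z.shape[1]
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    beta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)   # 2SLS
    u_hat = y - X @ beta                                   # 2SLS residuals
    s2 = u_hat @ u_hat / n                                 # sigma_u^2 estimate
    zu = Z.T @ u_hat                                       # sum_i z_i u_hat_i
    S = zu @ np.linalg.solve(s2 * (Z.T @ Z), zu)
    return S, chi2.sf(S, dz - dx)

# Requires d_z > d_x; when d_z = d_x, zu = 0 identically and no test is possible.
```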

Lemma 1.1. Suppose $\eta \sim N[0, I_r]$, and $P$ is an $r \times r$ (orthogonal) projection matrix with $\operatorname{rk}P = s \le r$. Then $\eta^TP\eta \sim \chi^2[s]$.

Proof.
• Since $P$ is symmetric, it admits the decomposition $P = C\Lambda C^T$, where $C$ is an orthonormal matrix ($C^{-1} = C^T$), and $\Lambda$ a diagonal matrix whose entries correspond to the eigenvalues of $P$ (see Appendix A.2).
• Since it is also idempotent, with rank $s$, $P$ has $s$ eigenvalues equal to unity, and $r - s$ equal to 0. Thus we may choose $C$ such that
$$\Lambda = \begin{pmatrix} I_s & 0\\ 0 & 0\end{pmatrix}.$$
• Hence, letting $\zeta := C^T\eta$,
$$\eta^TP\eta = \eta^TC\Lambda C^T\eta = \zeta^T\begin{pmatrix} I_s & 0\\ 0 & 0\end{pmatrix}\zeta = \sum_{i=1}^s \zeta_i^2 \stackrel{(4)}{\sim} \chi^2[s],$$
where $\stackrel{(4)}{\sim}$ follows from $\zeta$ being normally distributed with $\operatorname{var}(\zeta) = C^T\operatorname{var}(\eta)C = C^TC = I_r$. □

1.7.2 Testing the rank condition

• By contrast, we are not so limited in our ability to detect violations of the rank condition.

• The rank condition pertains only to the first stage,
$$\begin{pmatrix} x_{1i}\\ x_{2i}\end{pmatrix} = \begin{pmatrix} I_{d_{x_1}} & 0\\ \Pi_1^T & \Pi_2^T\end{pmatrix}\begin{pmatrix} x_{1i}\\ z_{2i}\end{pmatrix} + \begin{pmatrix} 0\\ v_{2i}\end{pmatrix};$$
recall that IV-RANK′ is equivalent to $\operatorname{rk}\Pi_2 = d_{x_2}$. (To make the notation a little less cumbersome, we drop the '0' subscripts denoting the truth for the time being.)

• Consider first the case where $d_{x_2} = 1$. Then the only relevant equation is
$$x_{2i} = \pi_1^Tx_{1i} + \pi_2^Tz_{2i} + v_{2i}, \tag{1.24}$$
where $\pi_1 \in \mathbb{R}^{d_{x_1}}$ and $\pi_2 \in \mathbb{R}^{d_{z_2}}$.
  – IV-RANK′ is here $\operatorname{rk}\pi_2 = 1$, which obtains iff $\pi_2$ has at least one non-zero element.
  – A test of $H_0 : \pi_2 = 0$, and thus of IV-RANK′, can be carried out via the usual F (or Wald) test for joint significance in the regression (1.24).

• Now suppose $d_{x_2} = 2$, so that we have two equations:
$$x_{2,1i} = \pi_{11}^Tx_{1i} + \pi_{12}^Tz_{2i} + v_{2,1i} \tag{1.25}$$
$$x_{2,2i} = \pi_{21}^Tx_{1i} + \pi_{22}^Tz_{2i} + v_{2,2i}. \tag{1.26}$$

• We could estimate each regression equation, and run F tests of $H_0 : \pi_{12} = 0$ and $H_0 : \pi_{22} = 0$. A failure to reject at least one of these nulls would indicate $\operatorname{rk}\Pi_2 \le 1 < 2$, and thus a violation of the rank condition.
  – F statistics for each first-stage equation are thus a useful diagnostic, and are often reported in applied work.

• On the other hand, $\pi_{12} \neq 0$ and $\pi_{22} \neq 0$ do not imply that
$$\Pi_2 = \begin{pmatrix} \pi_{12} & \pi_{22}\end{pmatrix}$$
has full rank: it could be the case, for example, that only the first instrument in $z_{2i}$ has a nonzero coefficient, in both (1.25) and (1.26).

• To test $\operatorname{rk}\Pi_2 = d_{x_2}$, we need an appropriate, multiple-equation generalisation of the single-equation F test. Various tests of this kind exist; the Cragg–Donald test is widely used for this purpose.

• Under IV-HMSK, the Cragg–Donald statistic for testing the rank of $\Pi_2$ can be expressed in terms of
$$W_n := \hat\Sigma_{VV}^{-1/2}\,\hat\Pi_2^T\Big(\sum_{i=1}^n \bar z_{2i}\bar z_{2i}^T\Big)\hat\Pi_2\,\hat\Sigma_{VV}^{-1/2},$$
where $\bar z_{2i}$ denotes the residuals from regressing each element of $z_{2i}$ on $x_{1i}$, and $\hat\Sigma_{VV} := \frac1n\sum_{i=1}^n \hat v_{2i}\hat v_{2i}^T$. This can be regarded as a kind of matrix analogue of a Wald statistic.
  – Observe that
$$n^{-1}W_n \overset{p}{\to} \Sigma_{VV}^{-1/2}\Pi_2^T\big(E\bar z_{2i}\bar z_{2i}^T\big)\Pi_2\Sigma_{VV}^{-1/2},$$
and the r.h.s. is positive semi-definite, with rank equal to the rank of $\Pi_2$.
  – The rank of a symmetric matrix is equal to the number of its eigenvalues that are non-zero. This suggests testing
$$H_0 : \operatorname{rk}\Pi_2 \le d_{x_2} - k \qquad\text{against}\qquad H_1 : \operatorname{rk}\Pi_2 > d_{x_2} - k$$
by comparing the $k$ smallest eigenvalues of $W_n$ with zero.

• Most relevant, for our purposes, is the case where $k = 1$; it can be shown that
$$\lambda_{\min}(W_n) \overset{d}{\to} \chi^2[(d_{z_2} - d_{x_2}) + 1]$$
under $H_0 : \operatorname{rk}\Pi_2 \le d_{x_2} - 1$, and diverges (to $\infty$) under the alternative ($H_1 : \operatorname{rk}\Pi_2 = d_{x_2}$).

• In the context of testing IV-RANK′, $\lambda_{\min}(W_n)$ is often referred to as 'the' Cragg–Donald statistic.
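A sketch of $\lambda_{\min}(W_n)$ as defined above, assuming homoskedastic first-stage errors; `cragg_donald` is an illustrative helper, and `x1` should include a constant (or be a column of ones if there are no included exogenous regressors).

```python
import numpy as np

def cragg_donald(x1, z2, x2):
    """lambda_min of W_n = Sig^{-1/2} Pi2' (sum zb zb') Pi2 Sig^{-1/2},
    where zb are the residuals of z2 on x1, Pi2 the first-stage
    coefficients on z2, and Sig the first-stage residual covariance;
    small values signal (near-)failure of the rank condition."""
    # Residualise z2 and x2 on the included exogenous regressors x1 (FWL).
    proj = lambda A, B: B - A @ np.linalg.lstsq(A, B, rcond=None)[0]
    zb = proj(x1, z2)                                   # z2 orthogonal to x1
    xb = proj(x1, x2)
    Pi2 = np.linalg.lstsq(zb, xb, rcond=None)[0]        # d_z2 x d_x2
    V = xb - zb @ Pi2                                   # first-stage residuals
    n = x2.shape[0]
    Sig = V.T @ V / n
    # Symmetric inverse square root of Sig via its eigendecomposition.
    w, C = np.linalg.eigh(Sig)
    Sig_mhalf = C @ np.diag(w ** -0.5) @ C.T
    Wn = Sig_mhalf @ Pi2.T @ (zb.T @ zb) @ Pi2 @ Sig_mhalf
    return np.linalg.eigvalsh(Wn).min()
```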

1.8 Weak instruments

1.8.1 The problem

• When $\Pi_0$ is such that the rank condition is 'nearly' violated, the instruments are said to be weak.

• Weak instruments are a serious problem, because they render standard methods for conducting inference (such as the Wald statistics introduced earlier) invalid.

• To understand how the problem arises, consider the case of a single endogenous regressor and a single instrument,
$$y_i = \beta_0x_i + u_i, \qquad x_i = \pi_0z_i + v_i.$$
Then the rank condition holds provided $\pi_0 \neq 0$. Suppose also that IV-HMSK holds.

• If $\pi_0 \neq 0$, then
$$n^{1/2}(\hat\beta_n - \beta_0) = \frac{n^{-1/2}\sum_{i=1}^n z_iu_i}{n^{-1}\sum_{i=1}^n z_ix_i} \stackrel{d}{\to}_{(2)} \frac{N[0, \sigma_u^2Ez_i^2]}{\pi_0Ez_i^2} \sim N[0, \sigma_u^2\pi_0^{-2}(Ez_i^2)^{-1}], \tag{1.27}$$
where $\stackrel{d}{\to}_{(2)}$ follows from Slutsky's theorem and
$$\frac1n\sum_{i=1}^n z_ix_i = \frac1n\sum_{i=1}^n z_i(\pi_0z_i + v_i) \overset{p}{\to} Ez_i(\pi_0z_i + v_i) = \pi_0Ez_i^2$$
by the LLN.

• If $\pi_0 = 0$, then $x_i = v_i$, and so
$$\hat\beta_n - \beta_0 = \frac{\sum_{i=1}^n z_iu_i}{\sum_{i=1}^n z_ix_i} = \frac{\sum_{i=1}^n z_iu_i}{\sum_{i=1}^n z_iv_i}.$$
  – Both numerator and denominator are sums of mean-zero, i.i.d. random variables; the CLT gives
$$\frac{1}{n^{1/2}}\sum_{i=1}^n \begin{pmatrix} z_iu_i\\ z_iv_i\end{pmatrix} \overset{d}{\to} \begin{pmatrix}\eta_1\\ \eta_2\end{pmatrix} \sim N[0, \Omega], \qquad \Omega = \begin{pmatrix}\sigma_u^2 & \sigma_{uv}\\ \sigma_{uv} & \sigma_v^2\end{pmatrix}Ez_i^2.$$
  – Thus, by Slutsky's theorem,
$$\hat\beta_n - \beta_0 = \frac{n^{-1/2}\sum_{i=1}^n z_iu_i}{n^{-1/2}\sum_{i=1}^n z_iv_i} \overset{d}{\to} \frac{\eta_1}{\eta_2},$$
a ratio of two (possibly correlated) normal variates: this is very different from the limiting distribution in (1.27).

• The normal approximation to the distribution of $n^{1/2}(\hat\beta_n - \beta_0)$, which underpins inference based on the t and Wald statistics, thus breaks down completely when $\pi_0 = 0$.
  – With a little more work, we can show that the same phenomenon affects the t-statistic: when $\pi_0 \neq 0$, we have
$$t_n = \hat\sigma_u^{-1}\Big(\sum_{i=1}^n \hat x_i^2\Big)^{1/2}(\hat\beta_n - \beta_0) \overset{d}{\to} N[0, 1]; \tag{1.28}$$
but when $\pi_0 = 0$, the limiting distribution of $t_n$ is highly non-normal (and depends on parameters that cannot be consistently estimated).

• We might hope to assume away this problem by 'imposing' the rank condition, i.e. by simply assuming that $\pi_0 \neq 0$ here. But unfortunately, approximate non-normality afflicts the t statistic even when $\pi_0$ is small but nonzero – in which case the rank condition is satisfied – even in 'large' samples.

• To illustrate the problem, let $G_n(\cdot\,;\pi_0)$ denote the cdf of the finite sample distribution of $t_n$; the second argument indicates that this quantity depends on the value of $\pi_0$.
  – For each $\pi_0 \neq 0$, we have that $G_n(\tau;\pi_0) \to G(\tau;\pi_0) = \Phi(\tau)$ for each $\tau \in \mathbb{R}$, where $\Phi$ denotes the standard normal cdf, as appears on the r.h.s. of (1.28). This provides the usual justification for drawing critical values for the t test from the quantiles of a standard normal distribution.
  – On the other hand, $G_n(\cdot\,;0)$ remains highly non-normal for all $n$, regardless of how large: thus if $\pi_0 = 0$, the use of normal critical values will deliver a test that is highly unreliable (in the sense that the actual null rejection rate may be a long way from the nominal significance level of the test).

• Now suppose we fix a 'moderately large' $n$, and consider how $G_n(\cdot\,;\pi_0)$ varies with $\pi_0$. For 'large' $\pi_0$, the normal approximation to $G_n(\cdot\,;\pi_0)$ works well; it is close to $G(\cdot\,;\pi_0)$.
  – But as we reduce $\pi_0$ closer to zero, $G_n(\cdot\,;\pi_0)$ has to become more and more like $G_n(\cdot\,;0)$: the normal approximation must break down well before we reach $\pi_0 = 0$. (A Monte Carlo sketch of this appears below.)
  – The larger $n$ is, the closer to zero we may take $\pi_0$ before the normal approximation breaks down; but there will always be a range of nonzero $\pi_0$'s – and thus models technically satisfying IV-RANK′ – for which it will not. For such $\pi_0$'s, the t test will be almost as unreliable as it would be if $\pi_0 = 0$!

• To put this another way: while IV-RANK′ is sufficient for identification, it is not sufficient to ensure that standard inferential procedures will be tolerably reliable, even in 'reasonably large' samples.

1.8.2 Dealing with weak instruments

• In view of the preceding, it is not sufficient to know whether or not the rank condition fails: we also need to know whether it 'nearly' fails.

• Thus simply carrying out a test of the rank condition is not quite sufficient: to continue the preceding example, we might correctly reject $H_0 : \pi_0 = 0$ – and yet the actual value of $\pi_0$ may still be small enough to cause us difficulties.

• We therefore need to modify the critical values used in these tests, so as to permit non-rejection when $\Pi_0$ is merely 'close' to having deficient rank, in the sense that the normal approximation fails to hold (to within some specified tolerance).
  – This kind of reasoning led Staiger & Stock (1997) to recommend comparing the first-stage F statistic, in a model with only one endogenous regressor, with a critical value of 10. This compares with the critical value of 3.78 that would be used to test the rank condition, at the 1% level, if there were three available instruments.
  – For a model with multiple endogenous regressors, Stock & Yogo (2005) give modified critical values for the Cragg–Donald test.

• Tests using these modified critical values can be regarded as providing a kind of 'pre-test' for weak instruments. Thus a common way of dealing with weak instruments is to continue to use standard procedures to draw inferences about $\beta_0$, but to also report the outcomes of these pre-tests.

• However, as the example from the preceding section illustrates, there is no clear boundary demarcating 'weak' from 'strong' instruments: all that can be said is that the weaker the instruments, the less reliable standard inferences will be.

• Thus the most satisfying response to the weak instruments problem involves the development of inferential procedures that are completely robust to weak instruments – being able to tolerate even a failure of IV-RANK′. The conceptually simplest of these is the Anderson–Rubin (1949) test.

1.8.3 The Anderson–Rubin (AR) test

• We want to test the null $H_0 : \beta_2 = \beta_2^*$ (against $H_1 : \beta_2 \neq \beta_2^*$) in the model
$$y_i = x_{1i}^T\beta_1 + x_{2i}^T\beta_2 + u_i$$
$$\begin{pmatrix} x_{1i}\\ x_{2i}\end{pmatrix} = \begin{pmatrix} I_{d_{x_1}} & 0\\ \Pi_1^T & \Pi_2^T\end{pmatrix}\begin{pmatrix} x_{1i}\\ z_{2i}\end{pmatrix} + \begin{pmatrix} 0\\ v_{2i}\end{pmatrix},$$
where $z_i = (x_{1i}^T, z_{2i}^T)^T$ are the available instruments.

• Under the null, we have
$$y_i(\beta_2^*) := y_i - x_{2i}^T\beta_2^* = x_{1i}^T\beta_1 + u_i,$$
while under any alternative $\beta_2 \neq \beta_2^*$, letting $\delta := \beta_2 - \beta_2^*$,
$$y_i(\beta_2^*) = x_{1i}^T\beta_1 + x_{2i}^T(\beta_2 - \beta_2^*) + u_i = x_{1i}^T\beta_1 + (\Pi_1^Tx_{1i} + \Pi_2^Tz_{2i} + v_{2i})^T\delta + u_i = x_{1i}^T(\beta_1 + \Pi_1\delta) + z_{2i}^T(\Pi_2\delta) + (v_{2i}^T\delta + u_i) =: x_{1i}^T\kappa_1 + z_{2i}^T\kappa_2 + \eta_i. \tag{1.29}$$

• (1.29) is clearly a valid regression equation, in the sense that $Ex_{1i}\eta_i = 0$ and $Ez_{2i}\eta_i = 0$. Because we know $\beta_2^*$ under the null, we can compute $y_i(\beta_2^*)$, and thus we can always consistently estimate $\kappa_1$ and $\kappa_2$ via an OLS regression of $y_i(\beta_2^*)$ on $z_i = (x_{1i}^T, z_{2i}^T)^T$. Indeed, the usual t and Wald tests remain entirely valid here.

• This fact is exceedingly useful, because we can rephrase the null of interest in terms of the coefficient $\kappa_2$.
  – Under $H_0$, $\delta = 0$ and thus $\kappa_2 = 0$. A rejection of $H_0' : \kappa_2 = 0$ therefore implies a rejection of $H_0$.

  – Under $H_1$, $\delta \neq 0$: if the rank condition holds, then $\operatorname{rk}\Pi_2 = d_{x_2}$, and $\kappa_2 = \Pi_2\delta \neq 0$. On the other hand, if the rank condition fails, it could well be the case that $\kappa_2 = 0$. This means that some violations of $H_0$ cannot be detected.

• The AR test of $H_0 : \beta_2 = \beta_2^*$ thus consists of testing $\kappa_2 = 0$ in (1.29), using the usual OLS-based Wald statistic.

• The inability of this approach to detect departures from $H_0$ when the rank condition fails makes perfect sense, because in this case $\beta_2$ is not identified.
  – Consider the extreme case where $\Pi_2 = 0$. Then $\kappa_2 = 0$, whatever the true value of $\beta_2$ happens to be. Thus the Anderson–Rubin test of $H_0 : \beta_2 = \beta_2^*$ would reject with probability (approximately) equal to the significance level of the test, both under the null and under all possible alternatives.
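A sketch of the AR procedure, as an illustrative helper: form $y_i(\beta_2^*)$, regress it on $z_i = (x_{1i}^T, z_{2i}^T)^T$, and compute the (here homoskedastic, for simplicity) Wald statistic for $\kappa_2 = 0$.

```python
import numpy as np
from scipy.stats import chi2

def ar_test(y, x1, x2, z2, beta2_star):
    """Anderson-Rubin test of H0: beta_2 = beta2_star; remains valid even
    under weak or failed identification. Returns the chi^2[d_z2] statistic
    and its asymptotic p-value."""
    y0 = y - x2 @ beta2_star                  # y_i(beta2*)
    Z = np.column_stack([x1, z2])
    n, dz2 = len(y), z2.shape[1]
    kappa = np.linalg.lstsq(Z, y0, rcond=None)[0]
    e = y0 - Z @ kappa
    s2 = e @ e / n                            # homoskedastic error variance
    V = s2 * np.linalg.inv(Z.T @ Z / n)       # asy. variance of kappa_hat
    k2 = kappa[-dz2:]                         # coefficients on z2
    Wn = n * k2 @ np.linalg.solve(V[-dz2:, -dz2:], k2)
    return Wn, chi2.sf(Wn, dz2)
```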

• Remarks:
(i) The test has good power properties when $d_{x_2} = d_{z_2}$. However, when $d_{z_2}$ is much larger than $d_{x_2}$, the test is rather inefficient. This is because a test of $H_0' : \kappa_2 = 0$ involves $d_{z_2}$ restrictions, rather than merely $d_{x_2}$ restrictions. In this case, such alternatives as Kleibergen's (2002) LM test and Moreira's (2003) CLR test will typically outperform the AR test.
(ii) Supposing we conjecture the correct value for $\beta_2$, a test of $H_0' : \kappa_2 = 0$ also corresponds to a test of the exclusion restrictions (i.e. the restriction that $z_{2i}$ should not appear in the model). The AR test relies on us interpreting a rejection of $H_0'$ as a rejection of $H_0$, rather than as signifying that some instruments have been incorrectly excluded from the structural equation.
(iii) The argument above supposes that the null restricts all the elements of $\beta_2$; whereas often we will only want to restrict a subset of these. The basic principle behind the AR test also extends to this case, but the proof of its validity is much more involved.

1.A Suggested (optional) further reading

• The preceding mostly follows the treatment given in Wooldridge (2002, Ch. 5).
• Other potentially useful references are Davidson and MacKinnon (2004, Ch. 8) and Greene (2008, Ch. 12).


2 Generalised method of moments

2.1 Introduction

2.1.1 Motivating examples

• Recall that in the linear IV model,
$$y_i = x_i^T\beta_0 + u_i, \qquad Ez_iu_i = 0, \tag{2.1}$$
$\beta_0$ was identified as the unique solution to the following system (of $d_z$ equations),
$$Ez_i(y_i - x_i^T\beta) = 0,$$
which are also termed moment conditions (or estimating equations).
  – The sample analogue of these conditions,
$$\frac1n\sum_{i=1}^n z_i(y_i - x_i^T\beta) \approx 0, \tag{2.2}$$
then provided the basis for the estimation of $\beta_0$. We have written '≈' here because, in the general case ($d_z > d_x$), no value of $\beta$ is capable of setting every equation in (2.2) to zero simultaneously.

• Moment conditions naturally arise as identifying conditions in many other econometric models.

Example 2.1.
• Suppose we were to generalise (2.1) to
$$y_i = h(x_i;\beta_0) + u_i, \qquad Ez_iu_i = 0,$$
where $x \mapsto h(x;\beta)$ is a non-linear function parametrised by $\beta$.
• In this case, we would aim to identify $\beta_0$ as the solution to
$$Ez_i[y_i - h(x_i;\beta)] = 0.$$
• Whether or not this equation is uniquely solved at $\beta = \beta_0$ will depend on the parametrisation of $h$.

Example 2.2.
• Suppose that we observe a sample of individuals, each of whom receives a wage $\omega_i \in \mathbb{R}$, and who is observed to purchase quantities $y_i = (y_{i1}, y_{i2})$ of two goods: 'consumption' (an aggregate of all consumption expenditures), and 'leisure'.

• In this case, the model supposes that the individuals have preferences given by some utility function
$$U(y;\xi_i,\beta_0),$$
where $\xi_i$ is a random disturbance with a known distribution, assumed to be independent of $\omega_i$, which allows for individual tastes to vary with unobservable characteristics. [In a more realistic example, we would also allow for $U$ to depend on some observable characteristics $x_i$.]

• For example, we might specify that $U$ has the Cobb–Douglas form,
$$U(y;\xi,\beta) = \lambda(\beta_1 + \beta_2\xi)\log y_1 + [1 - \lambda(\beta_1 + \beta_2\xi)]\log y_2, \tag{2.3}$$
where $\lambda : \mathbb{R} \to [0, 1]$ is some function mapping from $\mathbb{R}$ to the unit interval; a common choice would be the logistic function,
$$\lambda(x) := \frac{1}{1 + e^{-x}}$$
(though many, many other choices are possible); we might specify $\xi_i$ to be i.i.d. $N[0, 1]$.

• Households choose $y$ optimally, by maximising $U$ subject to their budget constraint,
$$y_1 \le \omega_i(T - y_2),$$
where $T - y_2$ gives the total hours worked during the week, when $y_2$ hours of leisure are taken. Solving this constrained optimisation problem yields the $i$th household's optimal (Marshallian) demands,
$$y_i^*(\beta) := y^*(\omega_i;\xi_i,\beta) = \begin{pmatrix} y_1^*(\omega_i;\xi_i,\beta)\\ y_2^*(\omega_i;\xi_i,\beta)\end{pmatrix} \stackrel{(1)}{=} \begin{pmatrix}\lambda(\beta_1 + \beta_2\xi_i)\,\omega_iT\\ [1 - \lambda(\beta_1 + \beta_2\xi_i)]\,T\end{pmatrix},$$
where $\stackrel{(1)}{=}$ follows from the Cobb–Douglas form given in (2.3).

• We shall maintain that the preceding model is correctly specified, in the sense that the observed data on $(y_i, \omega_i)$ were generated by a population of households with preference parameters $\beta_0$. How could we identify $\beta_0$?

• One possibility is through the moment conditions implied by the model. Consider, for example,
$$m(y,\omega) := (y_1,\ y_2,\ y_1^2,\ y_2^2,\ y_1y_2,\ y_1\omega,\ y_2\omega)^T.$$
Then $Em(y_i,\omega_i)$ is a vector of population moments associated with the data (means, second moments and cross-products): this will agree with $Em[y_i^*(\beta),\omega_i]$ if (and hopefully, only if) the latter is evaluated at $\beta = \beta_0$.

• In this manner, we arrive at the identifying moment condition
$$0 = E\{m(y_i,\omega_i) - m[y_i^*(\beta),\omega_i]\} = E[m(y_i,\omega_i) - \mu(\omega_i,\beta)], \tag{2.4}$$
where we have defined
$$\mu(\omega_i,\beta) := E\{m[y^*(\omega_i;\xi,\beta),\omega_i] \mid \omega_i\} \stackrel{(1)}{=} \int m[y^*(\omega_i;\xi,\beta),\omega_i]\,\phi(\xi)\,d\xi,$$
where $\stackrel{(1)}{=}$ holds, for $\phi(\cdot)$ the standard normal density, if $\xi$ is assumed to have a $N[0, 1]$ distribution.

• We have rewritten the identifying conditions (2.4) in terms of $\mu(\omega_i,\beta)$ rather than $m[y_i^*(\beta),\omega_i]$, because the latter depends on $\xi_i$, which is not observed.

2.1.2 A general framework

• We shall now abstract from these examples to a more general setting:
  – We observe a sample $\{w_i\}_{i=1}^n$ of $\mathbb{R}^{d_w}$-valued i.i.d. random vectors.
  – The process generating this data is described by a model parametrised by $\theta \in \Theta \subseteq \mathbb{R}^{d_\theta}$; the true value of this parameter is $\theta_0$.
  – There is a function
$$g : \mathbb{R}^{d_w} \times \Theta \to \mathbb{R}^{d_g}, \qquad (w,\theta) \mapsto g(w;\theta),$$
such that
GMM-ID  $\theta = \theta_0$ is the unique solution to $Eg(w_i,\theta) = 0$.  (2.5)
We shall discuss primitive conditions for GMM-ID subsequently.

  – Notions of 'exact identification' and 'over-identification' generalise readily from the IV model: here the analogue of the order condition for identification is $d_g \ge d_\theta$, and $d_g - d_\theta$ measures the degree of over-identification.

• You may easily verify that both preceding examples fit into this framework:
  – Example 2.1: $w_i = (y_i, x_i, z_i)$, and $g(w;\beta) = z[y - h(x;\beta)]$;
  – Example 2.2: $w_i = (y_i, \omega_i)$, and $g(w,\beta) = m(y,\omega) - \mu(\omega,\beta)$.

• If we could evaluate the function $Eg(w_i,\theta)$ for each $\theta \in \Theta$, then we could recover $\theta_0$ by solving (2.5).
  – In practice, this is infeasible, since the distribution of $w_i$ is unknown.
  – However, as discussed in more detail below, the LLN ensures that the sample counterpart
$$g_n(\theta) := \frac1n\sum_{i=1}^n g(w_i,\theta)$$
delivers a reasonable approximation to $Eg(w_i,\theta)$, at each $\theta \in \Theta$.
  – Accordingly, the $\theta$ that solves
$$g_n(\theta) = 0 \tag{2.6}$$
should also be 'close' to the $\theta$ that solves (2.5), i.e. $\theta_0$. But it is only possible to find a solution to (2.6) in the exactly identified case ($d_g = d_\theta$) (and, if the equations are nonlinear, not necessarily even then).

• Since $g_n(\theta)$ cannot be set exactly equal to zero, we should instead choose $\theta$ so as to make it as close as possible to zero: this requires an appropriate measure of distance.
  – A particularly tractable way of measuring this distance is provided by
$$Q_n(\theta) := g_n(\theta)^TW_ng_n(\theta), \tag{2.7}$$
where $W_n$ is a $d_g \times d_g$ positive semi-definite weight matrix, which may depend on the data (see below).
  – Alternatively, we might take $d_\theta$ linear combinations of the conditions (2.6), thus reducing an overidentified system to an exactly identified one. This was how we dealt with over-identification in the linear IV model, by reducing the sample identifying conditions from (1.9) to (1.10).

  – As we shall see, the first-order conditions characterising the minimiser of (2.7) take exactly the form of $d_\theta$ linear combinations of the sample moment conditions, and in this sense these two approaches are equivalent.

• $Q_n$ is termed the generalised method of moments (GMM) criterion function; the GMM estimator is defined as
$$\hat\theta_n := \operatorname*{argmin}_{\theta\in\Theta} Q_n(\theta).$$
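Since $Q_n$ generally has no closed-form minimiser, in practice $\hat\theta_n$ is computed numerically. A sketch, using the linear IV moments $g(w_i,\beta) = z_i(y_i - x_i^T\beta)$ as the running example; any moment function returning the $n \times d_g$ matrix of contributions would do in its place, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def gmm_estimate(g, theta_init, W):
    """Minimise Q_n(theta) = g_n(theta)' W g_n(theta), where g(theta)
    returns the n x d_g matrix of moment contributions g(w_i, theta)."""
    def Qn(theta):
        gbar = g(theta).mean(axis=0)          # g_n(theta)
        return gbar @ W @ gbar
    return minimize(Qn, theta_init, method="BFGS").x

# Running example: linear IV moments z_i (y_i - x_i' beta).
rng = np.random.default_rng(4)
n = 2_000
Z = rng.normal(size=(n, 2))
v = rng.normal(size=n)
X = (Z @ np.array([0.7, 0.4]) + v)[:, None]
y = X[:, 0] * 2.0 + 0.5 * v + rng.normal(size=n)

g_iv = lambda beta: Z * (y - X @ beta)[:, None]     # n x d_g
W = np.linalg.inv(Z.T @ Z / n)                      # 2SLS-equivalent weight
print(gmm_estimate(g_iv, np.zeros(1), W))           # approx [2.0]
```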

• Different choices of weight matrix yield different estimators.
  – The choice of weight matrix can be regarded as generalising the choice of instruments from the linear IV model.
  – We might write $\hat\theta_n(W_n)$ to make the dependence of the estimator on the weight matrix explicit. If $W_n \overset{p}{\to} W$ and $V_n \overset{p}{\to} V$ are distinct, where $W$ and $V$ are both positive definite, then in general $\hat\theta_n(W_n) \neq \hat\theta_n(V_n)$; but – as we shall see below – both will converge in probability to $\theta_0$.

2.2 Asymptotics

• Recall that:
  – The identifying moment conditions are
$$Eg(w_i,\theta) = 0,$$
a system of $d_g \ge d_\theta$ equations, uniquely solved at $\theta = \theta_0$ (GMM-ID).
  – The GMM estimator is given by
$$\hat\theta_n := \operatorname*{argmin}_{\theta\in\Theta} Q_n(\theta) = \operatorname*{argmin}_{\theta\in\Theta} g_n(\theta)^TW_ng_n(\theta). \tag{2.8}$$

• For the weight matrix, we shall require:
GMM-WGHT  $W_n$ is positive semi-definite, and $W_n \overset{p}{\to} W$, for $W$ positive definite.

2.2.1 Consistency

• Previously, when having to show consistency of OLS or 2SLS, we solved (2.8) to obtain an explicit expression for $\hat\theta_n$ – to which we then applied the LLN and Slutsky's theorem. But in general, (2.8) can only be solved numerically; all we know about $\hat\theta_n$ is that it minimises $Q_n$.

• The limiting behaviour of $\hat\theta_n$ must therefore be inferred from that of $Q_n$. How? Observe that
$$g_n(\theta) = \frac1n\sum_{i=1}^n g(w_i,\theta),$$
and for each $\theta \in \Theta$, $\{g(w_i,\theta)\}_{i=1}^n$ is just a collection of i.i.d. random variables. Thus by the LLN (provided $E\|g(w_i,\theta)\| < \infty$),
$$g_n(\theta) \overset{p}{\to} Eg(w_i,\theta) =: g_0(\theta).$$
Hence
$$Q_n(\theta) = g_n(\theta)^TW_ng_n(\theta) \overset{p}{\to} g_0(\theta)^TWg_0(\theta) =: Q(\theta). \tag{2.9}$$

• Now it seems reasonable to expect that, since $\hat\theta_n$ minimises $Q_n$, and $Q_n(\theta) \overset{p}{\to} Q(\theta)$ for each $\theta \in \Theta$, then $\hat\theta_n$ should converge (in probability) to the corresponding minimiser of $Q$.
  – This will indeed be the case, under reasonable conditions. [Technically, we also need to show that $Q_n$ converges to $Q$ uniformly in probability, which is a stronger form of convergence than is given in (2.9). This sort of convergence is discussed in detail in the second-year course.]
  – So we need to determine the minimiser of $Q$: since $W$ is positive definite,
$$Q(\theta)\begin{cases} > 0 & \text{if } g_0(\theta) \neq 0,\\ = 0 & \text{if } g_0(\theta) = 0;\end{cases}$$
and thus, under GMM-ID, $Q$ is uniquely minimised at $\theta_0$.
  – In this manner, the consistency of $\hat\theta_n$ for $\theta_0$ follows from $Q_n$ being 'consistent' for $Q$, which has a unique minimum at $\theta_0$.
  – Observe that the actual value of $W$ plays no role here: all that matters is that it be positive definite.

2.2.2 Asymptotic normality

• Having established the consistency of $\hat\theta_n$, its asymptotic normality may be deduced via the following linearisation argument.

• We first note that, for any differentiable function $h : \Theta \to \mathbb{R}^{d_h}$,
$$\nabla_\theta[h(\theta)^TAh(\theta)] = [D_\theta h(\theta)]^T(A + A^T)h(\theta), \tag{2.10}$$

where for any $f : \Theta \to \mathbb{R}$,
$$\nabla_\theta f(\theta) := \begin{pmatrix}\partial_1 f(\theta)\\ \vdots\\ \partial_{d_\theta}f(\theta)\end{pmatrix}, \qquad D_\theta h(\theta) := \begin{pmatrix}\partial_1 h_1(\theta) & \cdots & \partial_{d_\theta}h_1(\theta)\\ \vdots & \ddots & \vdots\\ \partial_1 h_{d_h}(\theta) & \cdots & \partial_{d_\theta}h_{d_h}(\theta)\end{pmatrix},$$
i.e. the gradient of $f$ and the Jacobian of $h$, respectively; here $\partial_k f(\theta)$ denotes the partial derivative of $f$ with respect to the $k$th element of $\theta = (\theta_1, \dots, \theta_{d_\theta})^T$.

• Assuming $g(w_i,\theta)$ – and thus $g_n(\theta)$ – is differentiable in $\theta$, the minimiser $\hat\theta_n$ of $Q_n(\theta) = g_n(\theta)^TW_ng_n(\theta)$ must satisfy the first-order condition (FOC) for an interior minimum,
$$0 = \nabla_\theta Q_n(\hat\theta_n) \stackrel{(2)}{=} 2[D_\theta g_n(\hat\theta_n)]^TW_ng_n(\hat\theta_n), \tag{2.11}$$
where $\stackrel{(2)}{=}$ follows by (2.10).
  – Note that this condition holds only if $\hat\theta_n$ is not on the boundary of $\Theta$. Recalling that the interior of a set $\Theta$, denoted $\operatorname{int}\Theta$, is defined as $\Theta$ less its boundary points, we shall want to assume that
INTR  $\theta_0 \in \operatorname{int}\Theta$.
Why? By INTR, there must exist an $\epsilon > 0$ such that $B(\theta_0,\epsilon) := \{\theta \in \Theta : \|\theta - \theta_0\| < \epsilon\}$ is wholly contained in $\Theta$. By consistency,
$$P\{\hat\theta_n \in \operatorname{int}\Theta\} \ge P\{\hat\theta_n \in B(\theta_0,\epsilon)\} \to 1,$$
whence the FOC (2.11) will hold with probability approaching 1 (so far as these asymptotic arguments are concerned, this essentially permits us to proceed as though (2.11) always holds).

• Under certain regularity conditions [discussed in the second-year course], the LLN delivers
$$D_\theta g_n(\theta) = \frac1n\sum_{i=1}^n D_\theta g(w_i,\theta) \overset{p}{\to} ED_\theta g(w_i,\theta) = D_\theta Eg(w_i,\theta) = D_\theta g_0(\theta), \tag{2.12}$$
i.e. the Jacobian of the sample moments converges to its population counterpart.

By consistency, therefore [together with some additional conditions],
$$D_{n,1} := D_\theta g_n(\hat\theta_n) \overset{p}{\to} D_\theta g_0(\theta_0) =: D.$$

• It will also be true that $g_n(\hat\theta_n) \overset{p}{\to} g_0(\theta_0) = 0$, but this is not very helpful. Far more useful is the fact that
$$n^{1/2}g_n(\theta_0) = \frac{1}{n^{1/2}}\sum_{i=1}^n g(w_i,\theta_0) \overset{d}{\to} N[0, S], \qquad S := Eg(w_i,\theta_0)g(w_i,\theta_0)^T,$$
by the CLT (supposing $E\|g(w_i,\theta_0)\|^2 < \infty$).

• By consistency, $g_n(\hat\theta_n)$ should be close to $g_n(\theta_0)$, and indeed a kind of mean-value expansion here yields
$$g_n(\hat\theta_n) = g_n(\theta_0) + D_\theta g_n(\tilde\theta_n)(\hat\theta_n - \theta_0), \tag{2.13}$$
where $\tilde\theta_n$ lies on the ray connecting $\hat\theta_n$ to $\theta_0$. [Actually, to make this rigorous, each row of $D_\theta g_n$ would have to be evaluated at a (possibly) different point along that ray, but this is irrelevant for the asymptotics.]
  – Note that a mean-value expansion of $g_n$ between the points $\theta_0$ and $\hat\theta_n$ is only necessarily valid if all the points lying between $\theta_0$ and $\hat\theta_n$ also lie within $\Theta$. INTR and consistency will again take care of this for us, since $P\{\hat\theta_n \in B(\theta_0,\epsilon)\} \to 1$ and $B(\theta_0,\epsilon) \subseteq \Theta$.
  – Since $\tilde\theta_n$ lies between $\theta_0$ and $\hat\theta_n$, the consistency of $\hat\theta_n$ ensures that $\tilde\theta_n \overset{p}{\to} \theta_0$. It is accordingly reasonable to expect, on the basis of (2.12), that
$$D_{n,2} := D_\theta g_n(\tilde\theta_n) \overset{p}{\to} D_\theta g_0(\theta_0) =: D. \tag{2.14}$$
[This will indeed follow once the convergence in (2.12) is suitably strengthened to uniform convergence in probability.]

• Substituting (2.13) into (2.11) yields
$$0 = D_{n,1}^TW_n[g_n(\theta_0) + D_{n,2}(\hat\theta_n - \theta_0)],$$
whence
$$n^{1/2}(\hat\theta_n - \theta_0) = -(D_{n,1}^TW_nD_{n,2})^{-1}D_{n,1}^TW_nn^{1/2}g_n(\theta_0) \overset{d}{\to} (D^TWD)^{-1}D^TW \cdot N[0, S]. \tag{2.15}$$

• (2.15) only makes sense if $D^TWD$ has full rank; under GMM-WGHT, it suffices that
GMM-JAC  $\operatorname{rk}D = \operatorname{rk}D_\theta Eg(w_i,\theta_0) = d_\theta$.

• Although it is not strictly necessary, it is also convenient to assume that none of the moment conditions are redundant, in the sense of being expressible as linear combinations of any of the others; this is encoded in the requirement that
GMM-VAR  $S := Eg(w_i,\theta_0)g(w_i,\theta_0)^T$ is positive definite.

Theorem 2.1. Under INTR, GMM-ID, GMM-WGHT, GMM-JAC, GMM-VAR and further regularity conditions [discussed in the second-year course], the GMM estimator $\hat\theta_n$ has $n^{1/2}(\hat\theta_n - \theta_0) \overset{d}{\to} N[0, V_W]$, where
$$V_W = (D^TWD)^{-1}D^TWSWD(D^TWD)^{-1}.$$

• Remarks:
(i) A consistent estimator of $V_W$ can be constructed by replacing $W$ by $W_n$, and $D$ and $S$ by
$$\hat D_n := D_\theta g_n(\hat\theta_n) = \frac1n\sum_{i=1}^n D_\theta g(w_i,\hat\theta_n), \qquad \hat S_n := \frac1n\sum_{i=1}^n g(w_i,\hat\theta_n)g(w_i,\hat\theta_n)^T - g_n(\hat\theta_n)g_n(\hat\theta_n)^T.$$
[Consistency of these estimators is covered in the second-year course.]
(ii) If we regard $W_n$ as an estimator of $W$, it is clear from the proof that the variability of $W_n$ has no (first-order) effect on the limiting distribution of $\hat\theta_n$. This generalises our earlier observation concerning 2SLS: recall that the variability of $\hat\Pi_n$ made no contribution to the limiting variance of the 2SLS estimator.

2.2.3 Local identification and weak identification

• When GMM-ID holds, $Eg(w_i,\theta) = 0$ has a unique solution at $\theta = \theta_0$, and we say that $\theta_0$ is globally identified.

• When GMM-JAC holds, $D = D_\theta Eg(w_i,\theta_0)$ has rank $d_\theta$: this is sufficient for $\theta_0$ to be locally identified. This means that there is an $\epsilon > 0$ such that $Eg(w_i,\theta') \neq 0$ for all $\theta' \neq \theta_0$ with $\|\theta' - \theta_0\| < \epsilon$, i.e. there are no other solutions to the identifying moment conditions lying arbitrarily close to $\theta_0$.

  – In the linear IV model, $g_n(\beta) = \frac1n\sum_{i=1}^n z_i(y_i - x_i^T\beta)$, and so
$$D_\beta g_n(\beta) = -\frac1n\sum_{i=1}^n z_ix_i^T \implies D = -Ez_ix_i^T.$$
GMM-JAC is thus the non-linear GMM counterpart of IV-RANK.
  – In general, GMM-JAC is sufficient for global identification (as it is in the linear IV model) iff the moment conditions are linear in $\theta$.

• The GMM analogue of 'weak instruments' thus corresponds to the case where GMM-JAC holds, but $D$ is 'close' to having rank strictly less than $d_\theta$; we say that $\theta_0$ is weakly identified.
  – When parameters are weakly identified, standard inferential procedures (based on Theorem 2.1) cease to be reliable, and other, identification-robust methods must be used. (For example, the Anderson–Rubin test can be generalised to the present setting.)

2.3 Asymptotic efficiency

2.3.1 The choice of weight matrix

• In view of the dependence of the limiting variance of $\hat\theta_n$ on $W$, it is natural that we should want to choose $W$ so as to make this variance as small as possible.

• It turns out (see the problem set) that setting $W = S^{-1}$ yields an estimator that is efficient, in the sense that the difference $V_W - V_{S^{-1}}$ is positive semi-definite, for all (positive definite) $W$. Moreover, when $W = S^{-1}$, the asymptotic variance simplifies to
$$V_{S^{-1}} = (D^TS^{-1}D)^{-1}.$$
  – Observe that $W = S^{-1}$ will give greatest weight to those moment conditions which have the smallest variance, so it is perhaps not so surprising that this choice leads to an estimator with desirable properties.
  – Any estimator for which $W_n \overset{p}{\to} S^{-1}$ is termed an efficient GMM estimator.

• The major difficulty is actually realising this choice of weight matrix, which requires (essentially) a consistent estimator for $S$. The standard way of dealing with this problem is to use a two-step estimator (sketched in code below):
(i) Compute $\hat\theta_{n,1}$ as the minimiser of $g_n(\theta)^TW_0g_n(\theta)$, where $W_0$ is some 'convenient' choice of positive definite weight matrix. This estimator is necessarily consistent, though not efficient.
(ii) Use $\hat\theta_{n,1}$ to compute $\hat S_n = \frac1n\sum_{i=1}^n g(w_i,\hat\theta_{n,1})g(w_i,\hat\theta_{n,1})^T - g_n(\hat\theta_{n,1})g_n(\hat\theta_{n,1})^T$.
(iii) Finally, compute $\hat\theta_{n,2}$ as the minimiser of $g_n(\theta)^T\hat S_n^{-1}g_n(\theta)$; this is asymptotically efficient, since $\hat S_n^{-1} \overset{p}{\to} S^{-1}$.
  – While this approach is still widespread, it has to some extent fallen out of favour in recent years. One problem is that $\hat S_n$, although consistent, may not be a very good estimator of $S$. Standard methods of conducting inference (based on Theorem 2.1) fail to take account of this, and so may be unreliable.
  – (Generalised) empirical likelihood estimators offer an alternative to GMM that circumvents this problem, by producing an efficient estimator of $S$ simultaneously with solving the ($\hat S_n^{-1}$-weighted) sample moment conditions. This realises an estimator that shares the same limiting distribution as the efficient GMM estimator, and which has better higher-order properties.
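A sketch of the two-step recipe just listed, for a generic moment function; `two_step_gmm` is an illustrative helper, and `np.cov` divides by $n - 1$ rather than $n$, which is immaterial asymptotically.

```python
import numpy as np
from scipy.optimize import minimize

def two_step_gmm(g, theta_init):
    """Two-step efficient GMM: (i) minimise with W0 = I; (ii) re-minimise
    with W = S_hat^{-1}, where S_hat is the centred covariance of the
    moment contributions evaluated at the step-(i) estimate. g maps
    theta -> n x d_g array of contributions g(w_i, theta)."""
    Qn = lambda th, W: g(th).mean(0) @ W @ g(th).mean(0)
    dg = g(theta_init).shape[1]
    th1 = minimize(Qn, theta_init, args=(np.eye(dg),), method="BFGS").x
    G = g(th1)
    S_hat = np.atleast_2d(np.cov(G, rowvar=False))   # ~ (1/n) sum gg' - gbar gbar'
    W = np.linalg.inv(S_hat)
    th2 = minimize(Qn, th1, args=(W,), method="BFGS").x
    return th2, W        # W is also what an over-identification test would use
```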

2.3.2 The implied (efficient) choice of moments

2.3.2 The implied (efficient) choice of moments

  • As we noted above, an alternative to minimising $Q_n(\theta) = g_n(\theta)^T W_n g_n(\theta)$ would be to take $d_\theta$ linear combinations of the $g_n(\theta)$ equations, yielding an estimator that corresponds to the exact solution to
$$A_n g_n(\theta) = 0, \tag{2.16}$$
for some $d_\theta \times d_g$ matrix $A_n$.

  • (2.16) may be compared with the FOC solved by the GMM estimator, $[D_\theta g_n(\theta)]^T W_n g_n(\theta) = 0$ (equivalently, $\hat\theta_n$ solves $[D_\theta g_n(\hat\theta_n)]^T W_n g_n(\theta) = 0$), and which is asymptotically equivalent to (in the sense of having the same limiting distribution) an estimator that solves $D^T W_n g_n(\theta) = 0$, where $D := E D_\theta g(w_i, \theta_0)$ (see the problem set).

  • Setting $W_n = S^{-1}$, the matrix $D^T S^{-1}$ provides the optimal $d_\theta$ linear combinations of the $d_g$ moments – optimal in the sense that the solution to $D^T S^{-1} g_n(\theta) = 0$ yields an asymptotically efficient estimator. Again, a problem with realising this choice is that both $D$ and $S$ must be estimated.

  • For this reason, the estimation problem is rarely approached in this way (estimating $S$ causes enough problems as it is). An exception arises when the moment conditions are linear in $\theta$, because then $D_\theta g_n(\theta)$ does not depend on $\theta$, as we shall now discuss in the context of the linear IV model.

2.3.3 Efficiency in the linear IV model

  • Recall that in this model, the sample moments are
$$g_n(\beta) = \frac{1}{n}\sum_{i=1}^n g(w_i, \beta) = \frac{1}{n}\sum_{i=1}^n z_i(y_i - x_i^T\beta),$$
so that, in particular, $g(w_i, \beta_0) = z_i u_i$. Thus
$$S = E g(w_i, \beta_0)g(w_i, \beta_0)^T = E u_i^2 z_i z_i^T.$$

  • Suppose that IV-HMSK holds. Then the preceding simplifies to $\sigma_u^2 E z_i z_i^T$, which suggests that
$$W_n^* = \hat\sigma_u^{-2}\left(\frac{1}{n}\sum_{i=1}^n z_i z_i^T\right)^{-1}$$
is an efficient choice for a weight matrix.
  – Since any weight matrix proportional to $(\sum_{i=1}^n z_i z_i^T)^{-1}$ will give the same estimator, in this special case the efficient GMM estimator may be calculated in one step (there is no need to estimate $\sigma_u^2$).
  – Indeed, in this case the efficient GMM estimator corresponds to 2SLS: this can be seen most readily from the FOC solved by the GMM estimator, which here takes the form
$$[D_\beta g_n(\beta)]^T W_n^* g_n(\beta) = 0$$
$$\overset{(1)}{\iff} \hat\sigma_u^{-2}\left(\frac{1}{n}\sum_{i=1}^n x_i z_i^T\right)\left(\frac{1}{n}\sum_{i=1}^n z_i z_i^T\right)^{-1}\frac{1}{n}\sum_{i=1}^n z_i(y_i - x_i^T\beta) = 0$$
$$\iff X^T Z (Z^T Z)^{-1}\sum_{i=1}^n z_i(y_i - x_i^T\beta) = 0$$
$$\iff \hat\Pi_n^T \sum_{i=1}^n z_i(y_i - x_i^T\beta) = 0,$$
where $\overset{(1)}{\iff}$ follows from $D_\beta g_n(\beta) = -\frac{1}{n}\sum_{i=1}^n z_i x_i^T = -\frac{1}{n}Z^T X$ (the factor $-\hat\sigma_u^{-2}$, being a nonzero scalar, may be dropped from the FOC). This equivalence is easy to confirm numerically, as the sketch below illustrates.
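The following is a minimal numerical check (a sketch with simulated data; all variable names are illustrative, not from the notes) that GMM with weight matrix proportional to $(Z^T Z)^{-1}$ reproduces the 2SLS formula $\hat\beta = (X^T P_Z X)^{-1} X^T P_Z y$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=(n, 2))                                   # two instruments
u = rng.normal(size=n)
x = z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=n)   # endogenous regressor
y = 2.0 * x + u
X, Z = x[:, None], z

# GMM with W proportional to (Z'Z)^{-1}: solves X'Z (Z'Z)^{-1} Z'(y - Xb) = 0
W = np.linalg.inv(Z.T @ Z)
A = X.T @ Z @ W @ Z.T @ X
b_gmm = np.linalg.solve(A, X.T @ Z @ W @ Z.T @ y)

# 2SLS: regress y on the first-stage fitted values P_Z X
PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_2sls = np.linalg.solve(PZX.T @ X, PZX.T @ y)

assert np.allclose(b_gmm, b_2sls)   # identical, as the algebra above shows
```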

  • In the more general, heteroskedastic case, an optimal choice of weight matrix is
$$W_n^* = \left(\frac{1}{n}\sum_{i=1}^n \hat u_i^2 z_i z_i^T\right)^{-1},$$
where $\hat u_i = y_i - x_i^T\hat\beta_{n,1}$ for some consistent initial estimator $\hat\beta_{n,1}$: the efficient estimator can only be realised as a two-step estimator in this case.

2.4 Tests of over-identifying restrictions

  • Just as it is possible in the linear IV model to test the identifying orthogonality conditions $E z_i u_i = 0$ (IV-ORTH), so in the more general setting of a model with parameters identified by (nonlinear) moment conditions, it is possible to test
$$H_0 : E g(w_i, \theta_0) = 0 \quad \text{against} \quad H_1 : E g(w_i, \theta_0) \neq 0.$$
This is termed a (Hansen) test of over-identifying restrictions.
  – Because of the need to estimate $\theta_0$, the power of this test will be severely curtailed in some directions – and it will have no power at all in the exactly identified case ($d_g = d_\theta$).
  – Provided that an efficient weight matrix is used, the test can be carried out using the GMM criterion. (This explains our earlier remark, in the context of conducting such a test on the linear IV model, that $\hat u_i$ could be computed using the 2SLS estimator if and only if IV-HMSK held.)


Theorem 2.2. Under GMM-ID, GMM-WGHT, GMM-JAC, GMM-VAR and further regularity conditions,
$$n\, g_n(\hat\theta_n)^T \hat S_n^{-1} g_n(\hat\theta_n) \overset{d}{\to} \chi^2[d_g - d_\theta].$$

Proof.

  • The test statistic is a quadratic form in $g_n(\hat\theta_n)$: this suggests that we should proceed by first determining the limiting behaviour of $\hat S_n^{-1/2} n^{1/2} g_n(\hat\theta_n)$.

  • In deriving the limiting distribution of 2SLS, we noted above (see (2.13) and (2.14)) that by a kind of mean-value expansion,
$$n^{1/2} g_n(\hat\theta_n) = n^{1/2} g_n(\theta_0) + D_{n,2}\, n^{1/2}(\hat\theta_n - \theta_0),$$
where $D_{n,2} \overset{p}{\to} D_\theta g_0(\theta_0)$, for $g_0(\theta) := E g(w_i, \theta)$.

  • Further, by the expression (2.15) that we obtained for $n^{1/2}(\hat\theta_n - \theta_0)$,
$$\begin{aligned}
\hat S_n^{-1/2} n^{1/2} g_n(\hat\theta_n) &= \hat S_n^{-1/2}\,[I_{d_g} - D_{n,2}(D_{n,1}^T \hat S_n^{-1} D_{n,2})^{-1} D_{n,1}^T \hat S_n^{-1}]\, n^{1/2} g_n(\theta_0) \\
&= [I_{d_g} - \hat S_n^{-1/2} D_{n,2}(D_{n,1}^T \hat S_n^{-1} D_{n,2})^{-1} D_{n,1}^T \hat S_n^{-1/2}]\, \hat S_n^{-1/2} n^{1/2} g_n(\theta_0) \\
&\overset{d}{\to} [I_{d_g} - S^{-1/2} D (D^T S^{-1} D)^{-1} D^T S^{-1/2}]\cdot\xi, \qquad (2.17)
\end{aligned}$$
where $\xi \sim N[0, I_{d_g}]$.

  • Letting $H := S^{-1/2} D$, which has rank $d_\theta$, we recognise this as being equal to
$$[I_{d_g} - H(H^T H)^{-1} H^T]\xi = (I_{d_g} - P_H)\xi = P_H^\perp \xi,$$
where $P_H^\perp$ denotes the orthogonal projection onto the $(d_g - d_\theta)$-dimensional subspace orthogonal to the span of $H$ (equivalently, the matrix which gives the residuals from the projection onto $H$); $P_H^\perp$ is thus a rank $d_g - d_\theta$ matrix.

  • Hence, by the preceding and Lemma 1.1,
$$n\, g_n(\hat\theta_n)^T \hat S_n^{-1} g_n(\hat\theta_n) \overset{d}{\to} \xi^T P_H^\perp \xi \sim \chi^2[d_g - d_\theta].$$

  • Remarks:
(i) If the efficient GMM estimator were not used, the limiting distribution of the test would not be pivotal, but would instead depend on $D$, $S$, and $W$.
(ii) The test for overidentifying restrictions in the homoskedastic linear IV model, discussed in section 1.7.1 above, corresponds exactly to the GMM criterion-function based test in that setting. It is thus also possible to conduct exactly


such a test when IV-HMSK fails, provided that the efficient GMM estimator is used to construct the residuals $\hat u_i$; the limiting distribution remains $\chi^2[d_z - d_x]$.
(iii) Just as in the linear IV model, the GMM test for overidentifying restrictions is clearly blind to certain departures from the null hypothesis. In effect, (2.17) says that
$$\hat S_n^{-1/2} n^{1/2} g_n(\hat\theta_n) \approx P_H^\perp S^{-1/2} n^{1/2} g_n(\theta_0), \tag{2.18}$$
and we can interpret this as implying that the test will have no power against alternatives of the form $E g(w_i, \theta_0) = \delta$, where $P_H^\perp S^{-1/2}\delta = 0$, at least so long as $\delta$ is ‘small’. [Since the approximation (2.18) holds only when $E g(w_i, \theta_0)$ is ‘close’ to zero, it says nothing about the power of the tests against ‘large’ departures from the null – though this will be similarly limited.] A sketch of the computation of the test statistic follows.
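Given an efficient GMM fit, the J statistic is essentially a one-liner. A minimal sketch (hypothetical names; the inputs would come from an estimation step like the one sketched in Section 2.3.1):

```python
import numpy as np
from scipy.stats import chi2

def hansen_j(G_hat, S_hat, d_theta):
    """G_hat: n x d_g array with rows g(w_i, theta_hat); S_hat: estimate of S."""
    n, d_g = G_hat.shape
    gbar = G_hat.mean(axis=0)
    J = n * gbar @ np.linalg.solve(S_hat, gbar)
    return J, chi2.sf(J, df=d_g - d_theta)   # statistic and p-value
```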

2.5 Hypothesis testing

  • Recall that by Theorem 2.1, $n^{1/2}(\hat\theta_n - \theta_0) \overset{d}{\to} N[0, V_W]$, where
$$V_W = (D^T W D)^{-1} D^T W S W D (D^T W D)^{-1},$$
and we discussed how $V_W$ could be consistently estimated by $\hat V_n$.

  • This result allows us to conduct Wald tests of linear restrictions in the same way as before; it does not matter whether or not the efficient GMM estimator is used (so long as the correct variance estimator is used).
  – To reiterate, consider a test of $d_r$ linear restrictions of the form $H_0 : R\theta_0 = \rho$ against $H_1 : R\theta_0 \neq \rho$, where $R$ is a $d_r \times d_\theta$ matrix having rank $d_r$.
  – Then, exactly as for 2SLS, the Wald statistic behaves as
$$W_n := n(R\hat\theta_n - \rho)^T (R\hat V_n R^T)^{-1} (R\hat\theta_n - \rho) \overset{d}{\to} \chi^2[d_r]$$
under $H_0$ (and diverges to $\infty$ under $H_1$).


2.5.1 Tests of nonlinear restrictions and the delta method

  • With the aid of a result known as the delta method, it is also possible to test nonlinear restrictions of the form $H_0 : r(\theta_0) = \rho$ against $H_1 : r(\theta_0) \neq \rho$, where $r : \Theta \to \mathbb{R}^{d_r}$ is smooth, in the sense of being at least differentiable.

Lemma 2.1 (delta method). Suppose (i) $h : \Theta \to \mathbb{R}^{d_r}$ is differentiable at $\theta_0$; and (ii) $s_n(\tilde\theta_n - \theta_0) \overset{d}{\to} \xi$, for some $s_n \to \infty$. Then, for $D := D_\theta h(\theta_0)$,
$$s_n[h(\tilde\theta_n) - h(\theta_0)] \overset{d}{\to} D\xi.$$

Proof.

  • Recall that a function is differentiable at a point $\theta_0$ if it can be locally approximated there by a linear function.

  • More precisely, defining $u : \mathbb{R}^{d_\theta} \to \mathbb{R}^{d_r}$ by
$$u(\delta) := \begin{cases} \dfrac{[h(\theta_0 + \delta) - h(\theta_0)] - D\delta}{\|\delta\|} & \text{if } \delta \neq 0, \\ 0 & \text{if } \delta = 0, \end{cases}$$
we note that (i) implies $u(\delta) \to 0$ as $\delta \to 0$, whence $u$ is continuous at zero. Rewrite the preceding as
$$h(\theta_0 + \delta) - h(\theta_0) = D\delta + u(\delta)\|\delta\|.$$

  • Let $\tilde\delta_n := \tilde\theta_n - \theta_0$, so that $\tilde\delta_n \overset{p}{\to} 0$ and $s_n\tilde\delta_n \overset{d}{\to} \xi$ by (ii). Then
$$s_n[h(\tilde\theta_n) - h(\theta_0)] = s_n[h(\theta_0 + \tilde\delta_n) - h(\theta_0)] = D[s_n\tilde\delta_n] + u(\tilde\delta_n)\, s_n\|\tilde\delta_n\| \overset{d}{\to}_{(3)} D\xi,$$
where $\overset{d}{\to}_{(3)}$ follows by Slutsky's theorem, noting in particular that $u(\tilde\delta_n) \overset{p}{\to} 0$ while $s_n\|\tilde\delta_n\| = O_p(1)$.

  • How can this result be used for testing $H_0 : r(\theta_0) = \rho$? Recalling $n^{1/2}(\hat\theta_n - \theta_0) \overset{d}{\to} N[0, V_W]$ and letting $R := D_\theta r(\theta_0)$, we have immediately that
$$n^{1/2}[r(\hat\theta_n) - r(\theta_0)] \overset{d}{\to} R \cdot N[0, V_W] = N[0, R V_W R^T].$$
  – $R$ is a $d_r \times d_\theta$ matrix, assumed to have rank $d_r$; so $R V_W R^T$ is a $d_r \times d_r$ matrix having rank $d_r$.
  – Under $H_0$, we can replace $r(\theta_0)$ by $\rho$, and thus by Slutsky's theorem
$$W_n = n[r(\hat\theta_n) - \rho]^T (R\hat V_n R^T)^{-1} [r(\hat\theta_n) - \rho] \overset{d}{\to} \chi^2[d_r], \tag{2.19}$$
since $\hat V_n \overset{p}{\to} V_W$, and $R V_W R^T$ has full rank by assumption.
  – It remains to estimate $R$, which may not be known. (It would be enough that $R$ is known under $H_0$, which is sometimes the case.) If $r$ is continuously differentiable, then $\hat R_n := D_\theta r(\hat\theta_n)$ will be consistent for $R$; and it is immediate from consistency that
$$W_n = n[r(\hat\theta_n) - \rho]^T (\hat R_n \hat V_n \hat R_n^T)^{-1} [r(\hat\theta_n) - \rho] \overset{d}{\to} \chi^2[d_r]. \tag{2.20}$$
A sketch of this computation follows.
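As a sketch of how (2.20) might be used in practice (illustrative names throughout; the Jacobian of $r$ is approximated by forward differences, which suffices when $r$ is continuously differentiable):

```python
import numpy as np
from scipy.stats import chi2

def wald_nonlinear(theta_hat, V_hat, r, rho, n, eps=1e-6):
    """Wald test of H0: r(theta_0) = rho; V_hat estimates avar of theta_hat."""
    theta_hat = np.atleast_1d(theta_hat)
    r0 = np.atleast_1d(r(theta_hat))
    # forward-difference estimate of R = D_theta r evaluated at theta_hat
    R = np.column_stack([
        (np.atleast_1d(r(theta_hat + eps * e)) - r0) / eps
        for e in np.eye(len(theta_hat))
    ])
    diff = r0 - np.atleast_1d(rho)
    W = n * diff @ np.linalg.solve(R @ V_hat @ R.T, diff)
    return W, chi2.sf(W, df=len(diff))   # statistic and p-value
```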

  • Remarks:
(i) What if $R V_W R^T$ is rank deficient? Then the appeal to Slutsky's theorem that underpins (2.19) can no longer be justified. To illustrate this more clearly, let $x_n := n^{1/2}[r(\hat\theta_n) - \rho]$, $A_n := R\hat V_n R^T$, and note that the Wald statistic can be written as
$$W_n = h(x_n, A_n) := x_n^T A_n^{-1} x_n.$$
The function $h(x, A) = x^T A^{-1} x$ is only continuous at an invertible matrix $A$, but we have $A_n \overset{p}{\to} R V_W R^T$, which is rank deficient, and therefore not invertible. Consequently, (2.19) no longer holds.
(ii) Nonlinear Wald tests are open to ‘abuse’ of the following kind. Suppose we want to test $H_0 : \theta_0 = 0$, which involves $d_\theta$ restrictions, and so would give a $\chi^2[d_\theta]$ test. An apparently equivalent way of formulating the null is
$$H_0' : r(\theta_0) := \sum_{k=1}^{d_\theta} \theta_{0,k}^2 = 0,$$
which seems to involve only one nonlinear restriction! So we might be tempted to construct a test statistic involving $r(\hat\theta_n)$, as in (2.19) or (2.20), with the expectation that we will now get a $\chi^2[1]$ test. This would fail, because
$$R = D_\theta r(\theta_0) = (2\theta_{0,1}, \ldots, 2\theta_{0,d_\theta}) = (0, \ldots, 0)$$
is a rank zero vector, so the reasoning that led to (2.19) or (2.20) no longer applies.
(iii) More generally, if your restrictions are formulated in such a way that $D_\theta r(\theta_0)$ has reduced rank, this is almost surely a sign that you have formulated your restrictions in an inappropriate way – and that these need to be reformulated before a Wald test can be successfully carried out.
(iv) Wald tests of non-linear restrictions are not invariant to how the null hypothesis is formulated. Suppose, for example, that $d_\theta = 1$, and we are interested in testing $H_0 : \theta_0 = 1$; this gives the Wald statistic
$$W_n = \frac{n(\hat\theta_n - 1)^2}{\hat v_n} \overset{d}{\to} \chi^2[1].$$
Alternatively, we could have phrased the null as $H_0' : \theta_0^3 = 1$; this gives the Wald statistic
$$W_n' = \frac{n(\hat\theta_n^3 - 1)^2}{\hat r_n^2 \hat v_n} \overset{d}{\to} \chi^2[1],$$
where $\hat r_n = 3\hat\theta_n^2$. (Or, since $\theta_0 = 1$ under the null, $\hat r_n = 3$ would also give a valid test.) While $W_n$ and $W_n'$ have the same limiting distribution, they may differ appreciably in a finite sample – and thus lead to possibly different accept/reject decisions.
(v) For similar reasons, Wald tests are not invariant to a reparametrisation of the model either. That is: suppose the model is parametrised by $\theta \in \Theta$, and we test $H_0 : r(\theta) = \rho$ (now possibly a linear restriction) using a Wald statistic. Equivalently, suppose we reparametrise the model in terms of $\gamma = \varphi(\theta)$, and then test $H_0' : r(\varphi^{-1}(\gamma)) = \rho$; the Wald statistic computed in this case may again lead to a different accept/reject decision (in a finite sample).

Example 2.3.

  • Returning to the setting of Example 2.2, recall that the expenditure share parameter in that model is given by $\lambda_i := \lambda(\beta_1 + \beta_2\xi_i) = [1 + \exp(-\beta_1 - \beta_2\xi_i)]^{-1}$. Because $\xi_i \sim_{\text{i.i.d.}} N[0, 1]$, this varies randomly (and unobservably) across households, and so we would generally want to test hypotheses about the distribution of $\lambda_i$.

  • To give a concrete example, suppose we want to test the hypothesis that the mean value of $\lambda_i$ is 0.5, that is
$$E\lambda_i = 0.5, \tag{2.21}$$


which can be equivalently stated as
$$r(\beta) := \int_{\mathbb{R}} \lambda(\beta_1 + \beta_2\xi)\,\phi(\xi)\,\mathrm{d}\xi = 0.5, \tag{2.22}$$
where $\phi$ denotes the standard normal density.

  • $r(\beta)$ is clearly a continuously differentiable, nonlinear function of the parameters. Although no closed form expression for the integral in (2.22) exists, it – and its derivatives with respect to $\beta$ – can be evaluated numerically without difficulty (see the sketch below).

  • A Wald test of (2.21) could accordingly be based upon
$$\frac{n[r(\hat\beta_n) - 0.5]^2}{\hat R_n \hat V_n \hat R_n^T} \overset{d}{\to} \chi^2[1],$$
where $r(\cdot)$ is defined as in (2.22), $\hat\beta_n$ denotes a GMM estimator of $\beta_0$, $\hat V_n$ is a consistent estimate of its limiting variance, and
$$\hat R_n := \left[\int_{\mathbb{R}} \lambda'(\hat\beta_{n,1} + \hat\beta_{n,2}\xi)\,\phi(\xi)\,\mathrm{d}\xi \quad\;\; \int_{\mathbb{R}} \lambda'(\hat\beta_{n,1} + \hat\beta_{n,2}\xi)\,\xi\,\phi(\xi)\,\mathrm{d}\xi\right],$$
for
$$\lambda'(x) := \frac{\partial}{\partial x}\lambda(x) = \frac{e^x}{(1 + e^x)^2}.$$
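The integral in (2.22) is a one-dimensional Gaussian expectation, so it is cheap to evaluate by quadrature. A sketch (names illustrative; note $r(0, \beta_2) = 0.5$ for any $\beta_2$, by the symmetry $\lambda(-a) = 1 - \lambda(a)$, which gives a quick sanity check):

```python
import numpy as np

def r_beta(b1, b2, m=200):
    """E[lambda(b1 + b2*xi)] for xi ~ N(0,1), via Gauss-Hermite quadrature."""
    # probabilists' Hermite rule: integrates against the weight exp(-x^2/2)
    nodes, weights = np.polynomial.hermite_e.hermegauss(m)
    lam = 1.0 / (1.0 + np.exp(-(b1 + b2 * nodes)))
    return weights @ lam / np.sqrt(2.0 * np.pi)

print(r_beta(0.0, 1.3))   # = 0.5 by symmetry
print(r_beta(0.4, 1.3))   # > 0.5
```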

2.5.2 GMM criterion-based tests (QLR tests)

  • One way of avoiding the lack of invariance of Wald tests is to use a test based on the GMM criterion function (sometimes termed a ‘quasi-likelihood ratio’, or QLR, test).

  • Suppose, as above, that we are interested in testing $H_0 : r(\theta_0) = \rho$ against $H_1 : r(\theta_0) \neq \rho$.

  • Whereas a Wald test evaluates $H_0$ by comparing $r(\hat\theta_n)$ with $\rho$, a QLR test evaluates $H_0$ by considering the extent to which the ‘fit’ of the model – as measured by the GMM criterion – deteriorates when $H_0$ is imposed.

  • To state the QLR test statistic:
  – Recall the GMM criterion is $Q_n(\theta; W_n) = g_n(\theta)^T W_n g_n(\theta)$,


and let $\hat S_n$ denote a consistent estimator of $S := E g(w_i, \theta_0)g(w_i, \theta_0)^T$; this can be computed with the aid of an initially consistent estimator of the unrestricted model.
  – Let $\Theta_\rho := \{\theta \in \Theta \mid r(\theta) = \rho\}$ denote the subset of the parameter space that is consistent with the null.

  • The QLR test statistic is defined as
$$QLR_n := n\left[\min_{\theta\in\Theta_\rho} Q_n(\theta; \hat S_n^{-1}) - \min_{\theta\in\Theta} Q_n(\theta; \hat S_n^{-1})\right] \overset{d}{\to} \chi^2[d_r],$$
where $d_r$ is the rank of $D_\theta r(\theta_0)$ (assumed, as for the Wald test, to be equal to the number of restrictions under test).
  – Observe that the test statistic is necessarily non-negative, since the constrained minimum of $Q_n$ is always (weakly) greater than its unconstrained minimum.

  • Remarks:
(i) The QLR test has better invariance properties than the Wald statistic, since it only depends on the minimised values of the GMM criterion function, with and without the null imposed. Modulo the computation of $\hat S_n$, these values are invariant both to how the model is parametrised, and to how the restrictions are formulated (indeed, this is clear from the fact that only $\Theta_\rho$, and not $r(\cdot)$ itself, matters for the computation of the QLR statistic).
(ii) It is essential, in order for the test statistic to have a limiting $\chi^2[d_r]$ distribution, that the efficient weight matrix be used.
(iii) The Wald and QLR tests are asymptotically equivalent: not only do both share the same distributional limit, but it may be shown that $W_n - QLR_n \overset{p}{\to} 0$.

Example 2.4.

  • Returning to the setting of Example 2.3, in view of (2.22) the subset of the parameter space $B$ consistent with the restriction $E\lambda_i = 0.5$ is given by
$$B_{0.5} := \left\{\beta \in B \;\Big|\; \int_{\mathbb{R}} \lambda(\beta_1 + \beta_2\xi)\,\phi(\xi)\,\mathrm{d}\xi = 0.5\right\} = \{\beta \in B \mid r(\beta) = 0.5\}.$$
  • Standard numerical constrained optimisation routines are perfectly capable of handling nonlinear equality constraints of the form $r(\beta) = 0.5$, permitting $Q_n(\beta; \hat S_n^{-1})$ to be numerically minimised over $B_{0.5}$, as in the sketch below.
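A sketch of the constrained step, using scipy's SLSQP routine with a nonlinear equality constraint (all names illustrative; note that, for the $\chi^2[d_r]$ comparison, the criterion `Q` passed in should be the scaled one, $n\, g_n(\theta)^T \hat S_n^{-1} g_n(\theta)$):

```python
import numpy as np
from scipy.optimize import minimize

def qlr_statistic(Q, r, rho, theta_init):
    """QLR = (restricted min of Q) - (unrestricted min of Q)."""
    unres = minimize(Q, theta_init, method="Nelder-Mead")
    cons = [{"type": "eq", "fun": lambda th: r(th) - rho}]
    res = minimize(Q, unres.x, method="SLSQP", constraints=cons)
    return res.fun - unres.fun   # compare with chi2[d_r] critical values
```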


2.A Suggested (optional) further reading

  • I have not followed any particular reference here.
  • You may wish to consult Davidson and MacKinnon (2004, Ch. 9), Greene (2008, Ch. 15), Hayashi (2000, Ch. 3 & 4), and/or Wooldridge (2002, Ch. 14).


3 Maximum likelihood

3.1 Introduction

3.1.1 Parametric and semiparametric estimation

  • The estimation methods considered so far in this course are sometimes termed semiparametric, because they do not require the (joint) distribution of the data to be completely specified. For example:
(i) OLS estimation in the linear regression model, $y_i = x_i^T\beta_0 + u_i$: we assumed $E x_i u_i = 0$ (or possibly $E[u_i \mid x_i] = 0$), but said nothing about the distribution of the $u_i$'s (and the $x_i$'s);
(ii) 2SLS estimation in the linear IV model, $y_i = x_i^T\beta_0 + u_i$: we assumed $E z_i u_i = 0$ (IV-ORTH), along with $\operatorname{rk} E z_i x_i^T = d_x$ (IV-RANK), but were silent about the distribution of the $u_i$'s (and the $x_i$'s);
(iii) GMM estimation based on solving the moment conditions $E g(w_i, \theta) = 0$: here we only needed to assume enough about the (marginal) distribution of $w_i$ to ensure that this equation had a unique solution at $\theta = \theta_0$ (GMM-ID).

  • Semiparametric estimation methods are often attractive, precisely because they allow us to remain relatively agnostic about those aspects of the model (e.g. the distribution of the errors in a linear regression model) that are not of interest to us.

  • Parametric (or ‘fully’ parametric) estimation methods, by contrast, require the entire joint distribution of the data – or, at a minimum, certain conditional distributions – to be completely specified.
  – This means there is a correspondingly greater risk, when using these methods, of misspecifying (some components of) the model, which may lead to the parameters of interest being inconsistently estimated.


  – On the other hand: if we are prepared to make these assumptions, then we can leverage them to obtain estimates of the parameters of interest that are often much more efficient than those provided by semiparametric methods.
  – In particular, the method of maximum likelihood estimation – which, in a wide class of problems, is more efficient than any other estimator – becomes available to us. It is efficient precisely because it fully exploits the information conveyed by the (fully specified) joint distribution of the data, as we shall now discuss.

  • In addition to having attractive efficiency properties, maximum likelihood estimation is also very widely applicable: essentially, if we can work out what a model implies for the joint density of the observed sample – and we can evaluate that joint density function – then it becomes possible to estimate the parameters of the model by maximum likelihood.

3.1.2 The likelihood function: the general case

  • Suppose that we observe the sample $w^n := (w_1, \ldots, w_n)$, which takes values in $\mathcal{W}^n$.
  • To describe how that sample was generated, we have an econometric model, indexed by some parameter $\theta \in \Theta \subseteq \mathbb{R}^{d_\theta}$. The model completely specifies the joint distribution of the sample, by prescribing that it have joint density
$$f(\underline{w}^n; \theta) = f(\underline{w}_1, \ldots, \underline{w}_n; \theta). \tag{3.1}$$

  • The estimation problem, as ever, is to recover (as nearly as possible) the true parameter $\theta_0 \in \Theta$, under which the observed sample was generated. (We are thus assuming that the model is correctly specified, in the sense that there is a value of the model parameters consistent with what we observe.)

  • The joint density (3.1) can be regarded as a function of either of two arguments:
(i) for a given $\theta \in \Theta$: $\underline{w}^n \mapsto f(\underline{w}^n; \theta)$ describes the density of $w^n$ associated to that $\theta$ (this is the usage that you are already familiar with); and
(ii) for a given $\underline{w}^n \in \mathcal{W}^n$: $\theta \mapsto f(\underline{w}^n; \theta)$ describes the likelihood with which $\theta$ could have generated the sample $\underline{w}^n$. Despite the terminology, a ‘likelihood’ should not be interpreted as a probability (unless one is interested in performing Bayesian inference, which is another matter entirely).

  • In particular, if we now set $\underline{w}^n = w^n$, i.e. if we evaluate the density at the observed (i.e. realised) sample $w^n$, then $\theta \mapsto f(w^n; \theta)$ describes what is termed the likelihood function; we shall denote this by
$$L_n(\theta) := L_n(w^n, \theta) := f(w^n; \theta).$$
  – For each $\theta \in \Theta$, the value of $L_n(\theta)$ reports the likelihood of $\theta$. Note that this value depends on the realised value of $w^n$, and thus is itself a random variable, with a sampling distribution. [I.e. it is a ‘random function’, much like the GMM criterion $Q_n(\theta)$.]

  • The maximum likelihood estimator (MLE) is defined as the maximiser of the likelihood function. For reasons that shall become clearer below, it is often easier to work with either the loglikelihood function, $\ell_n(\theta) := \log L_n(\theta)$, or the average loglikelihood function,
$$\bar\ell_n(\theta) := \frac{1}{n}\ell_n(\theta) = \frac{1}{n}\log L_n(\theta)$$
(the role of the $n^{-1}$ standardisation will become clearer below). Since the logarithm is strictly monotone, $L_n$, $\ell_n$ and $\bar\ell_n$ each share the same maximiser, whence we may equivalently define the MLE as
$$\hat\theta_n := \operatorname*{argmax}_{\theta\in\Theta} \bar\ell_n(\theta).$$

3.1.3 The likelihood function: with i.i.d. data

  • In this course, we shall focus on the case where the data is i.i.d. with marginal density $p(w; \theta)$, so that the joint density of $w^n$ factorises as
$$f(\underline{w}_1, \ldots, \underline{w}_n; \theta) = \prod_{i=1}^n p(\underline{w}_i; \theta),$$
and thus the likelihood is the preceding evaluated at the realised data, i.e.
$$L_n(\theta) = f(w_1, \ldots, w_n; \theta) = \prod_{i=1}^n p(w_i; \theta).$$

  • The utility of the average loglikelihood function now becomes clear, since
$$\bar\ell_n(\theta) = \frac{1}{n}\log L_n(\theta) = \frac{1}{n}\log\prod_{i=1}^n p(w_i; \theta) = \frac{1}{n}\sum_{i=1}^n \log p(w_i; \theta)$$
is, for each fixed $\theta \in \Theta$, an average of the i.i.d. random variables $\{\log p(w_i; \theta)\}_{i=1}^n$. This will be important for analysing the consistency of the MLE.

  • In many cases, a further simplification is possible. Suppose that the data partitions as $w_i = (y_i, x_i)$, where $y_i$ and $x_i$ are respectively $\mathbb{R}^{d_y}$- and $\mathbb{R}^{d_x}$-valued, and that the model prescribes that the density of $w_i$ factorises as
$$p(y, x; \theta) = q(y \mid x; \theta)\, r(x), \tag{3.2}$$
where $r$ denotes the marginal density of $x_i$, and $q$ the conditional density of $y_i$ given $x_i = x$.
  – The key characteristic of (3.2) is not that the joint density of $(y_i, x_i)$ factorises as the product of a conditional and a marginal density – this is always possible – but that only the conditional distribution depends on the parameters $\theta$.
  – We can thus afford to remain entirely agnostic about the marginal distribution of $x_i$ – just as we would want to in a regression setting (see below). Noting that, in this case,
$$\bar\ell_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log p(w_i; \theta) = \frac{1}{n}\sum_{i=1}^n \log q(y_i \mid x_i; \theta) + \frac{1}{n}\sum_{i=1}^n \log r(x_i),$$
where the second term does not depend on $\theta$, maximisation of the average loglikelihood is equivalent to maximisation of the average conditional loglikelihood,
$$\bar\ell_n^c(\theta) = \frac{1}{n}\sum_{i=1}^n \log q(y_i \mid x_i; \theta).$$

3.2 Univariate examples

3.2.1 Continuous random variables

Example 3.1 (Gaussian location–scale model).

  • $w_i \sim N[\mu, \sigma^2]$, so that $\theta = (\mu, \sigma^2) \in \mathbb{R} \times \mathbb{R}_+ = \Theta$. Each $w_i$ has density
$$p(w; \mu_0, \sigma_0^2) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\left[-\frac{(w - \mu_0)^2}{2\sigma_0^2}\right].$$
  • The loglikelihood for the i.i.d. sample $w^n := (w_1, \ldots, w_n)$ is thus
$$\ell_n(\mu, \sigma^2) = \sum_{i=1}^n \log p(w_i; \mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (w_i - \mu)^2.$$

  • Maximising this with respect to $\mu$, for any $\sigma$, is thus equivalent to minimising $\sum_{i=1}^n (w_i - \mu)^2$, which is minimised at $\hat\mu_n = \frac{1}{n}\sum_{i=1}^n w_i$, the sample mean; so this must be the MLE for $\mu_0$.

  • To find the MLE for $\sigma_0^2$, take the FOC with respect to $\sigma^2$,
$$0 = \frac{\partial\ell_n(\mu, \sigma^2)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (w_i - \mu)^2.$$
Evaluating at $\mu = \hat\mu_n$ and solving for $\sigma^2$ thus yields
$$\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n (w_i - \hat\mu_n)^2,$$
which we recognise as the sample variance of $w_i$, albeit standardised by $n^{-1}$ (rather than $(n-1)^{-1}$).

  • To verify that we have found a maximum, we might also check the second-order condition: thus
$$\left.\frac{\partial^2\ell_n(\mu, \sigma^2)}{\partial(\sigma^2)^2}\right|_{\sigma^2=\hat\sigma^2} = \frac{n}{2(\hat\sigma^2)^2} - \frac{1}{(\hat\sigma^2)^3}\sum_{i=1}^n (w_i - \hat\mu_n)^2 = -\frac{n}{2(\hat\sigma^2)^2} < 0,$$
and thus we have found a local maximum. Since there are no other local maxima (there is only one solution to the FOC), and $\sigma^2 = 0$ is clearly not a maximiser, $\hat\sigma^2$ indeed maximises the loglikelihood. The closed forms can be confirmed by direct numerical maximisation, as sketched below.
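A minimal numerical check (a sketch with simulated data; the optimiser works on the negative loglikelihood, and $\sigma^2$ is parametrised on the log scale to keep it positive):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
w = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_loglik(params):
    mu, log_s2 = params
    s2 = np.exp(log_s2)
    return 0.5 * len(w) * np.log(2 * np.pi * s2) + np.sum((w - mu) ** 2) / (2 * s2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]))
mu_hat, s2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, w.mean())   # agree: the MLE for mu is the sample mean
print(s2_hat, w.var())    # agree: np.var uses the n^{-1} standardisation
```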

  • In the preceding example, the sample mean and variance happened to be the maximum likelihood estimators. This was a consequence of $w_i$ having been assumed Gaussian, and will not always be the case, as the following example illustrates (see also the problem set).


Example 3.2 (uniform model).

  • $w_i \sim U[0, \theta]$, with $\theta \in \mathbb{R}_+ = \Theta$. $w_i$ thus has density
$$p(w; \theta_0) = \mathbf{1}\{w \in [0, \theta_0]\}\,\frac{1}{\theta_0}.$$

  • So the loglikelihood for the i.i.d. sample $w^n := (w_1, \ldots, w_n)$ is
$$\ell_n(\theta) = \sum_{i=1}^n \log\left[\mathbf{1}\{w_i \in [0, \theta]\}\frac{1}{\theta}\right] = \sum_{i=1}^n \log\mathbf{1}\{w_i \in [0, \theta]\} - n\log\theta.$$

  • How do we find the maximiser? This function is not sufficiently well-behaved for us to blindly take FOCs with respect to $\theta$. Instead, we note that:
(i) $\log\mathbf{1}\{w_i \in [0, \theta]\} = -\infty$ if $w_i \notin [0, \theta]$, and zero otherwise: and so $\hat\theta_n$ must be at least equal to $\max_{i\leq n} w_i$;
(ii) the second term, $-n\log\theta$, penalises larger values of $\theta$.
Deduce that the maximiser, and thus the MLE, is $\hat\theta_n = \max_{i\leq n} w_i$.

  • Observe that in this example, another consistent estimate of $\theta_0$ could be provided by
$$\tilde\theta_n = \frac{2}{n}\sum_{i=1}^n w_i \overset{p}{\to} 2E w_i = 2\int_0^{\theta_0} \frac{w}{\theta_0}\,\mathrm{d}w = 2\left[\frac{w^2}{2\theta_0}\right]_0^{\theta_0} = \theta_0;$$
this turns out to be much less efficient, as the simulation sketched below illustrates.
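A quick simulation (a sketch, not from the notes) makes the efficiency gap vivid: the MLE's error shrinks at rate $n^{-1}$, the moment-based estimator's only at $n^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n, reps = 1.0, 200, 5000
w = rng.uniform(0, theta0, size=(reps, n))

mle = w.max(axis=1)          # theta_hat = max_i w_i
mom = 2 * w.mean(axis=1)     # theta_tilde = 2 * sample mean

print(np.mean((mle - theta0) ** 2))   # roughly 2*theta0^2/n^2: tiny
print(np.mean((mom - theta0) ** 2))   # roughly theta0^2/(3n): far larger
```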

3.2.2 Discrete random variables

  • Before proceeding to examples involving discrete (and mixed continuous/discrete) random variables, we first recall some facts about distribution functions.

  • A random variable $w_i$ is said to have a:
(i) continuous distribution with (Lebesgue) density $p$, if
$$F(w) := P\{w_i \leq w\} = \int_{-\infty}^{w} p(\underline{w})\,\mathrm{d}\underline{w}, \tag{3.3}$$
in which case the distribution function $F$ is continuous (more precisely, it is absolutely continuous);


(ii) discrete distribution with support $\mathcal{W}$ and probability mass function $p$, if
$$F(w) := P\{w_i \leq w\} = \sum_{\{\underline{w}\in\mathcal{W} \mid \underline{w}\leq w\}} p(\underline{w}), \tag{3.4}$$
in which case $F$ is merely right-continuous, with jumps of size $p(w)$ at each $w \in \mathcal{W}$ (and is otherwise flat); we say that $w_i$ has probability mass $p(w)$ at $w \in \mathcal{W}$.

  • Up until now, we have reserved the term ‘density’ for such a $p$ as appears in (3.3), but there is a more expansive notion of ‘density’ that also encompasses the probability mass function of a discrete random variable, i.e. such a $p$ as appears in (3.4). For the purposes of constructing the likelihood, it is this extended notion of ‘density’ that is the relevant one.

Example 3.3 (binary outcomes).

  • $w_i \sim_{\text{i.i.d.}}$ Bernoulli$[\theta]$: that is, $w_i$ takes values in $\mathcal{W} = \{0, 1\}$ with probabilities
$$P\{w_i = 1\} = \theta, \qquad P\{w_i = 0\} = 1 - \theta,$$
and $\theta \in [0, 1] = \Theta$. [This is useful for modelling outcomes such as: employment status (whether employed); educational attainment (whether completed high school); enrolment in a training programme.]

  • The probability mass function – and therefore the ‘density’ of $w_i$ – is thus
$$p(w; \theta_0) = (1 - \theta_0)\mathbf{1}\{w = 0\} + \theta_0\mathbf{1}\{w = 1\} \overset{(2)}{=} \theta_0^w(1 - \theta_0)^{1-w}, \qquad w \in \{0, 1\},$$
where $\overset{(2)}{=}$ is valid since $w_i$ only takes values in $\mathcal{W} = \{0, 1\}$.

  • The loglikelihood for the i.i.d. sample $w^n := (w_1, \ldots, w_n)$ is thus
$$\ell_n(\theta) = \sum_{i=1}^n \log[\theta^{w_i}(1 - \theta)^{1-w_i}] = \left(\sum_{i=1}^n w_i\right)\log\theta + \left(n - \sum_{i=1}^n w_i\right)\log(1 - \theta).$$

  • Before computing the maximiser, let us note that in the case where $w_i = 1$ for all $i \in \{1, \ldots, n\}$, the likelihood reduces to
$$\ell_n(\theta) = n\log\theta,$$


which is increasing in $\theta$, and thus $\hat\theta_n = 1$ is the MLE, because $\Theta = [0, 1]$. (Similarly, if only zeros are observed, then the MLE would be zero.)

  • In the more general case where both outcomes are observed, taking the FOC w.r.t. $\theta$ yields
$$0 = \frac{\partial\ell_n(\theta)}{\partial\theta} = \frac{\sum_{i=1}^n w_i}{\theta} - \frac{n - \sum_{i=1}^n w_i}{1 - \theta} \implies 0 = \frac{\frac{1}{n}\sum_{i=1}^n w_i}{\theta} - \frac{1 - \frac{1}{n}\sum_{i=1}^n w_i}{1 - \theta},$$
whence $\hat\theta_n = \frac{1}{n}\sum_{i=1}^n w_i$. (Once again, the MLE is the sample average!)

  • Once again, we should check the SOC,
$$\frac{\partial^2\ell_n(\theta)}{\partial\theta^2} = -\frac{\sum_{i=1}^n w_i}{\theta^2} - \frac{n - \sum_{i=1}^n w_i}{(1 - \theta)^2} \overset{(2)}{<} 0,$$
where $\overset{(2)}{<}$ holds for all $\theta \in (0, 1)$. Thus $\ell_n$ is strictly concave, and has a global maximum at $\hat\theta_n$.

Example 3.4 (Poisson model for count data).

  • $w_i \sim$ Poisson$[\theta]$: takes values in $\mathcal{W} = \{0, 1, 2, \ldots\}$ with probabilities
$$P\{w_i = w\} = \frac{\theta^w e^{-\theta}}{w!} =: p(w; \theta),$$
where $\theta \in \mathbb{R}_+$. [Used for modelling, e.g., the number of job offers received by a jobseeker during a given time interval.]

  • The loglikelihood for the i.i.d. sample $w^n := (w_1, \ldots, w_n)$ is
$$\ell_n(\theta) = \sum_{i=1}^n \log p(w_i; \theta) = \left(\sum_{i=1}^n w_i\right)\log\theta - n\theta - \sum_{i=1}^n \log(w_i!).$$

  • It is left as an exercise to show that $\hat\theta_n = \frac{1}{n}\sum_{i=1}^n w_i$ also in this case; the sketch below checks this numerically.
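A couple of lines suffice for the numerical check (a sketch with simulated counts; `gammaln(w + 1)` computes $\log(w!)$):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(3)
w = rng.poisson(lam=2.5, size=400)

# negative Poisson loglikelihood
nll = lambda th: -(np.sum(w) * np.log(th) - len(w) * th - np.sum(gammaln(w + 1)))
res = minimize_scalar(nll, bounds=(1e-6, 20), method="bounded")
print(res.x, w.mean())   # agree: the MLE is the sample average
```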

3.2.3 Mixed continuous/discrete random variables

  • This is not quite the end of the story, because some random variables of interest are neither continuous nor discrete, but a mixture of the two. The classic example comes from censoring a continuously distributed random variable.


Example 3.5 (censored Gaussian distribution).

  • Consider the case where $w_i$ is formed by censoring $w_i^* \sim N[\mu, 1]$ at zero:
$$w_i = \max\{w_i^*, 0\}.$$
[For example, $w_i^*$ might represent an individual's ‘latent propensity’ to purchase meat products, and $w_i$ his or her observed expenditure on such products.]

  • The distribution function of $w_i$ is given by
$$F(w) := P\{w_i \leq w\} = \begin{cases} 0 & \text{if } w < 0, \\ P\{w_i^* \leq w\} & \text{if } w \geq 0, \end{cases} \;=\; \mathbf{1}\{w \geq 0\}\,\Phi(w - \mu),$$
which:
  – is zero for $w < 0$;
  – has a jump (and thus a probability mass) of size $\Phi(-\mu)$ at $w = 0$ (the l.h. censor point); and
  – is continuous, with derivative $\phi(w - \mu)$, for $w > 0$.
[In the standard notation, $\phi$ and $\Phi$ respectively denote the standard Gaussian density and distribution function.]

  • For a mixed continuous/discrete random variable of this kind, the ‘density’ (understood in the extended sense) at a point $w \in \mathbb{R}$ corresponds to
  – the probability mass at $w$, if $F$ is discontinuous there;
  – the derivative of $F$ at $w$, if $F$ is continuous (and differentiable) there.

  • Thus the ‘density’ of $w_i$ is
$$p(w; \mu) = \mathbf{1}\{w = 0\}\,\Phi(-\mu) + \mathbf{1}\{w > 0\}\,\phi(w - \mu). \tag{3.5}$$

  • More generally, suppose that the distribution function $F$ of $w_i$:
  – has jumps at each $w \in \mathcal{W}_d$, which are necessarily of size $P\{w_i = w\}$;
  – is otherwise continuous (and differentiable), with derivative $f(w)$ for $w \notin \mathcal{W}_d$.


Then the ‘density’ of $w_i$ is defined as
$$p(w) := \mathbf{1}\{w \in \mathcal{W}_d\}\,P\{w_i = w\} + \mathbf{1}\{w \notin \mathcal{W}_d\}\,f(w).$$
  – You might find it a little disturbing that the value reported by $p$ may be either a probability mass, or a (Lebesgue) density, depending on whether or not it is evaluated at a point in $\mathcal{W}_d$.
  – But so long as we also keep track of $\mathcal{W}_d$, $p$ can be used to faithfully reconstruct the distribution function of $w_i$, via
$$F(w) = \sum_{\{\underline{w}\in\mathcal{W}_d \mid \underline{w}\leq w\}} p(\underline{w}) + \int_{\{\underline{w}\notin\mathcal{W}_d \mid \underline{w}\leq w\}} p(\underline{w})\,\mathrm{d}\underline{w} \overset{(2)}{=} \sum_{\{\underline{w}\in\mathcal{W}_d \mid \underline{w}\leq w\}} P\{w_i = \underline{w}\} + \int_{\{\underline{w}\notin\mathcal{W}_d \mid \underline{w}\leq w\}} f(\underline{w})\,\mathrm{d}\underline{w},$$
where $\overset{(2)}{=}$ follows by the definition of $f$.

Example 3.6 (censored Gaussian distribution; continued).

  • Suppose $w^n := (w_1, \ldots, w_n)$ is an i.i.d. sample of random variables of the form $w_i = \max\{w_i^*, 0\}$ for $w_i^* \sim N[\mu, 1]$. Recall from (3.5) that $w_i$ has density
$$p(w; \mu) = \mathbf{1}\{w = 0\}\,\Phi(-\mu) + \mathbf{1}\{w > 0\}\,\phi(w - \mu).$$

  • Thus the sample loglikelihood is
$$\ell_n(\mu) = \sum_{i=1}^n \log\left\{\mathbf{1}\{w_i = 0\}\,\Phi(-\mu) + \mathbf{1}\{w_i > 0\}\,\phi(w_i - \mu)\right\} = \#\{w_i = 0\}\log\Phi(-\mu) + \sum_{\{i \mid w_i > 0\}}\log\phi(w_i - \mu) \overset{(3)}{=} n_0\log\Phi(-\mu) - \frac{n_1}{2}\log 2\pi - \frac{1}{2}\sum_{\{i \mid w_i > 0\}}(w_i - \mu)^2,$$
where $n_0 := \#\{w_i = 0\}$ and $n_1 := \#\{w_i > 0\}$, the number of censored and uncensored observations respectively; and $\overset{(3)}{=}$ follows from $\phi(x) = (2\pi)^{-1/2}e^{-x^2/2}$.

  • There is no closed form solution for the maximum likelihood estimator in this case: it would have to be computed numerically, as in the sketch below.
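A sketch of the numerical maximisation ($\sigma$ fixed at 1, as in the example; simulated data; `norm.logcdf` and `norm.logpdf` keep the computation numerically stable):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(4)
w_star = rng.normal(loc=0.5, size=1000)
w = np.maximum(w_star, 0.0)               # censor at zero

def neg_loglik(mu):
    n0 = np.sum(w == 0)                   # censored observations
    pos = w[w > 0]                        # uncensored observations
    return -(n0 * norm.logcdf(-mu) + np.sum(norm.logpdf(pos - mu)))

res = minimize_scalar(neg_loglik, bounds=(-3, 3), method="bounded")
print(res.x)   # close to 0.5 for large n
```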

  • We might instead have assumed that $w_i^* \sim N[\mu, \sigma^2]$, rather than forcing $\sigma^2 = 1$. It is left as an exercise to show that, in this case, the sample loglikelihood is
$$\ell_n(\mu, \sigma^2) = n_0\log\Phi\left(\frac{-\mu}{\sigma}\right) - \frac{n_1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{\{i \mid w_i > 0\}}(w_i - \mu)^2.$$
3.3 Models with covariates

  • Many of the preceding models can be rendered more flexible by allowing certain

parameters to depend on a set of covariates xi. Example 3.7 (homoskedastic Gaussian regression model).

  • Recall the univariate Gaussian location–scale model (Example 3.1) postulated wi ∼

N[µ0, σ2

0]. We now wish to generalise this to

yi | xi ∼ N[xT

i β0, σ2 0],

(3.6) so that the parameter µ0 has effectively been replaced by the linear index xT

i β0. [In

the usual parlance, the conditional mean of the outcome yi is now said to vary with ‘individual characteristics’ xi or with ‘observable heterogeneity’.]

  • An entirely equivalent way of writing (3.6) is

yi = xT

i β0 + σ0ǫi

where ǫi ∼ N[0, 1], independent of xi.

  • In view of (3.6), the density of yi conditional on xi clearly depends on (β0, σ2

0), but

the marginal distribution of xi does not. Thus the joint density of (yi, xi) conveniently factorises as p(y, x; β0, σ0) = q(y | x; β0, σ2

0)r(x).

  • As discussed in Section 3.1.3 above, for the purposes of calculating the MLE it

suffices to consider the conditional loglikelihood, ℓc

n(β, σ2) = n

  • i=1

log q(yi | xi; β, σ2) =

n

  • i=1

log 1 σφ yi − xT

i β

σ

  • = −n

2 log(2πσ2) − 1 2σ2

n

  • i=1

(yi − xT

i β)2.

  • This bears more than a passing resemblance to the OLS criterion function, and

indeed it is immediately clear that the MLE for β0 must be the OLS estimator. It is 55


left as an exercise to show that the MLE for $\sigma_0^2$ is
$$\hat\sigma_n^2 := \frac{1}{n}\sum_{i=1}^n (y_i - x_i^T\hat\beta_n)^2.$$

Example 3.8 (heteroskedastic Gaussian regression model).

  • Now suppose we were additionally to replace $\sigma_0^2$ by $\sigma^2(z_i^T\gamma_0)$, where $z_i$ is another collection of covariates (possibly overlapping partially or wholly with $x_i$). Here $\sigma^2 : \mathbb{R} \to \mathbb{R}_+$ denotes some increasing, positive-valued transformation, such as the exponential function.

  • We could postulate
$$y_i \mid x_i, z_i \sim N[x_i^T\beta_0, \sigma^2(z_i^T\gamma_0)],$$
or equivalently
$$y_i = x_i^T\beta_0 + \sigma(z_i^T\gamma_0)\epsilon_i.$$

  • In this case, the conditional density of $y_i$ given $(x_i, z_i)$ is
$$q(y_i \mid x_i, z_i; \beta_0, \gamma_0) = \frac{1}{\sigma(z_i^T\gamma_0)}\phi\left(\frac{y_i - x_i^T\beta_0}{\sigma(z_i^T\gamma_0)}\right),$$
whence the conditional loglikelihood is
$$\ell_n^c(\beta, \gamma) = -\frac{n}{2}\log 2\pi - \sum_{i=1}^n \log\sigma(z_i^T\gamma) - \frac{1}{2}\sum_{i=1}^n \frac{(y_i - x_i^T\beta)^2}{\sigma^2(z_i^T\gamma)}.$$

  • For a given value of $\gamma$, $\ell_n^c(\beta, \gamma)$ is thus maximised with respect to $\beta$ by the generalised least squares (GLS) estimator, where the variance of the $i$th residual is assumed to be $\sigma_i^2 = \sigma^2(z_i^T\gamma)$.

  • However, since there is evidently no way of maximising $\ell_n^c$ by sequentially maximising with respect to $\gamma$ and then $\beta$ (or vice versa), the only way to compute the MLE is by maximising $\ell_n^c$ simultaneously with respect to $\beta$ and $\gamma$ (which typically must be done numerically).

  • This procedure yields efficient estimates of both the regression coefficients $\beta_0$ and the parameters $\gamma_0$ describing the conditional variance of the errors.

Example 3.9 (probit and logit regression).

  • Recall that the binary outcomes model (Example 3.3) postulated that $w_i$ could take values in $\mathcal{W} = \{0, 1\}$, with $P\{w_i = 1\} = \theta$.

  • Once again, we should like to generalise this model by allowing
$$P\{y_i = 1 \mid x_i\} = \theta(x_i^T\beta_0),$$
where $\theta : \mathbb{R} \to [0, 1]$ is a continuous, increasing function that maps the linear index $x_i^T\beta_0$ into the unit interval, with $\lim_{z\to-\infty}\theta(z) = 0$ and $\lim_{z\to+\infty}\theta(z) = 1$. [In this way, the probability of, say, being employed is allowed to depend on an individual's observable characteristics.]

  • $\theta(\cdot)$ is thus a distribution function; common choices here are the standard Gaussian and standard logistic distribution functions, denoted $\Phi$ and $\Lambda$ respectively. These choices respectively yield the probit and logistic regression models.

  • We shall now consider the probit model in a little more detail. Observe that specifying
$$P\{y_i = 1 \mid x_i\} = \Phi(x_i^T\beta_0) \tag{3.7}$$
is equivalent to specifying
$$y_i = \mathbf{1}\{x_i^T\beta_0 + \epsilon_i \geq 0\} \tag{3.8}$$
for $\epsilon_i \sim N[0, 1]$ independent of $x_i$, since under (3.8),
$$P\{y_i = 1 \mid x_i\} = P\{x_i^T\beta_0 + \epsilon_i \geq 0 \mid x_i\} = P\{\epsilon_i \geq -x_i^T\beta_0 \mid x_i\} = 1 - \Phi(-x_i^T\beta_0) \overset{(4)}{=} \Phi(x_i^T\beta_0),$$
where $\overset{(4)}{=}$ follows by the symmetry of the standard Gaussian distribution around the origin.

  • The conditional ‘density’ (in the extended sense) of $y_i$ given $x_i$ is just its probability mass function, which in view of (3.7) is
$$q(y \mid x; \beta_0) = \Phi(x^T\beta_0)^y[1 - \Phi(x^T\beta_0)]^{1-y}, \qquad y \in \{0, 1\}.$$

  • As in the immediately preceding examples, since the marginal density of $x_i$ is assumed


not to depend on $\beta_0$, the MLE may be characterised as the maximiser of
$$\ell_n^c(\beta) = \sum_{i=1}^n \log q(y_i \mid x_i; \beta) = \sum_{i=1}^n \left\{y_i\log\Phi(x_i^T\beta) + (1 - y_i)\log[1 - \Phi(x_i^T\beta)]\right\} = \sum_{\{i \mid y_i=1\}}\log\Phi(x_i^T\beta) + \sum_{\{i \mid y_i=0\}}\log[1 - \Phi(x_i^T\beta)].$$
A numerical sketch follows.
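A sketch of probit estimation by direct maximisation (simulated data; in practice one might prefer a packaged routine, but the computation is only a few lines; the identity $1 - \Phi(z) = \Phi(-z)$ is used for numerical stability):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([0.2, 1.0])
y = (X @ beta0 + rng.normal(size=n) >= 0).astype(float)   # the latent model (3.8)

def neg_loglik(b):
    z = X @ b
    return -np.sum(y * norm.logcdf(z) + (1 - y) * norm.logcdf(-z))

beta_hat = minimize(neg_loglik, np.zeros(2)).x
print(beta_hat)   # close to beta0
```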

Example 3.10 (censored regression).

  • Continuing with Examples 3.5–3.6: suppose that $y_i^*$ measures an individual's underlying (or ‘latent’) ‘propensity’ to purchase meat, and this is modelled as a Gaussian regression,
$$y_i^* = x_i^T\beta_0 + \sigma_0\epsilon_i, \tag{3.9}$$
where $\epsilon_i \sim N[0, 1]$, independent of $x_i$. We observe actual expenditure on meat, which is related to $y_i^*$ through
$$y_i = \max\{y_i^*, 0\}. \tag{3.10}$$

  • (3.9) and (3.10) entail that, conditional on $x_i$, the distribution function of $y_i$ is
$$F(y \mid x) = P\{y_i \leq y \mid x_i = x\} = \begin{cases} 0 & \text{if } y < 0, \\ P\{y_i^* \leq y \mid x_i = x\} & \text{otherwise,} \end{cases} \;=\; \mathbf{1}\{y \geq 0\}\,\Phi\left(\frac{y - x^T\beta_0}{\sigma_0}\right).$$

  • This function:
  – is zero for $y < 0$;
  – has a jump at $y = 0$ of size $\Phi\left(\frac{-x^T\beta_0}{\sigma_0}\right) = 1 - \Phi\left(\frac{x^T\beta_0}{\sigma_0}\right)$ (which is necessarily equal to $P\{y_i = 0 \mid x_i = x\}$); and
  – is continuous on $(0, \infty)$, with derivative
$$\frac{\partial}{\partial y}\Phi\left(\frac{y - x^T\beta_0}{\sigma_0}\right) = \frac{1}{\sigma_0}\phi\left(\frac{y - x^T\beta_0}{\sigma_0}\right).$$

  • The conditional ‘density’ of $y_i$ given $x_i$, in the extended sense, is thus
$$q(y \mid x; \beta_0, \sigma_0) = \mathbf{1}\{y = 0\}\left[1 - \Phi\left(\frac{x^T\beta_0}{\sigma_0}\right)\right] + \mathbf{1}\{y > 0\}\,\frac{1}{\sigma_0}\phi\left(\frac{y - x^T\beta_0}{\sigma_0}\right).$$
  • Hence the conditional loglikelihood is
$$\ell_n^c(\beta, \sigma) = \sum_{i=1}^n \log q(y_i \mid x_i; \beta, \sigma) = \sum_{\{i \mid y_i=0\}}\log\left[1 - \Phi\left(\frac{x_i^T\beta}{\sigma}\right)\right] + \sum_{\{i \mid y_i>0\}}\log\left[\frac{1}{\sigma}\phi\left(\frac{y_i - x_i^T\beta}{\sigma}\right)\right],$$
which can be further simplified, if so desired, using the formula for $\phi$. (The utility of doing so is somewhat limited, since the MLE does not have a closed-form solution in this case.)

3.4 Consistency and identification

3.4.1 Consistency

  • As with GMM, since closed-form expressions for the MLE are generally unavailable, consistency of the MLE will have to be inferred from the ‘consistency’ of the associated criterion function: here, the average loglikelihood.

  • Suppose that the model assumes that the data $\{w_i\}_{i=1}^n$ is i.i.d. with marginal density $p(w; \theta)$ (this may be a ‘density’ in the extended sense). The MLE maximises
$$\bar\ell_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log p(w_i; \theta). \tag{3.11}$$
  – The arguments that follow encompass the case where $w_i = (y_i, x_i)$, with a density that factorises as $p(w; \theta) = q(y \mid x; \theta)r(x)$. Although we may compute the MLE as the maximiser of $\bar\ell_n^c(\theta) = \frac{1}{n}\sum_{i=1}^n \log q(y_i \mid x_i; \theta)$ in this case, the resultant estimator nevertheless maximises (3.11). (We would not generally want to compute (3.11), as we wish to remain agnostic about the marginal density $r$.)

  • By the LLN, provided $E|\log p(w_i; \theta)| < \infty$, we have
$$\bar\ell_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log p(w_i; \theta) \overset{p}{\to} E\log p(w_i; \theta) =: \ell_0(\theta). \tag{3.12}$$

  • On the basis of (3.12), the same arguments that we applied to the GMM estimator will allow us to infer that the MLE, being the maximiser of $\bar\ell_n$, should converge to the maximiser of $\ell_0$. This will indeed be the case, if the convergence in (3.12) is suitably strengthened [this is covered in the second-year course].

  • It thus remains to show that $\theta_0$ maximises $\ell_0$: i.e. we must show that the loglikelihood is sufficient to identify $\theta_0$. This will only be true under certain additional assumptions, the most crucial of which is

ML-ID For each $\theta \neq \theta_0$, there exists a $w \in \mathcal{W}$ such that $p(w; \theta) \neq p(w; \theta_0)$.

This means that $\theta$'s distinct from $\theta_0$ are associated with distinct density functions. If this condition does not hold, then the model is redundantly parametrised, in the sense that changing (some elements of) $\theta$ will have no effect on the implied distribution of $w_i$.

  • Upon reflection, it should make perfect sense that if $\theta'$ and $\theta_0$ both imply the same density (and thus, distribution) for $w_i$, there is no way we could ever distinguish between these two parameters, no matter how much data we have at our disposal. Indeed, it must be the case that the loglikelihoods always agree in this case, since
$$\ell_n(\theta') = \sum_{i=1}^n \log p(w_i; \theta') = \sum_{i=1}^n \log p(w_i; \theta_0) = \ell_n(\theta_0).$$
Hence ML-ID is fundamentally necessary for identification.

3.4.2 Identification via Kullback–Leibler minimisation

  • Rather than regard the MLE as the maximiser of $\bar\ell_n$, we may equivalently regard it as maximising the ‘centred’ average loglikelihood,
$$M_n(\theta) := \bar\ell_n(\theta) - \bar\ell_n(\theta_0) = \frac{1}{n}\sum_{i=1}^n \log\frac{p(w_i; \theta)}{p(w_i; \theta_0)}.$$


Again by the LLN,
$$M_n(\theta) \overset{p}{\to} E\log\frac{p(w_i; \theta)}{p(w_i; \theta_0)} = \ell_0(\theta) - \ell_0(\theta_0) =: M_0(\theta).$$

  • We shall now show that $M_0(\theta)$ is uniquely maximised at $\theta = \theta_0$. In this case, $\theta_0$ is identified from the likelihood function and – by the arguments noted above – the MLE will be consistent for $\theta_0$.
  – To simplify the argument that follows, let us suppose that $w_i$ is continuously distributed, with density $p(w; \theta_0)$.

  • To this end, we first consider a closely related quantity,
$$d_{KL}(f\|g) = \int \log\left[\frac{f(w)}{g(w)}\right]f(w)\,\mathrm{d}w = E_f\log\frac{f(w_i)}{g(w_i)},$$
termed the Kullback–Leibler (KL) divergence or relative entropy between $f$ and $g$; here the ‘$f$’ subscript on $E_f$ signifies that $w_i$ has density $f$.
(i) $d_{KL}(f\|g)$ measures the extent of the ‘disagreement’ between $f$ and $g$, in the sense that $d_{KL}(f\|g) \geq 0$, with equality if and only if $f(w) = g(w)$ for all $w \in \mathcal{W}$.
(ii) On the other hand, $d_{KL}(f\|g)$ is not a metric (a measure of distance) in the usual sense; it is not even symmetric, since in general $d_{KL}(f\|g) \neq d_{KL}(g\|f)$.

  • To verify the first claim, we recall the following version of Jensen's inequality: if $u$ is a strictly concave function, and $\eta_i$ a random variable, then $Eu(\eta_i) \leq u(E\eta_i)$, with equality if and only if $\eta_i$ is constant. Thus
$$d_{KL}(f\|g) = E_f\log\frac{f(w_i)}{g(w_i)} = -E_f\log\frac{g(w_i)}{f(w_i)} \geq -\log E_f\frac{g(w_i)}{f(w_i)} = -\log\int_{\mathcal{W}}\frac{g(w)}{f(w)}f(w)\,\mathrm{d}w = -\log\int_{\mathcal{W}}g(w)\,\mathrm{d}w = -\log 1 = 0.$$
  – The inequality holds strictly, unless the ratio $g(w_i)/f(w_i)$ is in fact constant.
  – It may be shown that if $f(w) \neq g(w)$ for at least some $w \in \mathcal{W}$, then this ratio cannot be constant. [This is intuitively obvious, but giving a completely rigorous argument involves some technicalities.]

  • Returning now to $M_0$, we see that
$$M_0(\theta) = E\log\frac{p(w_i; \theta)}{p(w_i; \theta_0)} = -E\log\frac{p(w_i; \theta_0)}{p(w_i; \theta)} \overset{(3)}{=} -d_{KL}(p_{\theta_0}\|p_\theta),$$


using $p_\theta$ as a shorthand for the function $w \mapsto p(w; \theta)$; $\overset{(3)}{=}$ holds by recognising that $w_i$ is assumed to have density $p(w; \theta_0)$ (to signify which, we might have written ‘$E_{p_{\theta_0}}$’ instead of merely ‘$E$’).

  • Thus, maximising $M_0(\theta)$ with respect to $\theta$ is equivalent to choosing $\theta$ so as to minimise the KL divergence between $p_\theta$ and $p_{\theta_0}$.
  – Clearly, $d_{KL}(p_{\theta_0}\|p_{\theta_0}) = 0$, so $\theta = \theta_0$ is indeed a minimiser of this divergence.
  – Is it the only minimiser? Under ML-ID, it is: since that condition guarantees that for each $\theta \neq \theta_0$, there exists a $w \in \mathcal{W}$ for which $p(w; \theta_0) \neq p(w; \theta)$. (A small numerical illustration of the KL divergence follows.)
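To make $d_{KL}$ concrete: for two densities it can be evaluated by one-dimensional quadrature. The sketch below (illustrative densities, not from the notes) checks the non-negativity and asymmetry claims numerically for two Gaussians.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f = lambda w: norm.pdf(w, loc=0.0, scale=1.0)
g = lambda w: norm.pdf(w, loc=1.0, scale=2.0)

def d_kl(p, q):
    # integrate p * log(p/q) over a range wide enough to capture both densities
    return quad(lambda w: p(w) * np.log(p(w) / q(w)), -12, 12)[0]

print(d_kl(f, g), d_kl(g, f))   # both positive, and unequal: not symmetric
```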

3.5 Asymptotic distribution of the MLE

3.5.1 Asymptotic normality

  • The asymptotic distribution of the MLE can be derived via the same sort of linearisation argument as was used for GMM (Section 2.2.2). In fact, the arguments are slightly simpler. We shall assume

ML-DIFF $\theta \mapsto p(w; \theta)$ is twice continuously differentiable, for every $w \in \mathcal{W}$.

  • Once again, we proceed from the FOC for an interior maximum:
$$0 = \nabla_\theta\bar\ell_n(\hat\theta_n). \tag{3.13}$$
This condition holds only if $\hat\theta_n$ is not on the boundary of $\Theta$; recall from our analysis of GMM (Section 2.2.2 above) that we could deal with this requirement by assuming

INTR $\theta_0 \in \operatorname{int}\Theta$,

which together with the consistency of $\hat\theta_n$ ensures that the FOC (3.13) holds with probability approaching 1.
  – When might INTR fail? Consider again the simple binary outcomes model (Example 3.3). There $\Theta = [0, 1]$, and so $\operatorname{int}\Theta = (0, 1)$; INTR is thus excluding the possibility that $\theta_0$ is either 0 or 1 (in which case, the designated outcome either never occurs, or always occurs).

  • Using the same sort of argument as was given in the context of GMM, a mean-value expansion of $\nabla_\theta\bar\ell_n(\theta)$ around $\theta_0$, combined with the consistency of $\hat\theta_n$ (and the convergence of $\nabla^2_\theta\bar\ell_n(\theta)$ to $\nabla^2_\theta\ell_0(\theta)$) yields
$$\nabla_\theta\bar\ell_n(\hat\theta_n) = \nabla_\theta\bar\ell_n(\theta_0) + H_n(\hat\theta_n - \theta_0), \tag{3.14}$$


where $H_n \overset{p}{\to} \nabla^2_\theta\ell_0(\theta_0) =: H$. (Again, as discussed in the context of GMM, we need INTR and consistency to ensure the validity of (3.14).)

  • Applying (3.14) to (3.13) thus yields
$$n^{1/2}(\hat\theta_n - \theta_0) = -H_n^{-1}\, n^{1/2}\nabla_\theta\bar\ell_n(\theta_0). \tag{3.15}$$
It remains to determine the limiting behaviour of $\nabla_\theta\bar\ell_n(\theta_0)$.

  • In the usual terminology,
$$\nabla_\theta\bar\ell_n(\theta) = \frac{1}{n}\sum_{i=1}^n \nabla_\theta\log p(w_i; \theta) =: \frac{1}{n}\sum_{i=1}^n s(w_i; \theta)$$
is termed the score of the (average) loglikelihood function at $\theta$; a key property is
$$Es(w_i; \theta_0) = 0. \tag{3.16}$$

  • This can be proved in various ways; but since we have already shown that $\theta_0$ uniquely maximises $M_0$, and therefore $\ell_0$, it follows from the FOC for a maximum that
$$0 = \nabla_\theta\ell_0(\theta_0) = \nabla_\theta E\log p(w_i; \theta_0) \overset{(3)}{=} E\nabla_\theta\log p(w_i; \theta_0) = Es(w_i; \theta_0), \tag{3.17}$$
provided that the interchange of the expectation (an integral) and the derivative operator in $\overset{(3)}{=}$ can be justified [as will be the case under appropriate regularity conditions; see the second-year course].

  • In view of (3.16), provided $E\|s(w_i; \theta_0)\|^2 < \infty$, then
$$n^{1/2}\nabla_\theta\bar\ell_n(\theta_0) = \frac{1}{n^{1/2}}\sum_{i=1}^n s(w_i; \theta_0) \overset{d}{\to} N[0, S]$$
by the CLT, where $S := Es(w_i; \theta_0)s(w_i; \theta_0)^T$. Combining the preceding with (3.15) thus yields
$$n^{1/2}(\hat\theta_n - \theta_0) = -H_n^{-1}\, n^{1/2}\nabla_\theta\bar\ell_n(\theta_0) \overset{d}{\to} -H^{-1}\cdot N[0, S] \sim N[0, H^{-1}SH^{-1}] \tag{3.18}$$
by Slutsky's theorem.

  • Of course, (3.18) only makes sense if $H = \nabla^2_\theta\ell_0(\theta_0)$ is invertible (has full rank); we shall accordingly assume

ML-HESS $\operatorname{rk} H = d_\theta$.


Since $\ell_0$ is maximised at $\theta_0$, $H$ must be negative semi-definite there; and ML-HESS is equivalent to an assumption of negative definiteness.

  • To ensure that standard testing procedures (see below) work as intended, we should also require that

ML-VAR $S := Es(w_i; \theta_0)s(w_i; \theta_0)^T$ is positive definite.

  • The asymptotic behaviour of the MLE may thus be summarised as follows.

Theorem 3.1. Under INTR, ML-ID, ML-DIFF, ML-HESS, ML-VAR and further regularity conditions [discussed in the second-year course],
$$n^{1/2}(\hat\theta_n - \theta_0) \overset{d}{\to} N[0, H^{-1}SH^{-1}], \tag{3.19}$$
where $H = \nabla^2_\theta\ell_0(\theta_0)$ and $S := Es(w_i; \theta_0)s(w_i; \theta_0)^T$.

  • Remarks:
(i) The theorem does not apply to the MLE in the uniform model (Example 3.2 above): recall in that case $p(w; \theta) = \mathbf{1}\{w \in [0, \theta]\}\frac{1}{\theta}$, which is certainly not continuously differentiable, contrary to ML-DIFF. (Actually, the theorem can be shown to hold under conditions weaker than ML-DIFF – but even these weaker conditions still exclude the uniform model.)
(ii) Wald statistics require consistent estimates of $H$ and $S$; these can be computed as
$$\hat H_n := \nabla^2_\theta\bar\ell_n(\hat\theta_n), \qquad \hat S_n := \frac{1}{n}\sum_{i=1}^n s(w_i; \hat\theta_n)s(w_i; \hat\theta_n)^T,$$
as in the sketch below.
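The remark translates directly into code. A sketch (hypothetical inputs: `score_matrix` is the $n \times d_\theta$ array of per-observation scores at $\hat\theta_n$, and `H_hat` the Hessian of the average loglikelihood there):

```python
import numpy as np

def mle_sandwich_variance(score_matrix, H_hat):
    """Estimate the avar of n^{1/2}(theta_hat - theta_0) as H^{-1} S H^{-1}."""
    n = score_matrix.shape[0]
    S_hat = score_matrix.T @ score_matrix / n
    H_inv = np.linalg.inv(H_hat)
    V_hat = H_inv @ S_hat @ H_inv
    return V_hat   # standard errors: np.sqrt(np.diag(V_hat) / n)
```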

3.5.2 Efficiency properties

  • Theorem 3.1 is not quite the final word on the limiting distribution of the MLE.
  • The same conditions that are sufficient for that result are also sufficient for
$$H = -S,$$
whereupon (3.19) simplifies to
$$n^{1/2}(\hat\theta_n - \theta_0) \overset{d}{\to} N[0, S^{-1}]. \tag{3.20}$$

  • $S$ is termed the information matrix, and the relation $H = -S$ the information equality. Evidently, when the information equality holds, ML-HESS and ML-VAR are equivalent.

  • (3.20) is important because of the following result, known as the convolution theorem, which runs roughly as follows. Suppose the information equality holds [amongst other conditions], and let $\tilde\theta_n$ be any regular estimator of $\theta_0$. Then
$$n^{1/2}(\tilde\theta_n - \theta_0) \overset{d}{\to} \xi + \eta,$$
where $\xi \sim N[0, S^{-1}]$, and $\eta$ is another random variable (which will depend on the estimator $\tilde\theta_n$), independent of $\xi$.
  – For the MLE, $\eta = 0$ in view of (3.20): but for any other estimator, it need not be.
  – Because $\xi$ and $\eta$ are independent, $\xi + \eta$ will have a distribution that is more dispersed than that of $\xi$, by essentially any reasonable measure of dispersion. Formally, it can be shown that, for any bowl-shaped loss function $\rho(x)$, $E\rho(\xi + \eta) \geq E\rho(\xi)$. A function is bowl-shaped if its lower level sets $\{x \mid \rho(x) \leq c\}$ are convex, and symmetric around zero. This encompasses a very broad range of plausible loss functions, such as $\rho(x) = |x|$, $x^2$, and $x^2\mathbf{1}\{|x| \leq M\}$.
  – If we were to use any such loss function to quantify the (asymptotic) dispersion of an estimator, we would find that every regular estimator is (weakly) more dispersed than the MLE. Choosing $\rho(x) = x^2$ here amounts to making this comparison in terms of asymptotic variance, so it is indeed true that the MLE has the minimum asymptotic variance amongst all regular estimators.

  • What is meant by a ‘regular estimator’? It is beyond the scope of this course to give a precise definition; all we can say here is that the concept is very broad, certainly broad enough to cover all the estimators that we have thus far considered in this course. (For more details, see van der Vaart, 1998, Ch. 8.)

3.6 Hypothesis testing

  • Suppose now we are interested in testing the (possibly nonlinear) restriction
$$H_0 : r(\theta_0) = \rho \quad \text{against} \quad H_1 : r(\theta_0) \neq \rho.$$
Suppose that $r : \Theta \to \mathbb{R}^{d_r}$, with Jacobian $R := D_\theta r(\theta_0)$ having rank $d_r$.

  • In the present setting, there are three ways of carrying out such a test, which differ according to how they evaluate the extent to which $H_0$ is consistent with the observed sample. In order to describe each of these, let
$$\Theta_\rho := \{\theta \in \Theta \mid r(\theta) = \rho\}$$
denote the subset of parameters that are consistent with the restriction under test, and define
$$\hat\theta_{n,U} := \operatorname*{argmax}_{\theta\in\Theta}\bar\ell_n(\theta), \qquad \hat\theta_{n,R} := \operatorname*{argmax}_{\theta\in\Theta_\rho}\bar\ell_n(\theta),$$
the unrestricted and restricted estimators of $\theta_0$, respectively.
(i) The Wald test, as we have noted already, evaluates $H_0$ by computing $r(\hat\theta_{n,U}) - \rho$; i.e. it evaluates the restriction at the unrestricted estimator. This leads to the following test statistic,
$$W_n := n[r(\hat\theta_{n,U}) - \rho]^T(\hat R_n\hat V_n\hat R_n^T)^{-1}[r(\hat\theta_{n,U}) - \rho] \overset{d}{\to} \chi^2[d_r],$$
where $\hat R_n := D_\theta r(\hat\theta_{n,U})$ estimates $R$, and $\hat V_n$ estimates the limiting variance of $\hat\theta_{n,U}$; such a choice as $\hat V_n := \hat H_n^{-1}\hat S_n\hat H_n^{-1}$ would be appropriate. [In view of the information equality, either $\hat S_n^{-1}$ or $-\hat H_n^{-1}$ could also possibly be used.]
(ii) The likelihood ratio (LR) test evaluates $H_0$ by considering the extent to which the restriction leads to a reduction in the value of the maximised loglikelihood. Thus we have the test statistic
$$LR_n := 2\left[\max_{\theta\in\Theta}\ell_n(\theta) - \max_{\theta\in\Theta_\rho}\ell_n(\theta)\right] \overset{d}{\to} \chi^2[d_r].$$
(Note that it is indeed the loglikelihood $\ell_n$ that appears here, not the average loglikelihood.)
(iii) The Lagrange multiplier (LM) or score test is motivated as follows. Recall that our characterisation of $\theta_0$ as the maximiser of $\ell_0(\theta) = E\log p(w_i; \theta)$ led to $\nabla_\theta\ell_0(\theta_0) = 0$ (see (3.17) above). Indeed, the sample analogue of this condition holds exactly


(provided $\hat\theta_n$ is interior to $\Theta$): $\nabla_\theta\bar\ell_n(\hat\theta_{n,U}) = 0$. On the other hand, if we evaluate the score $\nabla_\theta\bar\ell_n$ at the restricted estimator $\hat\theta_{n,R}$, then $\nabla_\theta\bar\ell_n(\hat\theta_{n,R}) \neq 0$, in general. This can itself be used as a measure of the extent to which $H_0$ is consistent with the observed sample, and leads to the following test statistic,
$$LM_n := n\,\nabla_\theta\bar\ell_n(\hat\theta_{n,R})^T\hat S_n^{-1}\nabla_\theta\bar\ell_n(\hat\theta_{n,R}) \overset{d}{\to} \chi^2[d_r]. \tag{3.21}$$

  • Remarks:
(i) All three tests have the same limiting distribution, and are asymptotically equivalent.
(ii) The LR test is invariant to the parametrisation of the model and the formulation of the restrictions under test, whereas the other two tests are not.
(iii) The LM test only requires that we calculate the restricted estimator $\hat\theta_{n,R}$, which may be advantageous in some contexts (i.e. if the unrestricted estimator is particularly difficult to compute).
(iv) If the information equality fails to hold: then the Wald test (using $\hat V_n = \hat H_n^{-1}\hat S_n\hat H_n^{-1}$) remains valid; and it is possible to modify the LM statistic such that it also remains valid ($\hat S_n^{-1}$ in (3.21) needs to be replaced by another, rather more complicated matrix). However, the validity of the LR test hinges crucially on this condition. (An LR test in the Bernoulli model is sketched below.)
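As an illustration, a sketch of the LR test in the Bernoulli model of Example 3.3, for $H_0 : \theta_0 = 0.5$ (simulated data; here $\Theta_\rho = \{0.5\}$ is a single point, so the restricted maximum is just the loglikelihood evaluated there):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
w = rng.binomial(1, 0.55, size=500)

def loglik(th):
    return np.sum(w) * np.log(th) + (len(w) - np.sum(w)) * np.log(1 - th)

theta_hat = w.mean()                        # unrestricted MLE
LR = 2 * (loglik(theta_hat) - loglik(0.5))  # restricted maximiser is theta = 0.5
print(LR, chi2.sf(LR, df=1))                # statistic and p-value
```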

3.A Suggested (optional) further reading

  • I have not followed any particular reference here.
  • You may wish to consult Davidson and MacKinnon (2004, Ch. 10), Greene (2008, Ch. 16) and/or Wooldridge (2002, Ch. 13).


4 References

Davidson, R., and J. G. MacKinnon (2004): Econometric Theory and Methods. Oxford University Press.
Greene, W. H. (2008): Econometric Analysis. Pearson Prentice Hall, New Jersey (USA), 6th edn.
Hayashi, F. (2000): Econometrics. Princeton University Press.
van der Vaart, A. W. (1998): Asymptotic Statistics. Cambridge University Press, New York (USA).
Wooldridge, J. M. (2002): Econometric Analysis of Cross Section and Panel Data. MIT Press, 1st edn.


A Mathematical appendix

A.1 Notation

Norms. Each of the spaces $\mathbb{R}$, $\mathbb{R}^k$, or $\mathbb{R}^{k\times l}$ is equipped with a norm $\|x\|$, which reduces to $|x|$ when $x \in \mathbb{R}$; otherwise:

  • for $x := (x_1, \ldots, x_k)^T \in \mathbb{R}^k$, $\|x\| := \max_{i\leq k}|x_i|$;
  • for $x := [x_{ij}] \in \mathbb{R}^{k\times l}$, $\|x\| := \max_{i\leq k}\max_{j\leq l}|x_{ij}|$;

where the notation $x := [x_{ij}]$ indicates that $x$ is a matrix with $(i, j)$ element given by $x_{ij}$. These are not the only possible norms with which the spaces $\mathbb{R}^k$ and $\mathbb{R}^{k\times l}$ could be equipped, but they are particularly convenient for our purposes. (For example, $\|x\|_2 := (\sum_{i=1}^k x_i^2)^{1/2}$ is also a common choice on $\mathbb{R}^k$.)¹

A norm provides a measure of the ‘length’ of a vector (or matrix), and $\|x - y\|$ a measure of the ‘distance’ between two vectors (or matrices) $x$ and $y$. The three characteristic properties of a norm, which $\|\cdot\|$ inherits by its construction, are:

  • $\|x\| \geq 0$, with equality if and only if $x = 0$;
  • $\|\alpha x\| = |\alpha|\|x\|$ for all $\alpha \in \mathbb{R}$; and
  • the triangle inequality: $\|x - y\| \leq \|x - z\| + \|y - z\|$ for all $x, y, z \in \mathbb{R}^k$ (or $\mathbb{R}^{k\times l}$).

A.2 Matrices

Rank. Let $A$ be a $k \times l$ matrix. The rank of $A$, denoted $\operatorname{rk} A$, is the number of linearly independent rows, or the number of linearly independent columns; these two numbers necessarily agree (though this requires some effort to prove). Thus $\operatorname{rk} A \leq \min\{k, l\}$. We say that $A$ has full row rank if $\operatorname{rk} A = k$, and full column rank if $\operatorname{rk} A = l$: evidently only one of these can be true unless $A$ is square ($k = l$), in which case $A$ is said simply to be of full rank. Some useful properties, which we state here without proof, are:

  • if $A\delta \neq 0$ for all $\delta \in \mathbb{R}^l\setminus\{0\}$, then the columns of $A$ must be linearly independent, and thus $\operatorname{rk} A = l$;
  • a square matrix $A$ has an inverse if and only if it is full rank (we say, equivalently, that $A$ is nonsingular or invertible in this case);

¹None of our results depend on the specific choice of norm, since all norms on a finite-dimensional space are equivalent. That is, if $\|\cdot\|_*$ is another norm on $\mathbb{R}^k$, then there must exist nonzero, finite constants $c_0$ and $c_1$ such that $c_0\|x\| \leq \|x\|_* \leq c_1\|x\|$. Convergence (in probability, or in distribution) in one norm thus implies convergence in all the others.

  • $\operatorname{rk}(AB) \leq \min\{\operatorname{rk} A, \operatorname{rk} B\}$ for $B$ an $l \times m$ matrix, with equality if $A$ has full rank; and
  • $\operatorname{rk} A^T A = \operatorname{rk} A$.

Definiteness. A (real) symmetric $k \times k$ matrix $A$ is termed

  • positive semi-definite if $x^T A x \geq 0$ for all $x \in \mathbb{R}^k$; and
  • positive definite if $x^T A x > 0$ for all $x \in \mathbb{R}^k\setminus\{0\}$.

$A$ is termed negative definite (semi-definite) if $-A$ is positive definite (semi-definite). Various characterisations of definiteness are available, the most useful of which comes from the eigenvalues (or spectrum) of $A$.

Eigenvalues. Recall that the eigenvalues of a $k \times k$ matrix $A$ are given by the solutions to the equation
$$\det(\lambda I - A) = 0.$$
Since the l.h.s. is a $k$th order polynomial, it has $k$ roots, some of which may be repeated. $A$ thus has $k$ eigenvalues, though these need not all be distinct. For example, the eigenvalues of the matrix
$$A = \operatorname{diag}\{2, 2, 0\}$$
correspond to the solutions to $\det(\lambda I - A) = (\lambda - 2)^2\lambda = 0$, and so the eigenvalues of $A$ are $\{0, 2, 2\}$.

For a general (real) matrix $A$, there is no guarantee that the eigenvalues of $A$ will be real. However, the following are true:

  • if $A$ is symmetric, then the eigenvalues of $A$ are real, and the rank of $A$ is equal to the number of nonzero eigenvalues;
  • a symmetric matrix $A$ is positive definite (semi-definite) if and only if all its eigenvalues are strictly positive (non-negative);
  • the eigenvalues of a symmetric and idempotent matrix $A$ are either 0 or 1.


Spectral decomposition. If $A$ is a (real) symmetric $k \times k$ matrix, then it admits the decomposition
$$A = C\Lambda C^T,$$
where $\Lambda = \operatorname{diag}\{\lambda_1, \ldots, \lambda_k\}$ is a diagonal matrix whose diagonal entries correspond to the eigenvalues of $A$, and $C$ has the property that $C^{-1} = C^T$ ($C$ is termed an orthonormal matrix). You may directly verify that $A^{-1} = C\Lambda^{-1}C^T$, where $\Lambda^{-1} = \operatorname{diag}\{\lambda_1^{-1}, \ldots, \lambda_k^{-1}\}$.

Positive (semi-)definite square root. If $A$ is positive (semi-)definite, then the matrix $B := C\Lambda^{1/2}C^T$, where $\Lambda^{1/2} := \operatorname{diag}\{\lambda_1^{1/2}, \ldots, \lambda_k^{1/2}\}$, is itself positive (semi-)definite, and has the property that
$$B^2 = BB = C\Lambda^{1/2}C^T C\Lambda^{1/2}C^T = C\Lambda^{1/2}\Lambda^{1/2}C^T = C\Lambda C^T = A.$$
$B$ is thus a square root of the matrix $A$. You may verify that $B$ is positive (semi-)definite: in fact, it is the only square root of $A$ with this property. (A numerical sketch follows.)
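This construction is immediate to implement (a sketch using numpy's symmetric eigendecomposition; the clipping guards against tiny negative eigenvalues due to round-off):

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric PSD square root via the spectral decomposition A = C diag(l) C'."""
    lam, C = np.linalg.eigh(A)            # eigh: for symmetric matrices
    lam = np.clip(lam, 0.0, None)
    return (C * np.sqrt(lam)) @ C.T       # C diag(sqrt(lam)) C'

A = np.array([[2.0, 1.0], [1.0, 2.0]])
B = psd_sqrt(A)
assert np.allclose(B @ B, A)
```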

A.3 Asymptotics

A.3.1 Modes of stochastic convergence

Let $x_n$ denote a sequence of random scalars, vectors or matrices, taking values in $\mathcal{X}$ (a subset of $\mathbb{R}$, $\mathbb{R}^k$ or $\mathbb{R}^{l\times m}$).

Convergence in probability. We say that $x_n$ converges in probability to $x_\infty$, denoted $x_n \overset{p}{\to} x_\infty$, if for every $\epsilon > 0$,
$$\lim_{n\to\infty} P\{\|x_n - x_\infty\| > \epsilon\} = 0.$$
The definition allows that $x_\infty$ may itself be a random variable, although in almost all of the cases considered in this course, it will be a constant.


Convergence in distribution. Let $x_n$ have the distribution function $F_n$ (denoted $x_n \sim F_n$), i.e. $F_n(x) := P\{x_n \leq x\}$ for $x \in \mathcal{X}$; and let $x_\infty \sim F_\infty$, i.e. $F_\infty(x) := P\{x_\infty \leq x\}$. When $x_n$ is vector- or matrix-valued, the inequality $x_n \leq x$ is to be interpreted componentwise, i.e. for vectors $x_n = (x_{n,1}, \ldots, x_{n,k})^T$ and $x = (x_1, \ldots, x_k)^T$,
$$x_n \leq x \iff x_{n,j} \leq x_j, \quad \forall j \in \{1, \ldots, k\},$$
and similarly for matrices. We say $x_n$ converges in distribution to $x_\infty$, denoted $x_n \overset{d}{\to} x_\infty$, if
$$\lim_{n\to\infty} F_n(x) = F_\infty(x)$$
for every $x$ that is a continuity point of $F_\infty$.

A.3.2 Key results

Let $x_n$ and $y_n$ denote random sequences respectively taking values in $\mathcal{X}$ and $\mathcal{Y}$ (each subsets of either $\mathbb{R}$, $\mathbb{R}^k$ or $\mathbb{R}^{l\times m}$). The next theorem and its corollary provide conditions under which convergence in probability and distribution are preserved under continuous maps. (These are not the most general possible results of their kind, but they are sufficient for our purposes.)

Theorem A.1 (Slutsky). Suppose (i) $h : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}^{d_h}$ is continuous on $\mathcal{X}\times\{a\}$; (ii) $x_n \overset{d}{\to} x_\infty$; and (iii) $y_n \overset{p}{\to} a$, where $a \in \mathcal{Y}$ is constant. Then $h(x_n, y_n) \overset{d}{\to} h(x_\infty, a)$.

Corollary A.1. Let $x_n$ and $y_n$ be as in the statement of Theorem A.1. Then
(i) $x_n + y_n \overset{d}{\to} x_\infty + a$;
(ii) $y_n x_n \overset{d}{\to} a x_\infty$;
(iii) $y_n^{-1} x_n \overset{d}{\to} a^{-1} x_\infty$, if $a \neq 0$ (if $\mathcal{Y} = \mathbb{R}$) or $a$ is invertible (if $\mathcal{Y} = \mathbb{R}^{k\times k}$).


Theorem A.2 (LLN and CLT for i.i.d. variates). Suppose $\{w_n\}$ is a sequence of i.i.d. random vectors, taking values in $\mathbb{R}^{d_w}$.
(i) If $E\|w_1\| < \infty$, then
$$\frac{1}{n}\sum_{i=1}^n w_i \overset{p}{\to} Ew_1.$$
(ii) If $E\|w_1\|^2 < \infty$, then
$$\frac{1}{n^{1/2}}\sum_{i=1}^n (w_i - Ew_i) \overset{d}{\to} N[0, V],$$
where $V := E(w_1 - Ew_1)(w_1 - Ew_1)^T$.

A.4 Suggested (optional) further reading

  • Hayashi (2000, Ch. 2) and/or Greene (2008, App. A & D).