Non-linear Least Squares and Durbin's Problem: Asymptotic Theory (PowerPoint PPT Presentation)


SLIDE 1

Non-linear Least Squares and Durbin’s Problem

Asymptotic Theory — Part V

James J. Heckman, University of Chicago, Econ 312. This draft: April 18, 2006.

SLIDE 2

This lecture consists of two parts:

  • 1. Non-linear least squares: This looks at non-linear least squares estimation in detail; and

  • 2. Durbin's problem: This examines the correction of asymptotic variances in the case of two-stage estimators.

SLIDE 3

1 Nonlinear Least Squares

In this section, we examine the Non-linear Least Squares (NLLS) estimator in detail. The section is organized as follows:

  • Section 1.1: Recap of the analog-principle motivation for the NLLS estimator (using the extremum principle);
  • Section 1.2: Consistency of the NLLS estimator;
  • Section 1.3: Analogy with the OLS estimator;
  • Section 1.4: Asymptotic normality of the NLLS estimator;
  • Section 1.5: Discussion of asymptotic efficiency;
  • Section 1.6: Estimation (numerical computation) of $\hat{\beta}$.

SLIDE 4

1.1 NLLS estimator as an application of the Extremum principle

Here we recap the derivation of the NLLS estimator as an application of the Extremum principle, from section 3.2 of the notes Asymptotic Theory II, with slight modification in notation. As noted there, we could also motivate NLLS as a moment estimator (refer to section 3.2 of Asymptotic Theory II).

  • 1. The model: We assume that in the population the following model holds:

$$y = E(y \mid x, \beta_0) + \varepsilon = f(x; \beta_0) + \varepsilon \tag{1}$$

so that, for any trial value $\beta$,

$$y - f(x; \beta) = [f(x; \beta_0) - f(x; \beta)] + \varepsilon,$$

where $x$ is a vector of exogenous variables.

SLIDE 5

Unlike in the linear regression model, $\beta$ may not necessarily be of the same dimension as $x$. Since $E(y \mid x)$ is a nonlinear function of $x$ and $\beta$, (1) is called the nonlinear regression model. Assume $(y_i, x_i)$ are i.i.d., with $\varepsilon$ independent of $x$, so that $\varepsilon \perp f(x; \beta)$. Then we can write out a least squares criterion function as below.

  • 2. Criterion function: We choose the criterion function:

$$Q(\beta) = E\left[(y - f(x; \beta))^2\right] = E\left[f(x; \beta_0) - f(x; \beta)\right]^2 + \sigma^2$$

  • Then $Q$ possesses the property that it is minimized at $\beta = \beta_0$ (the true parameter value). If $\beta = \beta_0$ is the only such value, the model is identified (w.r.t. the criterion).

SLIDE 6

  • 3. Analog in sample: Pick

$$Q_n(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i; \beta)\right)^2$$

as the analog to $Q$ in the sample. As established in the OLS case in the notes Asymptotic Theory II (Section 3.2), we can show that $\operatorname{plim} Q_n = Q$.

  • 4. The estimator: We construct the NLLS estimator as:

$$\hat{\beta} = \arg\min_{\beta} Q_n(\beta)$$

Thus we choose $\hat{\beta}$ to minimize $Q_n(\beta)$. In the next few sections, we establish consistency and asymptotic normality for the NLLS estimator (under certain conditions), and discuss conditions for asymptotic efficiency. A minimal numerical sketch of the estimator follows.
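To make the construction concrete, here is a minimal numerical sketch (not from the original notes): it assumes a hypothetical exponential mean function $f(x; \beta) = \beta_0 e^{\beta_1 x}$, simulates data from the population model, and minimizes the sample criterion $Q_n(\beta)$ with a generic optimizer. The model, parameter values, and use of scipy are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical mean function f(x; b) = b0 * exp(b1 * x); an assumption of
# this sketch, not a model from the notes.
def f(x, b):
    return b[0] * np.exp(b[1] * x)

# Simulate a sample from the population model y = f(x; beta0) + eps.
beta0 = np.array([2.0, -0.5])
n = 500
x = rng.uniform(0.0, 5.0, n)
y = f(x, beta0) + rng.normal(0.0, 0.3, n)

# Sample criterion Q_n(beta) = (1/n) * sum_i (y_i - f(x_i; beta))^2.
def Q_n(b):
    r = y - f(x, b)
    return np.mean(r ** 2)

# NLLS estimator: beta_hat = argmin Q_n(beta).
beta_hat = minimize(Q_n, x0=np.array([1.0, 0.0]), method="BFGS").x
print("beta_hat:", beta_hat)  # should be close to beta0
```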

SLIDE 7

1.2 Consistency of the NLLS estimator

Assume:

  • 1. $\varepsilon_i$ i.i.d., $E(\varepsilon_i) = 0$, $E(\varepsilon_i^2) = \sigma^2 < \infty$;

  • 2. $\beta_0$ is a vector of unknown parameters;

  • 3. $\beta_0$ lies in the interior of the parameter space;

  • 4. $\partial f/\partial \beta$ exists and is continuous in a nbd of $\beta_0$;

  • 5. $f(x; \beta)$ is continuous in $\beta$ uniformly in $x$ (i.e., for every $\epsilon > 0$ there exists $\delta > 0$ such that $|f(x; \beta_1) - f(x; \beta_2)| \le \epsilon$ for $\beta_1, \beta_2$ closer than $\delta$ (i.e., $\|\beta_1 - \beta_2\| \le \delta$), for all $\beta_1, \beta_2$ in a nbd of $\beta_0$ and for all $x$);

SLIDE 8

  • 6. $\frac{1}{n}\sum_{i=1}^{n} f(x_i; \beta_1)\, f(x_i; \beta_2)$ converges uniformly in $\beta_1, \beta_2$ in a nbd of $\beta_0$;

  • 7. $\lim \frac{1}{n}\sum_i \left(f(x_i; \beta_0) - f(x_i; \beta)\right)^2 \neq 0$ if $\beta \neq \beta_0$.

Then we have that there exists a unique root $\hat{\beta}$ such that:

$$\hat{\beta} = \arg\min_{\beta} \sum_i \left(y_i - f(x_i \mid \beta)\right)^2,$$

and that it is consistent, i.e., $\hat{\beta} \xrightarrow{p} \beta_0$.

  • Proof: Amemiya p. 129. The proof is an application of the Extremum Analogy Theorem for the class of estimators defined as $\hat{\beta} = \arg\min Q_n(\beta)$.

SLIDE 9

1.3 Analogy with OLS estimator

Gallant (1975): Consider the NLLS model from (1) above:

$$y_i = f(x_i \mid \beta) + \varepsilon_i$$

Now expand $f$ in a nbd of a trial value $\beta^*$ in a Taylor series to get:

$$y_i = f(x_i \mid \beta^*) + \frac{\partial f(x_i \mid \beta)}{\partial \beta'}\bigg|_{\beta^*}\left(\beta - \beta^*\right) + \cdots + \varepsilon_i$$

Rewrite the equation as:

$$y_i - f(x_i \mid \beta^*) + \frac{\partial f(x_i \mid \beta)}{\partial \beta'}\bigg|_{\beta^*}\beta^* = \frac{\partial f(x_i \mid \beta)}{\partial \beta'}\bigg|_{\beta^*}\beta + \varepsilon_i$$

SLIDE 10

Now, by analogy with the classical linear regression model, we have:

  • $y_i - f(x_i \mid \beta^*) + \frac{\partial f(x_i \mid \beta)}{\partial \beta'}\big|_{\beta^*}\beta^*$ is analogous to the dependent variable in OLS.

  • $\frac{\partial f(x_i \mid \beta)}{\partial \beta'}\big|_{\beta^*}$ is analogous to the independent-variables matrix in OLS.

SLIDE 11

The NLLS estimator is:

$$\hat{\beta} = \left[\sum_i \left(\frac{\partial f(x_i \mid \beta)}{\partial \beta}\right)\left(\frac{\partial f(x_i \mid \beta)}{\partial \beta'}\right)\right]^{-1} \times \left(\sum_i \frac{\partial f(x_i \mid \beta)}{\partial \beta}\,\tilde{y}_i\right) \tag{2}$$

so that, in comparison to the OLS estimator, we have:

  • $X'X$ replaced by $\sum_{i=1}^{n}\left(\frac{\partial \tilde{f}_i}{\partial \beta}\right)\left(\frac{\partial \tilde{f}_i}{\partial \beta'}\right)$; and

  • $X'y$ replaced by $\sum_{i=1}^{n}\left(\frac{\partial \tilde{f}_i}{\partial \beta}\right)\tilde{y}_i$,

where $\tilde{f} = (f_1, \dots, f_n)$ and $\tilde{y} = (\tilde{y}_1, \dots, \tilde{y}_n)$, with $\tilde{y}_i$ the pseudo-dependent variable constructed on the previous slide.

SLIDE 12

Then the analogy with OLS goes through exactly. Now, as in the OLS case, we can do hypothesis testing, etc., using derivatives in a nbd of the optimum. Using the analogy, we also obtain the estimator for Asy. var$(\hat{\beta})$ as:

$$\widehat{\operatorname{Asy.var}}(\hat{\beta}) = \hat{\sigma}^2\left(\tilde{F}'\tilde{F}\right)^{-1}, \qquad \tilde{F} = \frac{\partial f(x \mid \beta)}{\partial \beta'}\bigg|_{\hat{\beta}}$$

A small numerical sketch of this variance estimator follows.
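Continuing the earlier illustrative sketch (same hypothetical model and data; the finite-difference Jacobian is an assumption of this sketch, not the notes' construction), $\hat{\sigma}^2(\tilde{F}'\tilde{F})^{-1}$ can be computed as:

```python
import numpy as np

# Assumes x, y, f, beta_hat from the previous sketch are in scope.
def jacobian(x, b, h=1e-6):
    """Finite-difference Jacobian F[i, j] = d f(x_i; b) / d b_j."""
    k = len(b)
    F = np.empty((len(x), k))
    for j in range(k):
        bp, bm = b.copy(), b.copy()
        bp[j] += h
        bm[j] -= h
        F[:, j] = (f(x, bp) - f(x, bm)) / (2 * h)
    return F

F = jacobian(x, beta_hat)
resid = y - f(x, beta_hat)
sigma2_hat = resid @ resid / len(x)          # estimate of sigma^2
avar = sigma2_hat * np.linalg.inv(F.T @ F)   # sigma^2 * (F'F)^{-1}
print("std. errors:", np.sqrt(np.diag(avar)))
```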
SLIDE 13

1.4 Asymptotic normality

To justify large-sample normality, we need additional conditions on the model. The required conditions for asymptotic normality, assuming the conditions for consistency hold, are the following.

  • 1. $\lim \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial f_i}{\partial \beta}\bigg|_{\beta_0}\right)\left(\frac{\partial f_i}{\partial \beta'}\bigg|_{\beta_0}\right) = A$, a positive definite matrix;

  • 2. $\frac{1}{n}\sum_{i=1}^{n}\frac{\partial f_i}{\partial \beta}\,\frac{\partial f_i}{\partial \beta'}$ converges uniformly to a finite matrix in an open nbd of $\beta_0$;

  • 3. $\frac{\partial^2 f}{\partial \beta\,\partial \beta'}$ is continuous in $\beta$ in an open nbd of $\beta_0$, uniformly (i.e., we need uniform continuity of the first and second partials);

SLIDE 14

  • 4. $\lim \frac{1}{n^2}\sum_{i=1}^{n}\left[\frac{\partial^2 f_i}{\partial \beta_j\,\partial \beta_k}\right]^2 = 0$ for all $j, k$ and all $\beta$ in an open nbd of $\beta_0$; and

  • 5. $\frac{1}{n}\sum_{i=1}^{n} f(x_i; \beta_1)\,\frac{\partial^2 f_i}{\partial \beta\,\partial \beta'}\bigg|_{\beta_2}$ converges to a finite matrix uniformly for all $\beta_1, \beta_2$ in an open nbd of $\beta_0$.

Then:

$$\sqrt{n}\left(\hat{\beta} - \beta_0\right) \xrightarrow{d} N\!\left(0,\, \sigma^2 A^{-1}\right)$$

where $\sigma^2 = E(\varepsilon^2)$.

Sketch of proof (for a rigorous proof, see Amemiya, pp. 132-4): The intuition for this result is exactly as in Cramer's Theorem (refer to Section 2 of the notes Asymptotic Theory III).

SLIDE 15

Look at the first-order condition:

$$\frac{\partial Q_n}{\partial \beta} = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - f(x_i; \beta)\right)\frac{\partial f(x_i; \beta)}{\partial \beta}$$

Then, as in Cramer's theorem (Theorem 3 in handout III), we get:

$$\frac{\partial Q_n}{\partial \beta}\bigg|_{\beta_0} = -\frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\,\frac{\partial f(x_i; \beta)}{\partial \beta}\bigg|_{\beta_0}$$

SLIDE 16

This is asymptotically normal (a sum of i.i.d. r.v.'s) by the Lindeberg-Levy Central Limit Theorem. Then, using equation (2), we obtain:

$$\sqrt{n}\left(\hat{\beta} - \beta_0\right) = \left[\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial f(x_i \mid \beta)}{\partial \beta}\right)\left(\frac{\partial f(x_i \mid \beta)}{\partial \beta'}\right)\right]^{-1} \times \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(\frac{\partial f(x_i \mid \beta)}{\partial \beta}\right)\varepsilon_i$$

We get that this is asymptotically normal in a nbd of $\beta_0$ if $\left[\frac{1}{n}\sum_i \left(\frac{\partial f_i}{\partial \beta}\right)\left(\frac{\partial f_i}{\partial \beta'}\right)\right]$ converges uniformly to a non-singular matrix (which is true by assumption). This completes the analogy with Cramer's theorem proved in the earlier lecture. (See Amemiya for a rigorous derivation. Also, see the result in Gallant.)

SLIDE 17

1.5 Asymptotic efficiency of the NLLS estimator

The analogy of the NLLS estimator with MLE is complete if we assume $\varepsilon$ is normal. Then we get the log likelihood function:

$$\ln \mathcal{L} = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_i \left(y_i - f(x_i \mid \beta)\right)^2$$

so that here we get $\hat{\beta}_{\mathrm{MLE}} = \hat{\beta}_{\mathrm{NLLS}}$ (FOC and asy. theory as before).

SLIDE 18

Thus we obtain the general result that, in any nonlinear regression model with $\varepsilon$ normal, though the nonlinear regression is picking another criterion, the estimator is identical to the MLE.

  • The NLLS estimator is efficient in the normal case.

In general, Greene (pp. 305-8) shows that (unless $\varepsilon$ is normal) NLLS is not necessarily asymptotically efficient.

SLIDE 19

1.6 Estimation of $\hat{\beta}$

  • Now, consider the problem of numerical estimation: how do we obtain $\hat{\beta}$? The two commonly used methods are:

  • i. Newton-Raphson; and
  • ii. Gauss-Newton.

SLIDE 20

1.6.1 Newton-Raphson Method

In the NLLS case, we wish to find a solution to the equation $\frac{\partial Q_n(\beta)}{\partial \beta} = 0$. This is true for many criteria outside of NLLS (all the criteria in the Asymptotic Theory handout III). We expand the criterion function $Q_n(\beta)$ in a nbd of an initial starting value $\hat{\beta}_1$, by a second-order (quadratic) Taylor series approximation, to get:

$$Q_n(\beta) \simeq Q_n(\hat{\beta}_1) + \frac{\partial Q_n}{\partial \beta'}\bigg|_{\hat{\beta}_1}\left(\beta - \hat{\beta}_1\right) + \frac{1}{2}\left(\beta - \hat{\beta}_1\right)'\frac{\partial^2 Q_n}{\partial \beta\,\partial \beta'}\bigg|_{\hat{\beta}_1}\left(\beta - \hat{\beta}_1\right) \tag{3}$$

This quadratic problem has a solution if the Hessian matrix $\frac{\partial^2 Q_n}{\partial \beta\,\partial \beta'}$ is a definite matrix (pos. def. for a min).

SLIDE 21

In equation (3), we minimize $Q_n(\beta)$ w.r.t. $\beta$ (by taking the FOC) and obtain the algorithm:

$$\hat{\beta}_2 = \hat{\beta}_1 - \left[\frac{\partial^2 Q_n}{\partial \beta\,\partial \beta'}\bigg|_{\hat{\beta}_1}\right]^{-1}\frac{\partial Q_n}{\partial \beta}\bigg|_{\hat{\beta}_1}$$

We continue the iteration until convergence occurs. The method assumes that we can approximate $Q_n$ with a quadratic. Some of the drawbacks of the method, and possible fixes, are discussed below.

(A) Singular Hessian: There is a problem if the Hessian is singular: the method fails, as we are then unable to obtain $\left[\frac{\partial^2 Q_n}{\partial \beta\,\partial \beta'}\big|_{\hat{\beta}_1}\right]^{-1}$.

SLIDE 22

In case the Hessian is singular, the following correction can be used: use $\gamma$ such that

$$\left[\frac{\partial^2 Q_n}{\partial \beta\,\partial \beta'} - \gamma I\right]$$

is negative definite (for a maximization; take $+\gamma I$ positive definite for a minimization). Usually we pick a scalar $\gamma$ (obviously one can pick vectors). One can then fiddle with this to get out of a nbd of local singularity. In applications of the Newton-Raphson method, one can use an idea due to T.W. Anderson (on the reading list) and note that asymptotically:

$$-E\left(\frac{\partial^2 \ln \mathcal{L}}{\partial \theta\,\partial \theta'}\right) = E\left(\frac{\partial \ln \mathcal{L}}{\partial \theta}\,\frac{\partial \ln \mathcal{L}}{\partial \theta'}\right)$$

to arrive at an alternative estimator for the Hessian (sometimes called BHHH, but the method is due to Anderson). A small sketch of this outer-product approximation follows.
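As an illustration of the information-equality idea (a sketch assuming the Gaussian NLLS likelihood from section 1.5; variable names continue the earlier illustrative sketches):

```python
import numpy as np

# Continues the earlier sketches: assumes x, y, f, jacobian, beta_hat,
# sigma2_hat are in scope. Under the Gaussian NLLS likelihood, the
# per-observation score w.r.t. beta is s_i = (eps_i / sigma^2) * df_i/dbeta.
F = jacobian(x, beta_hat)
eps = y - f(x, beta_hat)
scores = (eps / sigma2_hat)[:, None] * F   # n x k matrix of scores

# Outer-product (BHHH / Anderson) approximation to minus the Hessian:
H_opg = scores.T @ scores
print("OPG approximation to -Hessian:\n", H_opg)
```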

SLIDE 23

(B) Algorithm overshoots: In this case, one can scale the step back by $\lambda$:

$$\hat{\beta}_2 = \hat{\beta}_1 - \lambda\left[\frac{\partial^2 Q_n}{\partial \beta\,\partial \beta'}\bigg|_{\hat{\beta}_1}\right]^{-1}\frac{\partial Q_n}{\partial \beta}\bigg|_{\hat{\beta}_1}$$

We choose $0 < \lambda < 1$ so that the iteration differences get dampened, reducing the chances of overshooting. A sketch of the damped Newton-Raphson iteration is given below.

SLIDE 24

1.6.2 Gauss-Newton Method

The motivation for the Gauss-Newton method mimics exactly the NLLS set-up in section 1.3, where we drew the analogy with OLS. Expanding $f$ in a nbd of some initial starting value $\hat{\beta}_1$, we get:

$$y_i - f(x_i \mid \hat{\beta}_1) + \frac{\partial f(x_i \mid \beta)}{\partial \beta'}\bigg|_{\hat{\beta}_1}\hat{\beta}_1 = \frac{\partial f(x_i \mid \beta)}{\partial \beta'}\bigg|_{\hat{\beta}_1}\hat{\beta}_2 + \varepsilon_i$$

This set-up is analogous to OLS; the LHS and part of the RHS are data once one knows (guesses) the starting value $\hat{\beta}_1$. Then do OLS to get the next iteration in the algorithm:

$$\hat{\beta}_2 = \left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial f_i}{\partial \beta}\bigg|_{\hat{\beta}_1}\frac{\partial f_i}{\partial \beta'}\bigg|_{\hat{\beta}_1}\right]^{-1}\frac{1}{n}\sum_{i=1}^{n}\frac{\partial f_i}{\partial \beta}\bigg|_{\hat{\beta}_1}\left(y_i - f(x_i \mid \hat{\beta}_1) + \frac{\partial f_i}{\partial \beta'}\bigg|_{\hat{\beta}_1}\hat{\beta}_1\right)$$
SLIDE 25

so that we get:

$$\hat{\beta}_2 = \hat{\beta}_1 + \left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial f_i}{\partial \beta}\bigg|_{\hat{\beta}_1}\frac{\partial f_i}{\partial \beta'}\bigg|_{\hat{\beta}_1}\right]^{-1}\frac{1}{n}\sum_{i=1}^{n}\frac{\partial f_i}{\partial \beta}\bigg|_{\hat{\beta}_1}\left[y_i - f(x_i \mid \hat{\beta}_1)\right]$$

Revise, update, start all over again. This method has the same problems as Newton-Raphson.

(A) Singular Hessian: As in the Newton-Raphson method, to solve for the optimum use

$$\left[\frac{\partial f'}{\partial \beta}\frac{\partial f}{\partial \beta'} + \gamma I\right], \quad \gamma \text{ a scalar.}$$

(B) Algorithm overshoots: To avoid overshooting, use the Hartley modification.

SLIDE 26

The Hartley modification: form

$$\Delta_1 = \left[\frac{1}{n}\sum_{i=1}^{n}\frac{\partial f_i}{\partial \beta}\bigg|_{\hat{\beta}_1}\frac{\partial f_i}{\partial \beta'}\bigg|_{\hat{\beta}_1}\right]^{-1}\frac{1}{n}\sum_{i=1}^{n}\frac{\partial f_i}{\partial \beta}\bigg|_{\hat{\beta}_1}\left[y_i - f(x_i \mid \hat{\beta}_1)\right]$$

  • Then choose $0 \le \lambda \le 1$ such that:

$$S(\hat{\beta}_1 + \lambda \Delta_1) \le S(\hat{\beta}_1)$$

where $S(\beta) = \sum_i \left(y_i - f(x_i \mid \beta)\right)^2$. Update by setting $\hat{\beta}_2 = \hat{\beta}_1 + \lambda \Delta_1$. Then the algorithm converges to a root of the FOC equation. Global convergence in general is a mess, unresolved. A sketch of the Gauss-Newton iteration with the Hartley step search follows.
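Here is a minimal sketch of Gauss-Newton with a Hartley-style step search (illustrative; it reuses the hypothetical model and finite-difference jacobian from the earlier sketches, and the step-halving rule for $\lambda$ is one simple way to enforce $S(\hat{\beta}_1 + \lambda\Delta_1) \le S(\hat{\beta}_1)$):

```python
import numpy as np

# Assumes x, y, f, jacobian from the earlier sketches are in scope.
def S(b):
    r = y - f(x, b)
    return r @ r   # sum of squared residuals

def gauss_newton(b, tol=1e-8, max_iter=100):
    for _ in range(max_iter):
        F = jacobian(x, b)
        resid = y - f(x, b)
        delta = np.linalg.solve(F.T @ F, F.T @ resid)  # GN direction
        # Hartley step search: halve lambda until S does not increase.
        lam = 1.0
        while lam > 1e-10 and S(b + lam * delta) > S(b):
            lam *= 0.5
        b = b + lam * delta
        if np.linalg.norm(lam * delta) < tol:
            break
    return b

beta_gn = gauss_newton(np.array([1.0, 0.0]))
print("Gauss-Newton estimate:", beta_gn)
```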

SLIDE 27

1.6.3 Efficiency theorems for estimation methods

Theorem 1. One Newton-Raphson step toward an optimum is fully efficient if you start from an initial consistent estimator.

This theorem suggests a strategy for quick convergence: get a cheap (low computational cost) estimator which is consistent but not efficient, then iterate once, which avoids the computational cost. (True also for Gauss-Newton.) Note that here one must use unmodified Hessians (without corrections for overshooting or singularity). A one-step numerical sketch follows the proof.

Proof. Suppose $\hat{\theta}_1 \xrightarrow{p} \theta_0$ and $\sqrt{n}\left(\hat{\theta}_1 - \theta_0\right) \xrightarrow{d} N(0, \Sigma_0)$. It is consistent but not necessarily efficient. Now expand the root of the likelihood equation in a nbd of $\hat{\theta}_1$:

SLIDE 28

$$\frac{\partial \ln \mathcal{L}}{\partial \theta}\bigg|_{\hat{\theta}_1} = \frac{\partial \ln \mathcal{L}}{\partial \theta}\bigg|_{\theta_0} + \frac{\partial^2 \ln \mathcal{L}}{\partial \theta\,\partial \theta'}\bigg|_{\theta_1^*}\left(\hat{\theta}_1 - \theta_0\right)$$

$\hat{\theta}_1$ does not necessarily set the left-hand side to zero. If it did, we would have an efficient estimator. As before, $\theta_1^*$ is an intermediate value.

SLIDE 29

Now look at the Newton-Raphson criterion:

$$\hat{\theta}_2 - \theta_0 = \left(\hat{\theta}_1 - \theta_0\right) - \left[\frac{\partial^2 \ln \mathcal{L}}{\partial \theta\,\partial \theta'}\bigg|_{\hat{\theta}_1}\right]^{-1}\frac{\partial \ln \mathcal{L}}{\partial \theta}\bigg|_{\hat{\theta}_1}$$

$$= \left(\hat{\theta}_1 - \theta_0\right) - \left[\frac{\partial^2 \ln \mathcal{L}}{\partial \theta\,\partial \theta'}\bigg|_{\hat{\theta}_1}\right]^{-1}\left[\frac{\partial \ln \mathcal{L}}{\partial \theta}\bigg|_{\theta_0} + \frac{\partial^2 \ln \mathcal{L}}{\partial \theta\,\partial \theta'}\bigg|_{\theta_1^*}\left(\hat{\theta}_1 - \theta_0\right)\right]$$

  • Multiplying by $\sqrt{n}$ and collecting terms, we get:

$$\sqrt{n}\left(\hat{\theta}_2 - \theta_0\right) = -\left[\frac{1}{n}\frac{\partial^2 \ln \mathcal{L}}{\partial \theta\,\partial \theta'}\bigg|_{\hat{\theta}_1}\right]^{-1}\frac{1}{\sqrt{n}}\frac{\partial \ln \mathcal{L}}{\partial \theta}\bigg|_{\theta_0} + \left[I - \left[\frac{1}{n}\frac{\partial^2 \ln \mathcal{L}}{\partial \theta\,\partial \theta'}\bigg|_{\hat{\theta}_1}\right]^{-1}\frac{1}{n}\frac{\partial^2 \ln \mathcal{L}}{\partial \theta\,\partial \theta'}\bigg|_{\theta_1^*}\right]\sqrt{n}\left(\hat{\theta}_1 - \theta_0\right)$$

SLIDE 30

$$\sqrt{n}\left(\hat{\theta}_2 - \theta_0\right) = -\left[\frac{1}{n}\frac{\partial^2 \ln \mathcal{L}}{\partial \theta\,\partial \theta'}\bigg|_{\hat{\theta}_1}\right]^{-1}\frac{1}{\sqrt{n}}\frac{\partial \ln \mathcal{L}}{\partial \theta}\bigg|_{\theta_0} + \left[I - [\,\cdot\,]^{-1}[\,\cdot\,]\right]\sqrt{n}\left(\hat{\theta}_1 - \theta_0\right)$$

The bracketed factor converges in probability to zero, and the second term vanishes since $\sqrt{n}\left(\hat{\theta}_1 - \theta_0\right)$ is $O_p(1)$. Therefore, one Newton-Raphson step satisfies the likelihood equation at $\theta_0$ asymptotically.

SLIDE 31

The same result obviously holds for Gauss-Newton: one G-N step from a consistent estimator is fully efficient (or at least as efficient as NLLS). Thus, starting from a consistent estimator (where possible) saves computer time, avoids problems of nonlinear optimization, and also avoids the local-optimization problem (i.e., the possibility of arriving at an inconsistent local optimum). A one-step sketch is given below.
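To illustrate the one-step idea (a sketch under the same hypothetical model as before; using NLLS on a subsample as the cheap consistent starting estimator is an assumption for demonstration only):

```python
import numpy as np
from scipy.optimize import minimize

# Assumes x, y, f, jacobian from the earlier sketches are in scope.
# A crude but consistent starting estimator (illustrative): NLLS on a
# quarter subsample; consistent, but inefficient relative to the full sample.
m = len(x) // 4
b_start = minimize(lambda b: np.mean((y[:m] - f(x[:m], b)) ** 2),
                   x0=np.array([1.0, 0.0]), method="BFGS").x

# One full-sample, unmodified Gauss-Newton step from the consistent start:
F = jacobian(x, b_start)
resid = y - f(x, b_start)
b_onestep = b_start + np.linalg.solve(F.T @ F, F.T @ resid)
print("one-step estimate:", b_onestep)
```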

SLIDE 32

2 "Durbin Problem"

Durbin's problem is the question of arriving at the correct variance-covariance matrix for a set of parameters estimated in the second step of a two-step estimation procedure.

For example, let $\theta_0 = (\bar{\theta}_1, \bar{\theta}_2)$, where $\bar{\theta}_1, \bar{\theta}_2$ are "true values," as in the case of the composite hypothesis considered in the earlier lecture (Asymptotic Theory IV).

SLIDE 33

Suppose we use an initial consistent estimator $\tilde{\theta}_2$ for $\theta_2$. Then, if we treat the likelihood as if $\theta_2$ were known (but it is estimated by $\tilde{\theta}_2$), we have:

$$\underbrace{\frac{1}{\sqrt{n}}\frac{\partial \ln \mathcal{L}}{\partial \theta_1}\bigg|_{\hat{\theta}_1, \tilde{\theta}_2}}_{=0} = \frac{1}{\sqrt{n}}\frac{\partial \ln \mathcal{L}}{\partial \theta_1}\bigg|_{\bar{\theta}} + \frac{1}{n}\frac{\partial^2 \ln \mathcal{L}}{\partial \theta_1\,\partial \theta_1'}\bigg|_{\theta^*}\sqrt{n}\left(\hat{\theta}_1 - \bar{\theta}_1\right) + \underbrace{\frac{1}{n}\frac{\partial^2 \ln \mathcal{L}}{\partial \theta_1\,\partial \theta_2'}\bigg|_{\theta^*}\sqrt{n}\left(\tilde{\theta}_2 - \bar{\theta}_2\right)}_{\text{``Durbin Problem''}}$$

We assume the sample sizes are the same in both samples.

SLIDE 34

which implies, writing $I_{jk} = -\operatorname{plim}\frac{1}{n}\frac{\partial^2 \ln \mathcal{L}}{\partial \theta_j\,\partial \theta_k'}$:

$$\sqrt{n}\left(\hat{\theta}_1 - \bar{\theta}_1\right) = I_{11}^{-1}\,\frac{1}{\sqrt{n}}\frac{\partial \ln \mathcal{L}}{\partial \theta_1}\bigg|_{\bar{\theta}} - I_{11}^{-1} I_{12}\,\sqrt{n}\left(\tilde{\theta}_2 - \bar{\theta}_2\right)$$

where $\tilde{\mathcal{L}}$ below denotes the likelihood from the sample used to produce $\tilde{\theta}_2$.

SLIDE 35

Thus, to obtain the right covariance matrix for $\hat{\theta}_1$, we need the covariance between the two score vectors. We have:

$$\sqrt{n}\left(\tilde{\theta}_2 - \bar{\theta}_2\right) = \left(-\frac{1}{n}\frac{\partial^2 \ln \tilde{\mathcal{L}}}{\partial \theta_2\,\partial \theta_2'}\right)^{-1}\left(\frac{1}{\sqrt{n}}\frac{\partial \ln \tilde{\mathcal{L}}}{\partial \theta_2}\right) = \tilde{I}_{22}^{-1}\left(\frac{1}{\sqrt{n}}\frac{\partial \ln \tilde{\mathcal{L}}}{\partial \theta_2}\right)$$

which implies:

$$\sqrt{n}\left(\hat{\theta}_1 - \bar{\theta}_1\right) = I_{11}^{-1}\,\frac{1}{\sqrt{n}}\frac{\partial \ln \mathcal{L}}{\partial \theta_1}\bigg|_{\bar{\theta}} - I_{11}^{-1} I_{12}\,\tilde{I}_{22}^{-1}\,\frac{1}{\sqrt{n}}\frac{\partial \ln \tilde{\mathcal{L}}}{\partial \theta_2}\bigg|_{\bar{\theta}}$$

SLIDE 36

We need to compute this covariance to get the right standard errors. Just form a new covariance matrix:

$$V\left(\hat{\theta}_1 - \bar{\theta}_1\right) = I_{11}^{-1} + I_{11}^{-1} I_{12}\,\tilde{I}_{22}^{-1} I_{21}\, I_{11}^{-1} - I_{11}^{-1} I_{12}\,\tilde{I}_{22}^{-1}\, C(\hat{g}_1, \tilde{g}_2)'\, I_{11}^{-1} - I_{11}^{-1}\, C(\hat{g}_1, \tilde{g}_2)\,\tilde{I}_{22}^{-1} I_{21}\, I_{11}^{-1} \tag{*}$$

where $C(\hat{g}_1, \tilde{g}_2)$ is the covariance between the two normalized score vectors, and where (now we assume two different sample sizes):

SLIDE 37

$$\hat{g}_1 = \frac{1}{\sqrt{n}}\frac{\partial \ln \mathcal{L}}{\partial \theta_1}\bigg|_{\bar{\theta}}$$

where $n$ is the sample size for the primary sample, and

$$\tilde{g}_2 = \frac{1}{\sqrt{\tilde{n}}}\frac{\partial \ln \tilde{\mathcal{L}}}{\partial \theta_2}\bigg|_{\bar{\theta}_2}$$

where $\tilde{n}$ is the sample size of the sample used to get $\tilde{\theta}_2$. In the independent-samples case, the last two terms in (*) vanish, so that we get:

$$V\left(\hat{\theta}_1 - \bar{\theta}_1\right) = I_{11}^{-1} + I_{11}^{-1} I_{12}\,\tilde{I}_{22}^{-1} I_{21}\, I_{11}^{-1}$$

An arithmetic sketch of this correction is given below.
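As an arithmetic illustration of the independent-samples case of (*) (the information blocks below are made-up numbers, not estimates from any model), the correction simply augments the naive variance $I_{11}^{-1}$:

```python
import numpy as np

# Hypothetical information blocks for a 2-dim theta_1 and 1-dim theta_2.
I11 = np.array([[4.0, 1.0], [1.0, 3.0]])   # info for theta_1 block
I12 = np.array([[0.5], [0.2]])             # cross block (I21 = I12')
I22_tilde = np.array([[2.0]])              # info for theta_2 (first-step sample)

I11_inv = np.linalg.inv(I11)
naive = I11_inv                                   # ignores first-step noise
correction = I11_inv @ I12 @ np.linalg.inv(I22_tilde) @ I12.T @ I11_inv
V = naive + correction                            # independent-samples case
print("naive variance:\n", naive)
print("corrected variance:\n", V)
```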

SLIDE 38

2.1 Concentrated Likelihood Problem

This problem seems similar to, but is actually different from, the Durbin problem. Here we have a log likelihood function which has two sets of parameters, $\ln \mathcal{L}(\theta_1, \theta_2)$. In the first step here, we solve

$$\frac{\partial \ln \mathcal{L}(\theta_1, \theta_2)}{\partial \theta_2} = 0$$

to get $\theta_2(\theta_1)$. We then optimize $\ln \mathcal{L}(\theta_1, \theta_2(\theta_1))$ with respect to $\theta_1$. While this looks like the two-step estimator in Durbin's problem, here we are not using an estimate of $\theta_2$, but rather the function $\theta_2(\theta_1)$. In fact, we can show that this is the same as joint maximization over $(\theta_1, \theta_2)$. A small worked sketch follows.
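As a concrete sketch (assuming the Gaussian NLLS likelihood from section 1.5 and the same hypothetical model as the earlier sketches), one can concentrate $\sigma^2$ out: the inner FOC gives $\sigma^2(\beta) = \frac{1}{n}\sum_i (y_i - f(x_i; \beta))^2$, and maximizing the concentrated likelihood over $\beta$ coincides with joint maximization:

```python
import numpy as np
from scipy.optimize import minimize

# Assumes x, y, f from the first sketch are in scope.
n = len(x)

def neg_loglik_joint(params):
    """Joint negative Gaussian log likelihood in (beta, log sigma^2)."""
    b, log_s2 = params[:2], params[2]
    s2 = np.exp(log_s2)
    r = y - f(x, b)
    return 0.5 * n * np.log(2 * np.pi * s2) + (r @ r) / (2 * s2)

def neg_loglik_conc(b):
    """Concentrated version: sigma^2(beta) = mean squared residual."""
    r = y - f(x, b)
    s2 = (r @ r) / n
    return 0.5 * n * (np.log(2 * np.pi * s2) + 1.0)

joint = minimize(neg_loglik_joint, x0=np.array([1.0, 0.0, 0.0]), method="BFGS")
conc = minimize(neg_loglik_conc, x0=np.array([1.0, 0.0]), method="BFGS")
print("joint beta:", joint.x[:2])
print("concentrated beta:", conc.x)   # should coincide (up to tolerance)
```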

SLIDE 39

Using the Envelope Theorem (i.e., utilizing the fact that $\theta_2(\theta_1)$ is arrived at through an optimization), we get:

$$\frac{d \ln \mathcal{L}(\theta_1, \theta_2(\theta_1))}{d \theta_1} = \frac{\partial \ln \mathcal{L}(\theta_1, \theta_2(\theta_1))}{\partial \theta_1}$$

$$\frac{d^2 \ln \mathcal{L}(\theta_1, \theta_2(\theta_1))}{d \theta_1\, d \theta_1'} = \frac{\partial^2 \ln \mathcal{L}}{\partial \theta_1\,\partial \theta_1'} + \frac{\partial^2 \ln \mathcal{L}}{\partial \theta_1\,\partial \theta_2'}\,\frac{\partial \theta_2(\theta_1)}{\partial \theta_1'}$$

  • Now, differentiating the inner first-order condition $\frac{\partial \ln \mathcal{L}}{\partial \theta_2} = 0$ with respect to $\theta_1$, we also have:

$$\frac{\partial^2 \ln \mathcal{L}}{\partial \theta_2\,\partial \theta_1'} + \frac{\partial^2 \ln \mathcal{L}}{\partial \theta_2\,\partial \theta_2'}\,\frac{\partial \theta_2(\theta_1)}{\partial \theta_1'} = 0 \implies \frac{\partial \theta_2(\theta_1)}{\partial \theta_1'} = -\left(\frac{\partial^2 \ln \mathcal{L}}{\partial \theta_2\,\partial \theta_2'}\right)^{-1}\frac{\partial^2 \ln \mathcal{L}}{\partial \theta_2\,\partial \theta_1'}$$

SLIDE 40

Substituting into the previous expression, we get:

$$\operatorname{plim}\frac{1}{n}\frac{d^2 \ln \mathcal{L}(\theta_1, \theta_2(\theta_1))}{d \theta_1\, d \theta_1'} = -\left(I_{11} - I_{12}\, I_{22}^{-1} I_{21}\right)$$

The asymptotic distribution is the same for $\theta_1$ whether we estimate jointly or through the concentrated-likelihood approach. (Refer to Asymptotic Theory, Lecture IV, section on the composite hypothesis, for the distribution of a sub-vector of parameters when estimation is done jointly.)