SLIDE 1 Lecture 3, Estimation and model validation
Erik Lindström
SLIDE 2
Maximum likelihood, recap
◮ Let x(N) = (x1, . . . , xN) be a sample from some parametric class of models with known density f_{X(N)}(x1, . . . , xN; θ) = L(x(N); θ), where θ ∈ Θ is some unknown parameter vector.
◮ The Maximum Likelihood estimator (MLE) is defined as
θ̂_MLE = arg max_{θ∈Θ} L(x(N); θ). (1)
◮ Taking the logarithm does not change the maximizing argument, so this is equivalently written as
θ̂_MLE = arg max_{θ∈Θ} ℓ(x(N); θ), (2)
with ℓ(θ) = log L(x(N); θ).
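As a small aside (not part of the slides), the definition above translates directly into numerical optimization of the log-likelihood. The sketch below assumes an exponential sample and SciPy; the distribution, sample size and seed are illustrative choices only.

```python
# Minimal MLE sketch (assumed exponential model f(x; theta) = theta * exp(-theta * x)).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)        # x(N): simulated data, true rate 0.5

def negloglik(theta):
    # -l(theta) = -(N log theta - theta * sum(x))
    return -(len(x) * np.log(theta) - theta * x.sum())

res = minimize_scalar(negloglik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / x.mean())                    # numerical MLE vs closed-form MLE
```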
SLIDE 4
◮ The asymptotic distribution for the MLE is given by
√N (θ̂ − θ) →d N(0, I^F_N(θ)^{-1}). (3)
◮ Theorem (Cramér–Rao)
Let T(X1, . . . , XN) be an unbiased estimator of θ. It then holds that
V(T(X(N))) ≥ I^F_N(θ)^{-1} = −(E[∇_θ ∇_θ log L(x(N); θ)])^{-1} (4)
= (E[(∇_θ log L(x(N); θ))²])^{-1}, (5)
and the MLE attains this lower bound asymptotically.
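To make the bound concrete (an illustration, not from the slides): for an exponential sample with rate θ the Fisher information per observation is 1/θ², so the variance of the MLE should approach θ²/N. The simulation below checks this; the model and constants are assumptions of the sketch.

```python
# Monte Carlo check that Var(theta_hat) approaches the Cramér-Rao bound theta^2 / N.
import numpy as np

rng = np.random.default_rng(2)
theta, N, reps = 0.5, 1000, 2000
mles = np.array([1.0 / rng.exponential(scale=1.0 / theta, size=N).mean()
                 for _ in range(reps)])
print(mles.var(), theta**2 / N)                 # empirical variance vs the bound
```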
SLIDE 5
Misspecified models
What happens if the model is wrong? We look at two simple cases:
◮ The model is too simple
◮ The model is too complex
SLIDE 6
Too simple
◮ Assume that the data is given by
Y = [X Z] (θ^T, β^T)^T + ϵ (6)
◮ While the model is given by
Y = Xθ + ϵ. (7)
◮ What happens? Bias!
SLIDE 8
Proof, model is too simple
Estimate is given (in matrix notation) by
θ̂_OLS = (X^T X)^{-1} X^T Y. (8)
Plug the expression for Y into that equation:
θ̂_OLS = (X^T X)^{-1} X^T ([X Z](θ^T, β^T)^T + ϵ) (9)
= (X^T X)^{-1} (X^T X θ + X^T Z β + X^T ϵ) (10)
= θ + bias + noise. (11)
Interpretation of the bias?
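A hedged simulation of the bias term (not from the slides; the design, coefficients and the correlation between X and Z are illustrative assumptions):

```python
# "Too simple" model: fit Y on X only, although Y = X*theta + Z*beta + eps.
# The OLS estimate then picks up (X^T X)^{-1} X^T Z * beta as bias.
import numpy as np

rng = np.random.default_rng(3)
N, theta, beta = 500, 1.0, 2.0
X = rng.normal(size=(N, 1))
Z = 0.8 * X + 0.6 * rng.normal(size=(N, 1))     # Z correlated with X -> nonzero bias
Y = X * theta + Z * beta + rng.normal(size=(N, 1))

theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]           # misspecified fit
bias_term = np.linalg.lstsq(X, Z, rcond=None)[0] * beta     # (X^T X)^{-1} X^T Z * beta
print(theta_hat.ravel(), (theta + bias_term).ravel())       # estimate vs theta + bias
```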
SLIDE 9
Model is too complex
◮ Assume that the data is given by
Y = Xθ + ϵ. (12)
◮ While the model is given by
Y = [X Z] (θ^T, β^T)^T + ϵ (13)
◮ What happens? No bias, but potentially poor
efficiency
SLIDE 10
Proof
Long and tedious (on the blackboard)
◮ Estimates are given by
(θ̂^T, β̂^T)^T = [X^T X, X^T Z; Z^T X, Z^T Z]^{-1} [X^T X θ + X^T ϵ; Z^T X θ + Z^T ϵ] (14)
◮ Use the Woodbury identity
[A, U; V, C]^{-1} = [A^{-1} + A^{-1} U Ω^{-1} V A^{-1}, −A^{-1} U Ω^{-1}; −Ω^{-1} V A^{-1}, Ω^{-1}] (15) with Ω = C − V A^{-1} U
◮ It then follows that θ̂ is unbiased and E[β̂] = 0.
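A small simulation of the efficiency loss (again an illustrative sketch with assumed constants, not course code):

```python
# "Too complex" model: Z is irrelevant but included; theta_hat stays unbiased,
# while its variance grows when Z is collinear with X.
import numpy as np

rng = np.random.default_rng(4)
N, theta, reps = 200, 1.0, 2000
est_small, est_big = [], []
for _ in range(reps):
    X = rng.normal(size=(N, 1))
    Z = 0.9 * X + 0.4 * rng.normal(size=(N, 1))   # strongly collinear extra regressor
    Y = X * theta + rng.normal(size=(N, 1))
    est_small.append(np.linalg.lstsq(X, Y, rcond=None)[0][0, 0])
    est_big.append(np.linalg.lstsq(np.hstack([X, Z]), Y, rcond=None)[0][0, 0])
print(np.mean(est_small), np.mean(est_big))       # both approximately theta = 1
print(np.var(est_small), np.var(est_big))         # clearly larger variance for the big model
```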
SLIDE 11
Examination of the data
Before starting to do any estimation we should carefully look at the dataset.
◮ Is the data correct? Most orders never result in a
trade...
◮ Does the data contain outliers?
◮ Missing values?
◮ Do we have measurements of all relevant explanatory variables?
◮ Timing errors?
SLIDE 12
Model validation
There are two types of validation.
Absolute: Are the model assumptions fulfilled?
Relative: Is the estimated model good enough, compared to some other model?
Both can still be wrong...
SLIDE 15
Absolute tests
We have some external knowledge of the data, e.g. underlying physics (gray-box models).
◮ Looking at whether the estimated parameters
make sense.
◮ Are effects going in the right directions?
◮ Do the parameters have reasonable values?
SLIDE 18
Residuals
The residuals {e} should be i.i.d. Why? This implies:
◮ No auto-dependence:
Cov(f(e_n), g(e_{n+k})) = 0, ∀k ≠ 0, ∀f, g such that E[f(e)²] < ∞, E[g(e)²] < ∞.
◮ No cross-dependence:
Cov(f(e_n), g(u_{n+k})) = 0, ∀k ∈ Z, ∀f, g such that E[f(e)²] < ∞, E[g(u)²] < ∞, where u is some external signal used as explanatory variable.
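In practice these conditions are checked for a few simple choices of f and g. A rough sketch (the residuals e and the external signal u are simulated white noise here, purely for illustration):

```python
# Sample lagged correlations of f(e_n) with g(e_{n+k}) and with g(u_{n+k}).
import numpy as np

def lagged_corr(a, b, k):
    # correlation between a_n and b_{n+k}
    return np.corrcoef(a[:-k], b[k:])[0, 1] if k > 0 else np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(5)
e = rng.normal(size=1000)                 # residuals from some fitted model (simulated here)
u = rng.normal(size=1000)                 # external explanatory signal (simulated here)
for k in range(1, 4):
    print(k, lagged_corr(e, e, k),        # linear auto-dependence
             lagged_corr(e**2, e**2, k),  # dependence in the squares (f = g = x^2)
             lagged_corr(e, u, k))        # cross-dependence with u
```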
SLIDE 19
Normalized prediction errors
Residuals are usually normalized prediction errors
e_n = (y_n − E[Y_n | F_{n−1}]) / √V(Y_n | F_{n−1}).
This can in many cases also be generalized to SDE models.
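For a concrete case (a sketch assuming an AR(1) model with known parameters a and σ, not the general SDE setting): E[Y_n | F_{n−1}] = a y_{n−1} and V(Y_n | F_{n−1}) = σ², so the normalized errors are simply (y_n − a y_{n−1})/σ.

```python
# Normalized one-step prediction errors for an assumed AR(1) model.
import numpy as np

def ar1_normalized_errors(y, a, sigma):
    return (y[1:] - a * y[:-1]) / sigma

rng = np.random.default_rng(6)
a, sigma = 0.7, 1.0
y = np.zeros(1000)
for n in range(1, 1000):
    y[n] = a * y[n - 1] + sigma * rng.normal()   # simulate the AR(1) process
e = ar1_normalized_errors(y, a, sigma)
print(e.mean(), e.std())                          # should be close to 0 and 1
```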
SLIDE 21
Formal tests
◮ Test for dependence in residuals (Box–Ljung):
T = N(N + 2) ∑_{k=1}^{p} γ(k)² / (N − k). Reject if T > χ²_{1−α,p}.
◮ Sign test on residuals: # of positive ∈ Bin(N, 1/2).
◮ Number of changes of sign (Wald–Wolfowitz runs test).
◮ Resimulate the model from residuals. Can it reproduce data?
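A hedged implementation of the Box–Ljung statistic above, reading γ(k) as the sample autocorrelation at lag k (SciPy assumed for the χ² quantile):

```python
# T = N(N+2) * sum_{k=1}^{p} rho(k)^2 / (N - k), compared with a chi2(p) quantile.
import numpy as np
from scipy.stats import chi2

def ljung_box(e, p, alpha=0.05):
    e = e - e.mean()
    N = len(e)
    denom = np.dot(e, e)
    rho = np.array([np.dot(e[:-k], e[k:]) / denom for k in range(1, p + 1)])
    T = N * (N + 2) * np.sum(rho**2 / (N - np.arange(1, p + 1)))
    return T, chi2.ppf(1 - alpha, p)      # reject independence if T exceeds the quantile

rng = np.random.default_rng(7)
print(ljung_box(rng.normal(size=500), p=20))
```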
SLIDE 22
Scatterplots of residuals
◮ e_n vs e_{n−1} (autocorrelation).
◮ e_n vs y_{n|n−1} = E[y_n | F_{n−1}] (the prediction): remaining auto-dependence.
◮ e_n vs u_n: external dependence.
A minimal plotting sketch is given below.
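The sketch below assumes matplotlib is available; the array names e, y_pred and u are placeholders for equal-length residuals, one-step predictions and an external signal from some fitted model.

```python
# Sketch of the three suggested residual scatterplots.
import matplotlib.pyplot as plt

def residual_scatter(e, y_pred, u=None):
    n_panels = 3 if u is not None else 2
    fig, ax = plt.subplots(1, n_panels, figsize=(3 * n_panels, 3))
    ax[0].scatter(e[:-1], e[1:], s=5)          # e_n against e_{n-1}
    ax[0].set_title("e_n vs e_{n-1}")
    ax[1].scatter(y_pred, e, s=5)              # e_n against the one-step prediction
    ax[1].set_title("e_n vs y_{n|n-1}")
    if u is not None:
        ax[2].scatter(u, e, s=5)               # e_n against the external signal
        ax[2].set_title("e_n vs u_n")
    fig.tight_layout()
    plt.show()
```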
SLIDE 23 A good example (a well estimated AR(1) process)
[Figure: e_{n−1} vs e_n, e_n vs y_{n|n−1}, SACF, and normal probability plot of the residuals]
SLIDE 24 An example of wrong order (an AR(2) model estimated with an AR(1) model)
[Figure: e_{n−1} vs e_n, e_n vs y_{n|n−1}, SACF, and normal probability plot of the residuals]
SLIDE 25 An example of wrong model structure (a non-linear model estimated with an AR(1) model)
[Figure: e_{n−1} vs e_n, e_n vs y_{n|n−1}, SACF, and normal probability plot of the residuals]
SLIDE 26 Overfitting
Overfitting gives residuals that look good. Therefore it is important to also test predictions out of sample, as sketched below.
◮ Split data into an estimation and a validation set.
◮ Cross validation
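A hedged sketch of the estimation/validation split for choosing an AR order by least squares (the simulated series and the 70/30 split are illustrative assumptions, not the course's own code):

```python
# Fit AR(p) on an estimation set, compare one-step squared prediction errors
# on a held-out validation set.
import numpy as np

def ar_design(y, p):
    # regressor rows [y_{n-1}, ..., y_{n-p}] with targets y_n
    X = np.column_stack([y[p - k - 1:len(y) - k - 1] for k in range(p)])
    return X, y[p:]

def validation_mse(y, p, split=0.7):
    n_est = int(len(y) * split)
    Xe, ye = ar_design(y[:n_est], p)
    a = np.linalg.lstsq(Xe, ye, rcond=None)[0]   # AR coefficients from the estimation set
    Xv, yv = ar_design(y[n_est:], p)             # out-of-sample one-step errors
    return np.mean((yv - Xv @ a) ** 2)

rng = np.random.default_rng(8)
y = np.zeros(600)
for n in range(1, 600):
    y[n] = 0.6 * y[n - 1] + rng.normal()         # true model is AR(1)
print({p: round(validation_mse(y, p), 3) for p in (1, 2, 5)})
```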
SLIDE 27 Example overfitting (ARMA(1,1) fitted with ARMA(3,3))
[Figure: e_{n−1} vs e_n and SACF, in sample (top row) and out of sample (bottom row)]
SLIDE 28
Relative model validation
Test if a larger model is necessary.
H0: θ′ = θ′_0, H1: θ′ free.
Hypothesis test: Wald, LM or LR.
Wald: I_θ̂ = θ̂ ± λ_{α/2} d(θ̂)
SLIDE 29 LR for Gaussian models
Let Q(n) be the sum of squared residuals for an estimated model with n parameters from N observations.
Test n1 vs n2 parameters; then, for true order n0 ≤ n1 < n2,
i) Q(n2)/σ² ∈ χ²(N − n2),
ii) (Q(n1) − Q(n2))/σ² ∈ χ²(n2 − n1),
iii) Q(n2) and Q(n1) − Q(n2) are independent,
iv) η = (N − n2)/(n2 − n1) · (Q(n1) − Q(n2))/Q(n2) ∈ F(n2 − n1, N − n2).
If η is large pick model 2, else pick model 1. This is an exact test for AR models.
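The statistic in iv) is straightforward to evaluate once the two residual sums of squares are available (a sketch assuming SciPy; Q1, Q2, n1, n2 and N are the quantities defined on the slide):

```python
# eta = (N-n2)/(n2-n1) * (Q(n1)-Q(n2))/Q(n2), compared with an F(n2-n1, N-n2) quantile.
from scipy.stats import f

def nested_f_test(Q1, Q2, n1, n2, N, alpha=0.05):
    eta = (N - n2) / (n2 - n1) * (Q1 - Q2) / Q2
    return eta, f.ppf(1 - alpha, n2 - n1, N - n2)   # pick the larger model if eta is large
```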
SLIDE 30
Asymptotic tests
LR = −2 (log L(θ_Model1) − log L(θ_Model2)).
If model 1 has n1 parameters and model 2 has n2 parameters, n2 > n1, then LR is asymptotically distributed as
LR →d χ²(n2 − n1). (16)
This is true for all models where the likelihood regularity conditions apply (a very large class of distributions) if N is large. This is the most powerful test in the sense of Neyman–Pearson.
Note: Compare apples with apples, cf. AR(p) processes.
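As a sketch (SciPy assumed), the asymptotic test reduces to a few lines once the two maximized log-likelihoods are known:

```python
# LR = -2*(log L(small model) - log L(big model)), asymptotically chi2(n2 - n1) under H0.
from scipy.stats import chi2

def lr_test(loglik_small, loglik_big, n1, n2):
    LR = -2.0 * (loglik_small - loglik_big)
    return LR, chi2.sf(LR, n2 - n1)    # statistic and p-value; a small p favours the big model
```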
SLIDE 32 Information criteria based choices of model order
The main idea is to penalize too many parameters.
◮ AIC (Akaike's Information Criterion):
−2 log(L(θ)) + 2 dim(θ). Often overestimates the model order, but chooses the best predictor.
◮ BIC (Bayesian Information Criterion):
−2 log(L(θ)) + 2 dim(θ) log(N). Finds the correct model order asymptotically, but may overpenalize for small samples.
◮ Alternative LIL (law of iterated logarithm):
−2 log(L(θ)) + 2dim(θ) log(log(N)).
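The three criteria are simple functions of the maximized log-likelihood, the parameter dimension and the sample size; a sketch mirroring the slide's definitions:

```python
# Information criteria as written on the slide (note: many texts define BIC
# without the factor 2 in front of dim(theta)).
import numpy as np

def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, N):
    return -2 * loglik + 2 * k * np.log(N)

def lil(loglik, k, N):
    return -2 * loglik + 2 * k * np.log(np.log(N))
```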
SLIDE 33
Example choice of model AR(3) process
The number of observations is 500, the number of replicates is 200.
crit/order 1 2 3 4 5 6 7 8 9 ≥ 10
AIC 34 13 11 15 6 8 12 101
BIC 78 122
LIL 4 140 29 7 6 7 3 2 2
SLIDE 34