SLIDE 1 Lecture 3, Estimation and model validation
Erik Lindström
SLIDE 2
Maximum likelihood, recap
◮ Let x(N) = (x1, . . . , xN) be a sample from some parametric class of models with known density f_{X(N)}(x1, . . . , xN; θ) = L(x(N); θ), where θ ∈ Θ is some unknown parameter vector.
◮ The Maximum Likelihood estimator (MLE) is defined as
θ̂_MLE = arg max_{θ∈Θ} L(x(N); θ). (1)
◮ Taking the logarithm does not change the maximizing argument, so this is equivalently written as
θ̂_MLE = arg max_{θ∈Θ} ℓ(x(N); θ), (2)
with ℓ(θ) = log L(x(N); θ).
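As a small aside (not part of the slides), the definition above translates directly into numerical optimization of the log-likelihood. The sketch below assumes an exponential sample and SciPy; the distribution, sample size and seed are illustrative choices only.

```python
# Minimal MLE sketch (assumed exponential model f(x; theta) = theta * exp(-theta * x)).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)        # x(N): simulated data, true rate 0.5

def negloglik(theta):
    # -l(theta) = -(N log theta - theta * sum(x))
    return -(len(x) * np.log(theta) - theta * x.sum())

res = minimize_scalar(negloglik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / x.mean())                    # numerical MLE vs closed-form MLE
```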
SLIDE 4
◮ The asymptotic distribution for the MLE is given by
√N (θ̂ − θ) →d N(0, I^F_N(θ)^{-1}). (3)
◮ Theorem (Cramér–Rao)
Let T(X1, . . . , XN) be an unbiased estimator of θ. It then holds that
V(T(X(N))) ≥ I^F_N(θ)^{-1} = −(E[∇_θ ∇_θ log L(x(N); θ)])^{-1} (4)
= (E[(∇_θ log L(x(N); θ))²])^{-1}, (5)
and the MLE attains this lower bound asymptotically.
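To make the bound concrete (an illustration, not from the slides): for an exponential sample with rate θ the Fisher information per observation is 1/θ², so the variance of the MLE should approach θ²/N. The simulation below checks this; the model and constants are assumptions of the sketch.

```python
# Monte Carlo check that Var(theta_hat) approaches the Cramér-Rao bound theta^2 / N.
import numpy as np

rng = np.random.default_rng(2)
theta, N, reps = 0.5, 1000, 2000
mles = np.array([1.0 / rng.exponential(scale=1.0 / theta, size=N).mean()
                 for _ in range(reps)])
print(mles.var(), theta**2 / N)                 # empirical variance vs the bound
```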
SLIDE 5
Misspecified models
What happens if the model is wrong? We look at two simple cases:
◮ The model is too simple
◮ The model is too complex
SLIDE 6
Too simple
◮ Assume that the data is given by
Y = [X Z] (θ^T, β^T)^T + ϵ (6)
◮ While the model is given by
Y = Xθ + ϵ. (7)
◮ What happens? Bias!
SLIDE 8
Proof, model is too simple
Estimate is given (in matrix notation) by
θ̂_OLS = (X^T X)^{-1} X^T Y. (8)
Plug the expression for Y into that equation:
θ̂_OLS = (X^T X)^{-1} X^T ([X Z](θ^T, β^T)^T + ϵ) (9)
= (X^T X)^{-1} (X^T X θ + X^T Z β + X^T ϵ) (10)
= θ + bias + noise. (11)
Interpretation of the bias?
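A hedged simulation of the bias term (not from the slides; the design, coefficients and the correlation between X and Z are illustrative assumptions):

```python
# "Too simple" model: fit Y on X only, although Y = X*theta + Z*beta + eps.
# The OLS estimate then picks up (X^T X)^{-1} X^T Z * beta as bias.
import numpy as np

rng = np.random.default_rng(3)
N, theta, beta = 500, 1.0, 2.0
X = rng.normal(size=(N, 1))
Z = 0.8 * X + 0.6 * rng.normal(size=(N, 1))     # Z correlated with X -> nonzero bias
Y = X * theta + Z * beta + rng.normal(size=(N, 1))

theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]           # misspecified fit
bias_term = np.linalg.lstsq(X, Z, rcond=None)[0] * beta     # (X^T X)^{-1} X^T Z * beta
print(theta_hat.ravel(), (theta + bias_term).ravel())       # estimate vs theta + bias
```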
SLIDE 9
Model is too complex
◮ Assume that the data is given by
Y = Xθ + ϵ. (12)
◮ While the model is given by
Y = [X Z] (θ^T, β^T)^T + ϵ (13)
◮ What happens? No bias, but potentially poor
efficiency
SLIDE 10
Proof
Long and tedious (on the blackboard)
◮ Estimates are given by
(θ̂^T, β̂^T)^T = [X^T X, X^T Z; Z^T X, Z^T Z]^{-1} [X^T X θ + X^T ϵ; Z^T X θ + Z^T ϵ] (14)
◮ Use the Woodbury identity
[A, U; V, C]^{-1} = [A^{-1} + A^{-1} U Ω^{-1} V A^{-1}, −A^{-1} U Ω^{-1}; −Ω^{-1} V A^{-1}, Ω^{-1}] (15) with Ω = C − V A^{-1} U
◮ It then follows that θ̂ is unbiased and E[β̂] = 0.
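A small simulation of the efficiency loss (again an illustrative sketch with assumed constants, not course code):

```python
# "Too complex" model: Z is irrelevant but included; theta_hat stays unbiased,
# while its variance grows when Z is collinear with X.
import numpy as np

rng = np.random.default_rng(4)
N, theta, reps = 200, 1.0, 2000
est_small, est_big = [], []
for _ in range(reps):
    X = rng.normal(size=(N, 1))
    Z = 0.9 * X + 0.4 * rng.normal(size=(N, 1))   # strongly collinear extra regressor
    Y = X * theta + rng.normal(size=(N, 1))
    est_small.append(np.linalg.lstsq(X, Y, rcond=None)[0][0, 0])
    est_big.append(np.linalg.lstsq(np.hstack([X, Z]), Y, rcond=None)[0][0, 0])
print(np.mean(est_small), np.mean(est_big))       # both approximately theta = 1
print(np.var(est_small), np.var(est_big))         # clearly larger variance for the big model
```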
SLIDE 11
Examination of the data
Before starting to do any estimation we should carefully look at the dataset.
◮ Is the data correct? Most orders never result in a
trade...
◮ Does the data contain outliers?
◮ Missing values?
◮ Do we have measurements of all relevant explanatory variables?
◮ Timing errors?
SLIDE 12
Model validation
There are two types of validation.
Absolute: Are the model assumptions fulfilled?
Relative: Is the estimated model good enough, compared to some other model?
Both can still be wrong...
SLIDE 15
Absolute tests
We have some external knowledge of the data, e.g. underlying physics (gray-box models).
◮ Looking at whether the estimated parameters
make sense.
◮ Are effects going in the right directions?
◮ Do the parameters have reasonable values?
SLIDE 18
Residuals
The residuals {e} should be i.i.d. Why? This implies:
◮ No auto-dependence:
Cov(f(e_n), g(e_{n+k})) = 0, ∀k ≠ 0, ∀f, g such that E[f(e)²] < ∞, E[g(e)²] < ∞.
◮ No cross-dependence:
Cov(f(e_n), g(u_{n+k})) = 0, ∀k ∈ Z, ∀f, g such that E[f(e)²] < ∞, E[g(u)²] < ∞, where u is some external signal used as explanatory variable.
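In practice these conditions are checked for a few simple choices of f and g. A rough sketch (the residuals e and the external signal u are simulated white noise here, purely for illustration):

```python
# Sample lagged correlations of f(e_n) with g(e_{n+k}) and with g(u_{n+k}).
import numpy as np

def lagged_corr(a, b, k):
    # correlation between a_n and b_{n+k}
    return np.corrcoef(a[:-k], b[k:])[0, 1] if k > 0 else np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(5)
e = rng.normal(size=1000)                 # residuals from some fitted model (simulated here)
u = rng.normal(size=1000)                 # external explanatory signal (simulated here)
for k in range(1, 4):
    print(k, lagged_corr(e, e, k),        # linear auto-dependence
             lagged_corr(e**2, e**2, k),  # dependence in the squares (f = g = x^2)
             lagged_corr(e, u, k))        # cross-dependence with u
```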
SLIDE 19
Normalized prediction errors
Residuals are usually normalized prediction errors
e_n = (y_n − E[Y_n | F_{n−1}]) / √V(Y_n | F_{n−1}).
This can in many cases also be generalized to SDE models.
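For a concrete case (a sketch assuming an AR(1) model with known parameters a and σ, not the general SDE setting): E[Y_n | F_{n−1}] = a y_{n−1} and V(Y_n | F_{n−1}) = σ², so the normalized errors are simply (y_n − a y_{n−1})/σ.

```python
# Normalized one-step prediction errors for an assumed AR(1) model.
import numpy as np

def ar1_normalized_errors(y, a, sigma):
    return (y[1:] - a * y[:-1]) / sigma

rng = np.random.default_rng(6)
a, sigma = 0.7, 1.0
y = np.zeros(1000)
for n in range(1, 1000):
    y[n] = a * y[n - 1] + sigma * rng.normal()   # simulate the AR(1) process
e = ar1_normalized_errors(y, a, sigma)
print(e.mean(), e.std())                          # should be close to 0 and 1
```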
SLIDE 21
Formal tests
◮ Test for dependence in residuals (Box–Ljung):
T = N(N + 2) ∑_{k=1}^{p} γ(k)² / (N − k). Reject if T > χ²_{1−α,p}.
◮ Sign test on residuals: # of positive ∈ Bin(N, 1/2).
◮ Number of changes of sign (Wald–Wolfowitz runs test).
◮ Resimulate the model from residuals. Can it reproduce data?
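A hedged implementation of the Box–Ljung statistic above, reading γ(k) as the sample autocorrelation at lag k (SciPy assumed for the χ² quantile):

```python
# T = N(N+2) * sum_{k=1}^{p} rho(k)^2 / (N - k), compared with a chi2(p) quantile.
import numpy as np
from scipy.stats import chi2

def ljung_box(e, p, alpha=0.05):
    e = e - e.mean()
    N = len(e)
    denom = np.dot(e, e)
    rho = np.array([np.dot(e[:-k], e[k:]) / denom for k in range(1, p + 1)])
    T = N * (N + 2) * np.sum(rho**2 / (N - np.arange(1, p + 1)))
    return T, chi2.ppf(1 - alpha, p)      # reject independence if T exceeds the quantile

rng = np.random.default_rng(7)
print(ljung_box(rng.normal(size=500), p=20))
```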
SLIDE 22
Scatterplots of residuals
◮ e_n vs e_{n−1} (autocorrelation).
◮ e_n vs y_{n|n−1} = E[y_n | F_{n−1}] (the prediction): remaining auto-dependence.
◮ e_n vs u_n: external dependence.
A minimal plotting sketch is given below.
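The sketch below assumes matplotlib is available; the array names e, y_pred and u are placeholders for equal-length residuals, one-step predictions and an external signal from some fitted model.

```python
# Sketch of the three suggested residual scatterplots.
import matplotlib.pyplot as plt

def residual_scatter(e, y_pred, u=None):
    n_panels = 3 if u is not None else 2
    fig, ax = plt.subplots(1, n_panels, figsize=(3 * n_panels, 3))
    ax[0].scatter(e[:-1], e[1:], s=5)          # e_n against e_{n-1}
    ax[0].set_title("e_n vs e_{n-1}")
    ax[1].scatter(y_pred, e, s=5)              # e_n against the one-step prediction
    ax[1].set_title("e_n vs y_{n|n-1}")
    if u is not None:
        ax[2].scatter(u, e, s=5)               # e_n against the external signal
        ax[2].set_title("e_n vs u_n")
    fig.tight_layout()
    plt.show()
```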
SLIDE 23 A good example (a well estimated AR(1) process)
[Figure: e_{n−1} vs e_n, e_n vs y_{n|n−1}, SACF, and normal probability plot of the residuals]
SLIDE 24 An example of wrong order (an AR(2) model estimated with an AR(1) model)
[Figure: e_{n−1} vs e_n, e_n vs y_{n|n−1}, SACF, and normal probability plot of the residuals]
SLIDE 25 An example of wrong model structure (a non-linear model estimated with an AR(1) model)
[Figure: e_{n−1} vs e_n, e_n vs y_{n|n−1}, SACF, and normal probability plot of the residuals]
SLIDE 26 Overfitting
Overfitting gives residuals that look good. Therefore it is important to also test predictions out of sample, as sketched below.
◮ Split data into an estimation and a validation set.
◮ Cross validation
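A hedged sketch of the estimation/validation split for choosing an AR order by least squares (the simulated series and the 70/30 split are illustrative assumptions, not the course's own code):

```python
# Fit AR(p) on an estimation set, compare one-step squared prediction errors
# on a held-out validation set.
import numpy as np

def ar_design(y, p):
    # regressor rows [y_{n-1}, ..., y_{n-p}] with targets y_n
    X = np.column_stack([y[p - k - 1:len(y) - k - 1] for k in range(p)])
    return X, y[p:]

def validation_mse(y, p, split=0.7):
    n_est = int(len(y) * split)
    Xe, ye = ar_design(y[:n_est], p)
    a = np.linalg.lstsq(Xe, ye, rcond=None)[0]   # AR coefficients from the estimation set
    Xv, yv = ar_design(y[n_est:], p)             # out-of-sample one-step errors
    return np.mean((yv - Xv @ a) ** 2)

rng = np.random.default_rng(8)
y = np.zeros(600)
for n in range(1, 600):
    y[n] = 0.6 * y[n - 1] + rng.normal()         # true model is AR(1)
print({p: round(validation_mse(y, p), 3) for p in (1, 2, 5)})
```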
SLIDE 27 Example overfitting (ARMA(1,1) fitted with ARMA(3,3))
[Figure: e_{n−1} vs e_n and SACF, in sample (top row) and out of sample (bottom row)]
SLIDE 28
Relative model validation
Test if a larger model is necessary.
H0: θ′ = θ′_0, H1: θ′ free.
Hypothesis test: Wald, LM or LR.
Wald: I_θ̂ = θ̂ ± λ_{α/2} d(θ̂)
SLIDE 29 LR for Gaussian models
Let Q(n) be the sum of squared residuals for an estimated model with n parameters from N observations.
Test n1 vs n2 parameters; then, for true order n0 ≤ n1 < n2,
i) Q(n2)/σ² ∈ χ²(N − n2),
ii) (Q(n1) − Q(n2))/σ² ∈ χ²(n2 − n1),
iii) Q(n2) and Q(n1) − Q(n2) are independent,
iv) η = (N − n2)/(n2 − n1) · (Q(n1) − Q(n2))/Q(n2) ∈ F(n2 − n1, N − n2).
If η is large pick model 2, else pick model 1. This is an exact test for AR models.
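The statistic in iv) is straightforward to evaluate once the two residual sums of squares are available (a sketch assuming SciPy; Q1, Q2, n1, n2 and N are the quantities defined on the slide):

```python
# eta = (N-n2)/(n2-n1) * (Q(n1)-Q(n2))/Q(n2), compared with an F(n2-n1, N-n2) quantile.
from scipy.stats import f

def nested_f_test(Q1, Q2, n1, n2, N, alpha=0.05):
    eta = (N - n2) / (n2 - n1) * (Q1 - Q2) / Q2
    return eta, f.ppf(1 - alpha, n2 - n1, N - n2)   # pick the larger model if eta is large
```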
SLIDE 30
Asymptotic tests
LR = −2 (log L(θ_Model1) − log L(θ_Model2)).
If model 1 has n1 parameters and model 2 has n2 parameters, n2 > n1, then LR is asymptotically distributed as
LR →d χ²(n2 − n1). (16)
This is true for all models where the likelihood regularity conditions apply (a very large class of distributions) if N is large. This is the most powerful test in the sense of Neyman–Pearson.
Note: Compare apples with apples, cf. AR(p) processes.
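As a sketch (SciPy assumed), the asymptotic test reduces to a few lines once the two maximized log-likelihoods are known:

```python
# LR = -2*(log L(small model) - log L(big model)), asymptotically chi2(n2 - n1) under H0.
from scipy.stats import chi2

def lr_test(loglik_small, loglik_big, n1, n2):
    LR = -2.0 * (loglik_small - loglik_big)
    return LR, chi2.sf(LR, n2 - n1)    # statistic and p-value; a small p favours the big model
```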
SLIDE 32 Information criteria based choices of model order
The main idea is to penalize too many parameters.
◮ AIC (Akaike's Information Criterion):
−2 log(L(θ)) + 2 dim(θ). Often overestimates the model order, but chooses the best predictor.
◮ BIC (Bayesian Information Criterion):
−2 log(L(θ)) + 2 dim(θ) log(N). Finds the correct model order asymptotically, but may overpenalize for small samples.
◮ Alternative LIL (law of iterated logarithm):
−2 log(L(θ)) + 2dim(θ) log(log(N)).
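The three criteria are simple functions of the maximized log-likelihood, the parameter dimension and the sample size; a sketch mirroring the slide's definitions:

```python
# Information criteria as written on the slide (note: many texts define BIC
# without the factor 2 in front of dim(theta)).
import numpy as np

def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, N):
    return -2 * loglik + 2 * k * np.log(N)

def lil(loglik, k, N):
    return -2 * loglik + 2 * k * np.log(np.log(N))
```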
SLIDE 33
Example choice of model AR(3) process
The number of observations is 500, the number of replicates is 200.
crit/order 1 2 3 4 5 6 7 8 9 ≥ 10
AIC 34 13 11 15 6 8 12 101
BIC 78 122
LIL 4 140 29 7 6 7 3 2 2
SLIDE 34