STATISTICS 479/503 TIME SERIES ANALYSIS PART II Doug Wiens April 12, 2005
Contents

Part II  Time Domain Analysis
  5  Lecture 5
  6  Lecture 6
  7  Lecture 7
  8  Lecture 8
  9  Lecture 9
  10 Lecture 10
  11 Lecture 11
  12 Lecture 12
  13 Lecture 13
  14 Lecture 14 (Review)
Part II Time Domain Analysis
5. Lecture 5
- Suppose $\{X_t\}$ is (weakly) stationary and non-deterministic ($X_t$ is not a non-random function of $X_{t-1}, X_{t-2}, \ldots$, i.e. it cannot be predicted exactly from the past). Write $\dot X_t = X_t - \mu$. Then (Wold's Representation Theorem) we can represent $\dot X_t$ as
$$\dot X_t = w_t + \psi_1 w_{t-1} + \psi_2 w_{t-2} + \cdots = \sum_{k=0}^{\infty}\psi_k w_{t-k}\quad(\text{with }\psi_0 = 1)\qquad(1)$$
and $\sum_{k=1}^{\infty}\psi_k^2 < \infty$.
Salient feature: a linear function of past and present (not future) disturbances. Interpretation: convergence in mean square, i.e.
$$E\left[\left(\dot X_t - \sum_{k=0}^{K}\psi_k w_{t-k}\right)^2\right] \to 0\ \text{as } K \to \infty.$$
— The conditions ensure that we can take term-by-term expectations of series of the form $\sum_{k=0}^{\infty}\psi_k X_{t-k}$, if the $X_{t-k}$ have expectations:
$$E\left[\sum_{k=0}^{\infty}\psi_k X_{t-k}\right] = \sum_{k=0}^{\infty}\psi_k E\left[X_{t-k}\right].$$
- If (1) holds we say $\{X_t\}$ is a linear process (also called causal in the text, i.e. it doesn't depend on the future). Thus Wold's Representation Theorem can be interpreted as saying that Stationarity ⇒ Linearity.
The converse holds: Assume $\{X_t\}$ is linear; then
(i) $E[X_t] = \mu + E[\dot X_t] = \mu + \sum_{k=0}^{\infty}\psi_k E[w_{t-k}] = \mu$.
(ii)
$$\mathrm{COV}[X_t, X_{t+m}] = E[\dot X_t\dot X_{t+m}] = E\left[\sum_{k=0}^{\infty}\psi_k w_{t-k}\sum_{l=0}^{\infty}\psi_l w_{t+m-l}\right] = \sum_{k=0}^{\infty}\sum_{l=0}^{\infty}\psi_k\psi_l E[w_{t-k}w_{t+m-l}] = \sum_{k=0}^{\infty}\sum_{l=0}^{\infty}\psi_k\psi_l\,\sigma_w^2 I(l = m + k) = \sigma_w^2\sum_{k=0}^{\infty}\psi_k\psi_{k+m}.$$
(In particular, $VAR[X_t] = \sigma_w^2\sum_{k=0}^{\infty}\psi_k^2 < \infty$.)
Thus Stationarity ⇔ Linearity.
- Backshift operator:
$B(X_t) = X_{t-1}$, $B^2(X_t) = B\circ B(X_t) = B(X_{t-1}) = X_{t-2}$, etc. Then $\{X_t\}$ linear ⇒ $\dot X_t = \psi(B)w_t$ for the characteristic polynomial $\psi(B) = 1 + \psi_1 B + \psi_2 B^2 + \cdots$. This is not really a polynomial, but if it is, i.e. $\psi_k = 0$ for $k > q$, we say $\{X_t\}$ is a moving average series of order q, written MA(q). We usually write $\psi_k = -\theta_k$. Then
$$\dot X_t = w_t - \theta_1 w_{t-1} - \theta_2 w_{t-2} - \cdots - \theta_q w_{t-q} = \theta(B)w_t$$
for $\theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q$, the MA(q) characteristic polynomial.
- The above convention, with $\theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q$, is as in ASTSA, and is consistent with an earlier version of the text and many other texts. The current version of the text uses $\dot X_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \cdots + \theta_q w_{t-q}$ and $\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$. To be consistent with ASTSA I'll use the former.
- Invertibility:
$\{X_t\}$ is invertible if it can be represented as $\dot X_t = \phi_1\dot X_{t-1} + \phi_2\dot X_{t-2} + \cdots + w_t$, where $\sum_{k=1}^{\infty}|\phi_k| < \infty$. Thus, apart from some noise, $X_t$ is a function of the past history of the process. Generally, only invertible processes are of practical interest. In terms of the backshift operator,
$$w_t = \dot X_t - \phi_1\dot X_{t-1} - \phi_2\dot X_{t-2} - \cdots = \phi(B)\dot X_t,$$
where $\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots$ is the characteristic polynomial. If it is a true polynomial, i.e. if $\phi_j = 0$ for $j > p$, we say $\{X_t\}$ is an autoregressive process of order p, i.e. AR(p). Then $\dot X_t = \phi_1\dot X_{t-1} + \phi_2\dot X_{t-2} + \cdots + \phi_p\dot X_{t-p} + w_t$.
— When $\sum_{k=1}^{\infty}|\phi_k| < \infty$ we say the series is absolutely summable. The importance of absolute summability is that such series can be rearranged - they can be summed in any order. In contrast, $\sum_{k=1}^{\infty}\frac{(-1)^{k+1}}{k} = \ln 2 \approx .69$, but the series is not absolutely summable: $\sum_{k=1}^{\infty}\frac1k = \infty$. The original series can be rearranged to give just about anything; for instance
$$\left(1 + \frac13 - \frac12\right) + \left(\frac15 + \frac17 - \frac14\right) + \cdots,$$
in which two positive terms are always followed by a negative one, converges to something $> .8$.
- When is a stationary (i.e. linear) process invertible? Let $\{X_t\}$ be linear, so $\dot X_t = \psi(B)w_t$ and $\sum_{k=1}^{\infty}\psi_k^2 < \infty$. Suppose it is invertible. Then $\phi(B)\dot X_t = w_t$; thus $\phi(B)\psi(B)w_t = w_t$ and $\phi(B)\psi(B) = 1$. Thus $\phi(B) = 1/\psi(B)$, i.e. $1/\psi(B)$ has a power series expansion with absolutely summable coefficients. This makes $\psi(B)$ quite special.
— Example: MA(1); $\psi(B) = 1 - \theta B$ for some θ. Then if invertible we must have
$$1/\psi(B) = 1 + \theta B + \theta^2 B^2 + \cdots = \sum_{j=0}^{\infty}\theta^j B^j,\quad\text{AND}\quad 1 + |\theta| + |\theta^2| + \cdots < \infty;$$
this last point holds iff $|\theta| < 1$. Note that the root of $\theta(B) = 0$ is $B = 1/\theta$, and then $|\theta| < 1 \Leftrightarrow |B| > 1$, i.e. the MA(1) process with $\theta(B) = 1 - \theta B$ is invertible iff the root of $\theta(B) = 0$ satisfies $|B| > 1$.
— In general, a linear process $X_t = \psi(B)w_t$ is invertible iff all roots of the characteristic equation $\psi(B) = 0$ satisfy $|B| > 1$ (complex modulus), i.e. they "lie outside the unit circle in the complex plane".
— The modulus of a complex number $z = a + ib$ is $|z| = \sqrt{a^2 + b^2}$ (like the norm of a vector with coordinates (a, b)).
— e.g. $X_t = w_t - 2w_{t-1} + 2w_{t-2}$; $\psi(B) = 1 - 2B + 2B^2 = 0$ for $B = .5 \pm .5i$; $|B| = 1/\sqrt2 \approx .7 < 1$. Non-invertible.
— Similarly, an invertible process is stationary iff all roots of $\phi(B) = 0$ lie outside the unit circle. This is called the stationarity condition. E.g. for an AR(1) the stationarity condition is $|\phi| < 1$.
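— These root conditions are easy to check numerically. The following is a small Python sketch of my own (not part of the text or ASTSA) that uses numpy's root finder to test whether all roots of a characteristic polynomial lie outside the unit circle; it reproduces the non-invertible example above.

```python
import numpy as np

def roots_outside_unit_circle(coefs):
    """coefs = [1, c1, c2, ...] lists the polynomial 1 + c1*B + c2*B^2 + ...
    Returns its roots and whether all have modulus > 1."""
    roots = np.roots(coefs[::-1])      # np.roots wants the highest power first
    return roots, bool(np.all(np.abs(roots) > 1))

# MA(1) with theta = 0.5, i.e. theta(B) = 1 - 0.5B: root B = 2, so invertible
print(roots_outside_unit_circle([1, -0.5]))
# The example psi(B) = 1 - 2B + 2B^2: roots 0.5 +/- 0.5i, |B| < 1, non-invertible
print(roots_outside_unit_circle([1, -2, 2]))
```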
6. Lecture 6
- Review of previous lecture. Assume now for simplicity that the mean is zero:
— Linear process: $X_t = \psi(B)w_t$, $\psi(B) = 1 + \psi_1 B + \psi_2 B^2 + \cdots$ with $\sum_{k=0}^{\infty}\psi_k^2 < \infty$. Then
$$\gamma(m) = \sigma_w^2\sum_{k=0}^{\infty}\psi_k\psi_{k+m}.$$
— Linear + "$\psi_k = 0$ for $k > q$": MA(q) process, $\gamma(m) = 0$ for $m > q$. Characteristic polynomial written as $\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q$.
— Invertible process: $\phi(B)X_t = w_t$, $\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots$ with $\sum_k|\phi_k| < \infty$. Note this is really $\dot X_t$; a non-zero mean can be accommodated as follows:
$$w_t = \phi(B)\dot X_t = \dot X_t - \phi_1\dot X_{t-1} - \cdots = (X_t - \mu) - \phi_1(X_{t-1} - \mu) - \cdots = \{X_t - \phi_1 X_{t-1} - \cdots\} - \mu\{1 - \phi_1 - \phi_2 - \cdots\} = \phi(B)X_t - \alpha,$$
if $\alpha = \mu\phi(1)$.
— Invertible + "$\phi_j = 0$ for $j > p$": AR(p) process.
— Wold's Theorem: Stationary ⇔ Linear.
— A stationary process is invertible iff all roots of $\psi(B) = 0$ lie outside the unit circle. Thus an MA(q) is stationary (linear), but not necessarily invertible.
— An invertible process is stationary iff all roots of $\phi(B) = 0$ lie outside the unit circle. Thus an AR(p) is invertible, but not necessarily stationary.
- Example: MA(2). $X_t = w_t - \theta_1 w_{t-1} - \theta_2 w_{t-2}$, $\theta(B) = 1 - \theta_1 B - \theta_2 B^2$. If $\theta_1^2 + 4\theta_2 < 0$ (so both roots are complex), then invertibility requires $|\theta_2| < 1$. Suppose this is so. To invert: we need $\theta(B)\phi(B) = 1$, where $\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots$, so
$$1 = \left[1 - \theta_1 B - \theta_2 B^2\right]\left[1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_k B^k - \cdots\right] = 1 - (\phi_1 + \theta_1)B - (\phi_2 - \theta_1\phi_1 + \theta_2)B^2 - \cdots - (\phi_k - \phi_{k-1}\theta_1 - \phi_{k-2}\theta_2)B^k - \cdots.$$
Matching coefficients:
$$\phi_1 = -\theta_1,\qquad \phi_2 = \theta_1\phi_1 - \theta_2 = -\theta_1^2 - \theta_2,\qquad \phi_k = \phi_{k-1}\theta_1 + \phi_{k-2}\theta_2,\ k = 3, 4, \ldots.$$
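— The recursion is easy to iterate numerically. A minimal sketch of my own (the values θ1 = 0.5, θ2 = −0.4 are made up, chosen so that θ1² + 4θ2 < 0 and |θ2| < 1):

```python
def ma2_inverse_weights(theta1, theta2, K=10):
    """phi_k in the AR(infinity) representation of an invertible MA(2):
    phi_1 = -theta1, phi_2 = theta1*phi_1 - theta2,
    phi_k = theta1*phi_{k-1} + theta2*phi_{k-2} for k >= 3."""
    phi = [None, -theta1, theta1 * (-theta1) - theta2]   # phi[1], phi[2]
    for k in range(3, K + 1):
        phi.append(theta1 * phi[k - 1] + theta2 * phi[k - 2])
    return phi[1:]

# theta1^2 + 4*theta2 = 0.25 - 1.6 < 0 (complex roots), |theta2| = 0.4 < 1
print(ma2_inverse_weights(0.5, -0.4))
```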
- ARMA models are defined in operator notation by $\phi(B)X_t = \theta(B)w_t$; if $\phi(B)$ is an AR(p) characteristic polynomial and $\theta(B)$ an MA(q), we say $\{X_t\}$ is an ARMA(p,q) process. It is stationary (linear, causal) if $X_t = \psi(B)w_t$ for a series $\psi(z) = \sum\psi_k z^k$, $|z| \le 1$, with square summable coefficients. Then the coefficients $\psi_k$ are determined from $\theta(z)/\phi(z) = \psi(z)$. It can be shown that $\psi(z)$ has the required properties only if all zeros of $\phi(z)$ lie outside the unit circle. Similarly an ARMA(p,q) is invertible only if all zeros of $\theta(z)$ lie outside the unit circle. We also require that the polynomials have no common factors.
- Example: (Example 2.6 in text)
$$X_t = .4X_{t-1} + .45X_{t-2} + w_t + w_{t-1} + .25w_{t-2} \Rightarrow (1 - .4B - .45B^2)X_t = (1 + B + .25B^2)w_t \Rightarrow (1 - .9B)(1 + .5B)X_t = (1 + .5B)(1 + .5B)w_t \Rightarrow (1 - .9B)X_t = (1 + .5B)w_t.$$
Thus the series is both stationary and invertible. It is ARMA(1,1), not ARMA(2,2) as it initially appeared. Students should verify that the above can be continued as
$$X_t = \left[\sum_{j=0}^{\infty}(.9)^j B^j\cdot(1 + .5B)\right]w_t = \left[1 + (.9 + .5)B + \cdots + (.9)^{j-1}(.9 + .5)B^j + \cdots\right]w_t = \psi(B)w_t$$
where $\psi(z) = \sum\psi_k z^k$ and $\psi_0 = 1$, $\psi_k = 1.4(.9)^{k-1}$.
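- One quick way to check such ψ-weights numerically (a sketch of my own, not from the text) is to expand $\theta(z)/\phi(z)$ by matching coefficients, here specialized to the reduced model $(1 - .9B)X_t = (1 + .5B)w_t$:

```python
def arma11_psi_weights(phi, theta, K=8):
    """psi-weights of (1 - phi*B) X_t = (1 + theta*B) w_t:
    psi_0 = 1, psi_1 = phi + theta, psi_k = phi * psi_{k-1} for k >= 2."""
    psi = [1.0, phi + theta]
    for _ in range(2, K + 1):
        psi.append(phi * psi[-1])
    return psi

print(arma11_psi_weights(0.9, 0.5)[1:])              # 1.4, 1.26, 1.134, ...
print([1.4 * 0.9 ** (k - 1) for k in range(1, 9)])   # matches psi_k = 1.4(.9)^(k-1)
```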
- Box-Jenkins methodology:
- 1. Determine the theoretical ACF (and PACF, to be defined) for these and other classes of time series models. Use the sample ACF/PACF to match the data to a possible model (MA(q), AR(p), etc.).
- 2. Estimate parameters using a method appropriate to the chosen model, assess the fit, study the residuals. The notion of residual will require a special treatment; for now think of them as $X_t - \hat X_t$ where, e.g., in an AR(p) model ($X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + w_t$) we have $\hat X_t = \hat\phi_1 X_{t-1} + \hat\phi_2 X_{t-2} + \cdots + \hat\phi_p X_{t-p}$. The residuals should then "look like" white noise (why?). If the fit is inadequate revise steps 1. and 2.
- 3. Finally use the model to forecast.
- We treat these three steps in detail.
Recall that for an MA(q), the autocovariance function is
$$\gamma(m) = \begin{cases}\sigma_w^2\sum_{k=0}^{q-m}\theta_k\theta_{k+m}, & 0 \le m \le q,\\ 0, & m > q,\end{cases}$$
(taking $\theta_0 = -1$, so that $\theta_k\theta_{k+m} = \psi_k\psi_{k+m}$). The salient feature is that $\gamma(m) = 0$ for $m > q$; we look for this in the sample ACF. See Figures 2.1, 2.2.

Figure 2.1. Sample ACF of simulated MA(3) series. Figure 2.2. Sample ACF of simulated MA(3) series.
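The cut-off property is easy to see numerically. Here is a small sketch of my own (the MA(3) coefficients are made up; not the series used for the figures) comparing the theoretical γ(m) above with the sample ACF of a simulated series:

```python
import numpy as np

rng = np.random.default_rng(1)

def ma_acf_theoretical(thetas, max_lag, sigma_w=1.0):
    """ACF of X_t = w_t - theta_1 w_{t-1} - ... - theta_q w_{t-q}."""
    psi = np.r_[1.0, -np.asarray(thetas, float)]            # psi_0 = 1, psi_k = -theta_k
    gamma = [sigma_w**2 * np.sum(psi[:len(psi) - m] * psi[m:]) if m < len(psi) else 0.0
             for m in range(max_lag + 1)]
    return np.array(gamma) / gamma[0]

def simulate_ma(thetas, T=5000):
    q = len(thetas)
    w = rng.normal(size=T + q)
    psi = np.r_[1.0, -np.asarray(thetas, float)]
    return np.convolve(w, psi, mode="valid")                 # applies theta(B) to the noise

def sample_acf(x, max_lag):
    x = x - x.mean()
    d = np.sum(x**2)
    return np.array([np.sum(x[m:] * x[:len(x) - m]) / d for m in range(max_lag + 1)])

thetas = [0.6, -0.4, 0.3]                                    # a hypothetical MA(3)
print(np.round(ma_acf_theoretical(thetas, 6), 2))            # zero beyond lag 3
print(np.round(sample_acf(simulate_ma(thetas), 6), 2))       # close to the above
```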
- ACF of an AR(p) process: Let $j \ge 0$; assume the process is stationary. Then
$$w_t = X_t - \sum_{i=1}^{p}\phi_i X_{t-i} \Rightarrow \mathrm{COV}\left[X_t - \sum_{i=1}^{p}\phi_i X_{t-i},\ X_{t-j}\right] = \mathrm{COV}\left[w_t, X_{t-j}\right] \Rightarrow \gamma(j) - \sum_{i=1}^{p}\phi_i\gamma(j - i) = \mathrm{COV}\left[w_t, X_{t-j}\right].$$
Under the stationarity condition, $X_{t-j}$ is a linear combination $w_{t-j} + \psi_1 w_{t-j-1} + \psi_2 w_{t-j-2} + \cdots$ with
$$\mathrm{COV}\left[w_t, X_{t-j}\right] = \mathrm{COV}\left[w_t,\ w_{t-j} + \psi_1 w_{t-j-1} + \psi_2 w_{t-j-2} + \cdots\right] = \sigma_w^2 I(j = 0),$$
thus
$$\gamma(j) - \sum_{i=1}^{p}\phi_i\gamma(j - i) = \begin{cases}\sigma_w^2, & j = 0,\\ 0, & j > 0.\end{cases}$$
These are the "Yule-Walker" equations, to be solved to obtain $\gamma(j)$ for $j \ge 0$; then $\gamma(-j) = \gamma(j)$.
- Example: AR(1). The Yule-Walker equations are
$$\gamma(j) - \phi\gamma(j - 1) = \begin{cases}\sigma_w^2, & j = 0,\\ 0, & j > 0.\end{cases}$$
We get $\gamma(0) = \phi\gamma(1) + \sigma_w^2$ and $\gamma(j) = \phi\gamma(j - 1)$ for $j > 0$. In particular
$$\gamma(0) = \phi\gamma(1) + \sigma_w^2 = \phi(\phi\gamma(0)) + \sigma_w^2,\quad\text{so}\quad \gamma(0) = \frac{\sigma_w^2}{1 - \phi^2}.$$
Note that $0 < \gamma(0) = VAR[X_t] < \infty$ by the stationarity condition $|\phi| < 1$. Iterating $\gamma(j) = \phi\gamma(j - 1)$ gives $\gamma(j) = \phi^j\gamma(0)$, $j = 1, 2, 3, \ldots$. Thus
$$\rho(j) = \frac{\gamma(j)}{\gamma(0)} = \phi^{|j|}.$$
7. Lecture 7
- Difficult to identify an AR(p) from its ACF.
- Suppose that a series is AR(1), and consider forecasting $X_t$ from two previous values $X_{t-1}, X_{t-2}$: $X_t = \phi X_{t-1} + w_t$, $\hat X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2}$. One suspects that the "best" α's will be $\alpha_1 = \phi$, $\alpha_2 = 0$. This is in fact true, and is a property of the "Partial Autocorrelation Function (PACF)".
- Assume $\mu_X = 0$; consider the problem of minimizing the function
$$f_m(\alpha_{1,m},\ldots,\alpha_{m,m}) = E\left[\{X_t - \alpha_{1,m}X_{t-1} - \cdots - \alpha_{m,m}X_{t-m}\}^2\right],$$
which is the MSE when $X_t$ is forecast by $\hat X_t = \alpha_{1,m}X_{t-1} + \cdots + \alpha_{m,m}X_{t-m}$. Let the minimizers be $\alpha^*_{1,m},\ldots,\alpha^*_{m,m}$. The lag-m PACF value, written $\phi_{mm}$, is defined to be $\alpha^*_{m,m}$.
- It can also be shown that
$$\phi_{mm} = \mathrm{CORR}\left[X_t - \hat X_t,\ X_{t-m} - \hat X_{t-m}\right],$$
where each $\hat X$ denotes the best (i.e. minimum MSE) predictor which is a linear function of $X_{t-1},\ldots,X_{t-m+1}$.
- To compute: Solve the m equations in m unknowns
$$0 = -\frac12\frac{\partial f_m}{\partial\alpha_{j,m}} = E\left[X_{t-j}\cdot\{X_t - \alpha_{1,m}X_{t-1} - \cdots - \alpha_{m,m}X_{t-m}\}\right] = \gamma(j) - \alpha_{1,m}\gamma(j-1) - \cdots - \alpha_{m,m}\gamma(j-m),$$
for $j = 1,\ldots,m$; i.e.
$$\sum_{i=1}^{m}\alpha_{i,m}\gamma(j - i) = \gamma(j).$$
Then
$$m = 1:\ \phi_{11} = \rho(1),\qquad m = 2:\ \phi_{22} = \frac{\rho(2) - \rho^2(1)}{1 - \rho^2(1)},\ \text{etc.}$$
- Note that for an AR(1), $\rho(j) = \phi^j$ and so $\phi_{11} = \phi$, $\phi_{22} = 0$. See Figure 2.3.

Figure 2.3. Sample PACF from simulated AR(1) series.
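- The defining system is straightforward to solve numerically. A sketch of my own (dividing both sides by γ(0), which leaves the solution unchanged, so only ρ is needed):

```python
import numpy as np

def pacf_from_rho(rho, max_m):
    """phi_mm from solving sum_i alpha_{i,m} rho(j-i) = rho(j), j = 1..m,
    and taking the last component; rho(k) must be defined for k = 0..max_m."""
    out = []
    for m in range(1, max_m + 1):
        R = np.array([[rho(abs(j - i)) for i in range(1, m + 1)] for j in range(1, m + 1)])
        r = np.array([rho(j) for j in range(1, m + 1)])
        out.append(np.linalg.solve(R, r)[-1])
    return out

phi = 0.7
print(np.round(pacf_from_rho(lambda k: phi ** k, 4), 3))   # AR(1): [0.7, 0, 0, 0]
```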
- In general, if $\{X_t\}$ is AR(p) and stationary, then $\phi_{pp} = \phi_p$ and $\phi_{mm} = 0$ for $m > p$.
Proof: Write $X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + w_t$, so for $m \ge p$
$$f_m(\alpha_{1,m},\ldots,\alpha_{m,m}) = E\left[\left\{w_t + \sum_{j=1}^{p}\left(\phi_j - \alpha_{j,m}\right)X_{t-j} - \sum_{j=p+1}^{m}\alpha_{j,m}X_{t-j}\right\}^2\right] = E\left[\{w_t + Z\}^2\right],\ \text{say}$$
(where Z is uncorrelated with $w_t$ - why?), $= \sigma_w^2 + E[Z^2]$. This is minimized if $Z = 0$ with probability 1, i.e. if $\alpha_{j,m} = \phi_j$ for $j \le p$ and $= 0$ for $j > p$.
- Forecasting. Given r.v.s $X_t, X_{t-1},\ldots$ (into the infinite past, in principle) we wish to forecast a future value $X_{t+l}$. Let the forecast be $X^t_{t+l}$. We will later show that the "best" forecast is
$$X^t_{t+l} = E\left[X_{t+l}\mid X_t, X_{t-1},\ldots\right],$$
the conditional expected value of $X_{t+l}$ given $\mathbf X^t = \{X_s\}_{s=-\infty}^{t}$. Our general model identification approach, following Box/Jenkins, is then:
- 1. Tentatively identify a model, generally by looking at its sample ACF/PACF.
- 2. Estimate the parameters (generally by the method of Maximum Likelihood, to be covered later). This allows us to estimate the forecasts $X^t_{t+l}$, which depend on unknown parameters, by substituting estimates to obtain $\hat X^t_{t+l}$. We define the residuals by $\hat w_t = X_t - \hat X^{t-1}_t$.
The (adjusted) MLE of $\sigma_w^2$ is typically (in ARMA(p,q) models)
$$\hat\sigma_w^2 = \frac{1}{T - 2p - q}\sum_{t=p+1}^{T}\hat w_t^2.$$
The $T - 2p - q$ in the denominator is the number of residuals used ($T - p$) minus the number of ARMA parameters estimated ($p + q$).
- 3. The residuals should "look like" white noise. We study them, and apply various tests of whiteness. To the extent that they are not white, we look for possible alternate models.
- 4. Iterate; finally use the model to forecast.
8. Lecture 8
- Conditional expectation. Example: Randomly choose a stock from a listing. Y = price in one week, X = price in the previous week. To predict Y, if we have no information about X then the best (minimum mse) constant predictor of Y is E[Y]. (Why? What mathematical problem is being formulated and solved here?) However, suppose we also know that X = x. Then we can improve our forecast, by using the forecast $\hat Y = E[Y|X = x]$, the mean price of all stocks whose price in the previous week was x.
- In general, if (X, Y) are any r.v.s, then $E[Y|X = x]$ is the expected value of Y when the population is restricted to those pairs with X = x. We use the following facts about conditional expectation: (X, Y) independent ⇒ $E[Y|X = x] = E[Y]$; $E[X|X = x] = x$; $E[g(X)|X = x] = g(x)$; and more generally $E[g(X, Y)|X = x] = E[g(x, Y)|X = x]$. In particular, $E[f(X)g(Y)|X = x] = f(x)E[g(Y)|X = x]$.
- Assume from now on that white noise is Normally distributed. The important consequence is that these terms are now independent, not merely uncorrelated, and so
$$E\left[w_s\mid w_t, w_{t-1},\ldots\right] = \begin{cases}w_s, & s \le t,\\ 0, & s > t.\end{cases}$$
- Put $h(x) = E[Y|X = x]$. This is a function of x; when it is evaluated at the r.v. X we call it $h(X) = E[Y|X]$. We have the Double Expectation Theorem:
$$E\{E[Y|X]\} = E[Y].$$
The inner expectation is with respect to the conditional distribution and the outer is with respect to the distribution of X; the theorem can be stated as $E[h(X)] = E[Y]$, where h(x) is as above.
— Example: Y = house values, X = location (neighbourhood) of a house. E[Y] can be obtained by averaging within neighbourhoods, then averaging over neighbourhoods.
- Similarly,
$$E_X\left\{E_{Y|X}\left[g(X, Y)\mid X\right]\right\} = E\left[g(X, Y)\right].$$
- Minimum MSE forecasting. Consider forecasting a r.v. Y (unobservable) using another r.v. X (or set of r.v.s); e.g. $Y = X_{t+l}$, $X = \mathbf X^t$. We seek the function g(X) which minimizes the MSE
$$MSE(g) = E\left[\{Y - g(X)\}^2\right].$$
The required function is $g(X) = E[Y|X]\ (= h(X))$.
- Proof: We have to show that for any function g, $MSE(g) \ge MSE(h)$. Write
$$MSE(g) = E\left[\{(Y - h(X)) + (h(X) - g(X))\}^2\right] = E\left[\{Y - h(X)\}^2\right] + E\left[\{h(X) - g(X)\}^2\right] + 2E\left[(h(X) - g(X))\cdot(Y - h(X))\right].$$
We will show that the last term $= 0$; then we have $MSE(g) = MSE(h) + E\left[\{h(X) - g(X)\}^2\right]$, which exceeds MSE(h); equality iff $g(X) = h(X)$ with probability 1. To establish the claim we evaluate the expected value in stages:
$$E\left[(h(X) - g(X))\cdot(Y - h(X))\right] = E_X\left\{E_{Y|X}\left[(h(X) - g(X))\cdot(Y - h(X))\mid X\right]\right\}.$$
The inner expectation is (why?)
$$(h(X) - g(X))\cdot E_{Y|X}\left[(Y - h(X))\mid X\right] = (h(X) - g(X))\cdot\left\{E_{Y|X}[Y|X] - h(X)\right\} = 0.$$
- The minimum MSE is
$$MSE_{\min} = E\left[\{Y - h(X)\}^2\right] = E_X\left\{E_{Y|X}\left[\{Y - h(X)\}^2\mid X\right]\right\} = E_X\{VAR[Y|X]\}.$$
We will show that
$$VAR[Y] = E_X\{VAR[Y|X]\} + VAR[E\{Y|X\}],\quad\text{i.e.}\quad VAR[Y] = MSE_{\min} + VAR[h(X)];$$
thus $MSE_{\min} \le VAR[Y]$. $VAR[Y]$ is the MSE when Y is forecast by its mean and X is ignored; our result is then that using the information in X never increases MSE, and results in a strict decrease as long as $VAR[h(X)] > 0$, i.e. h(x) is non-constant. Analogous to the within and between breakdown in ANOVA (e.g. variation in house prices within and between neighbourhoods).
- Proof of claim:
$$VAR[Y] = E\left[\{Y - E[Y]\}^2\right] = E\left[\{Y - h(X) + (h(X) - E[Y])\}^2\right] = E\left[\{Y - h(X)\}^2\right] + E\left[\{h(X) - E[Y]\}^2\right] + 2E\left[\{Y - h(X)\}\{h(X) - E[Y]\}\right] = MSE_{\min} + VAR[h(X)] + 2E_X\left\{E_{Y|X}\left[\{Y - h(X)\}\{h(X) - E[Y]\}\mid X\right]\right\}.$$
The inner expectation is $\{h(X) - E[Y]\}\cdot E_{Y|X}\left[Y - h(X)\mid X\right] = 0$.
- Students should verify that $Y - E[Y|X]$ is uncorrelated with X.
- Assume $\{X_t\}$ is stationary and invertible. We forecast $X_{t+l}$ by
$$X^t_{t+l} = E\left[X_{t+l}\mid \mathbf X^t\right],\quad\text{where } \mathbf X^t = \{X_s\}_{s=-\infty}^{t}.$$
Note that this forecast is 'unbiased' in that $E[X^t_{t+l}] = E[X_{t+l}]$. By linearity we have that $X_{t+l}$ can be represented as
$$X_{t+l} = \sum_{k=0}^{\infty}\psi_k w_{t+l-k},\quad(\psi_0 = 1)$$
so that
$$X^t_{t+l} = \sum_{k=0}^{\infty}\psi_k E\left[w_{t+l-k}\mid \mathbf X^t\right].$$
We have $X_t = \psi(B)w_t$ and (by invertibility) $w_t = \phi(B)X_t$ where $\phi(B)\psi(B) = 1$ determines $\phi(B)$. Thus conditioning on $\mathbf X^t$ is equivalent to conditioning on $\mathbf w^t = \{w_s\}_{s=-\infty}^{t}$:
$$X^t_{t+l} = \sum_{k=0}^{\infty}\psi_k E\left[w_{t+l-k}\mid \mathbf w^t\right],\quad\text{where}\quad E\left[w_{t+l-k}\mid \mathbf w^t\right] = \begin{cases}w_{t+l-k}, & \text{if } l \le k,\\ 0, & \text{otherwise.}\end{cases}$$
(Note that $E\left[X_{t+l-k}\mid \mathbf X^t\right] = X_{t+l-k}$ if $l \le k$.)
Thus the forecast is
$$X^t_{t+l} = \sum_{k=l}^{\infty}\psi_k w_{t+l-k},$$
with forecast error and variance
$$X_{t+l} - X^t_{t+l} = \sum_{k=0}^{l-1}\psi_k w_{t+l-k},\qquad VAR\left[X_{t+l} - X^t_{t+l}\right] = \sigma_w^2\sum_{k=0}^{l-1}\psi_k^2.$$
Since $\{w_t\}$ is normal, we have
$$X_{t+l} - X^t_{t+l} \sim N\left(0,\ \sigma_w^2\sum_{k=0}^{l-1}\psi_k^2\right)$$
and so a 100(1 − α)% prediction (forecast) interval on $X_{t+l}$ is
$$X^t_{t+l} \pm z_{\alpha/2}\,\sigma_w\sqrt{\sum_{k=0}^{l-1}\psi_k^2}.$$
Interpretation: the probability that $X_{t+l}$ will lie in this interval is 1 − α.
- In practice we must solve for the $\psi_k$ in terms of the AR and MA parameters of $\{X_t\}$, then substitute estimates of these parameters to obtain estimates $\hat\psi_k$. Substituting these estimates into the expressions above results in the forecast $\hat X^t_{t+l}$; we also must use an estimate $\hat\sigma_w^2$. The residuals, or innovations, are
$$\hat w_t = X_t - \hat X^{t-1}_t$$
and typically $\hat\sigma_w^2 = \sum\hat w_t^2/(\#\text{ of residuals} - \#\text{ of parameters estimated})$.
9. Lecture 9
- Example 1: AR(1) (and stationary). $X_t = \phi X_{t-1} + w_t$.
$$X^t_{t+l} = E\left[X_{t+l}\mid \mathbf X^t\right] = E\left[\phi X_{t+l-1} + w_{t+l}\mid \mathbf X^t\right] = \phi E\left[X_{t+l-1}\mid \mathbf X^t\right] + E\left[w_{t+l}\mid \mathbf w^t\right] = \begin{cases}\phi X_t, & l = 1,\\ \phi X^t_{t+l-1}, & l > 1.\end{cases}$$
Iterating: $X^t_{t+l} = \phi^l X_t$ for $l \ge 1$. The calculation of the forecast was easy (it always is for an AR model); determining the forecast variance requires us to determine the $\psi_k$'s (since $VAR[X_{t+l} - X^t_{t+l}] = \sigma_w^2\sum_{k=0}^{l-1}\psi_k^2$). Usually this is done numerically; in the present case it can be done explicitly: $(1 - \phi B)X_t = w_t$, so
$$X_t = (1 - \phi B)^{-1}w_t = \sum_{k=0}^{\infty}\psi_k w_{t-k},$$
for $\psi_k = \phi^k$. Then
$$\sum_{k=0}^{l-1}\psi_k^2 = \frac{1 - \phi^{2l}}{1 - \phi^2},$$
leading to the forecast interval
$$\phi^l X_t \pm z_{\alpha/2}\,\sigma_w\sqrt{\frac{1 - \phi^{2l}}{1 - \phi^2}}.$$
Numerically we replace φ by its estimate $\hat\phi$; then $\hat X^t_{t+l} = \hat\phi^l X_t$ and the residuals are
$$\hat w_t = X_t - \hat X^{t-1}_t = X_t - \hat\phi X_{t-1}\quad(t > 1).$$
Note the similarity with $w_t = X_t - \phi X_{t-1}$. This illustrates the fact that the residual can also be obtained by writing $w_t$ in terms of the data and parameters, and then replacing the parameters with estimates. The estimate of the variance of the noise is
$$\hat\sigma_w^2 = \sum_{t=2}^{T}\hat w_t^2/(T - 2).$$
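- The AR(1) recipe is easy to put into code; a small sketch of my own (the last observation, $\hat\phi$ and $\hat\sigma_w$ would come from the fitting step) producing l-step forecasts with 95% prediction intervals:

```python
import numpy as np

def ar1_forecast(x_t, phi_hat, sigma_hat, L, z=1.96):
    """l-step AR(1) forecasts phi^l * X_t and 95% prediction intervals, using
    Var[X_{t+l} - X^t_{t+l}] = sigma_w^2 (1 - phi^{2l}) / (1 - phi^2)."""
    out = []
    for l in range(1, L + 1):
        fc = phi_hat ** l * x_t
        se = sigma_hat * np.sqrt((1 - phi_hat ** (2 * l)) / (1 - phi_hat ** 2))
        out.append((fc, fc - z * se, fc + z * se))
    return out

# e.g. last observation 2.0, fitted phi = 0.7, sigma_w = 1.0, 3 steps ahead
for fc, lo, hi in ar1_forecast(2.0, 0.7, 1.0, 3):
    print(round(fc, 3), (round(lo, 2), round(hi, 2)))
```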
- Example 2. AR(p). Similar to Example 1, $X_t = \sum_{i=1}^{p}\phi_i X_{t-i} + w_t$ results in
$$\hat X^t_{t+l} = \sum_{i=1}^{p}\hat\phi_i X^t_{t+l-i},$$
where $X^t_{t+l-i} = X_{t+l-i}$ if $l \le i$. Now solve (numerically) $X_t = (1/\phi(B))w_t = \psi(B)w_t$ to get the $\psi_k$, then the $\hat\psi_k$ and the standard errors of the forecasts. The innovations are obtained from
$$\hat X^{t-1}_t = \sum_{i=1}^{p}\hat\phi_i X^{t-1}_{t-i} = \sum_{i=1}^{p}\hat\phi_i X_{t-i}$$
to get
$$\hat w_t = X_t - \sum_{i=1}^{p}\hat\phi_i X_{t-i},\qquad \hat\sigma_w^2 = \sum_{t=p+1}^{T}\hat w_t^2/(T - 2p).$$
- Example 3. MA(1) (and invertible). $X_t = w_t - \theta w_{t-1} = (1 - \theta B)w_t$ ⇒
$$w_t = \sum_{k=0}^{\infty}\theta^k X_{t-k}.$$
We make the approximation $X_0 = w_0 = 0$, and then
$$w_t = \sum_{k=0}^{t-1}\theta^k X_{t-k}.$$
Now
$$X^t_{t+l} = E\left[w_{t+l} - \theta w_{t+l-1}\mid \mathbf w^t\right] = \begin{cases}-\theta w_t, & l = 1,\\ 0, & l > 1,\end{cases}$$
with $w_t$ obtained from the preceding equation. This gives residuals
$$\hat w_t = \sum_{k=0}^{t-1}\hat\theta^k X_{t-k},\qquad \hat\sigma_w^2 = \sum_{t=1}^{T}\hat w_t^2/(T - 1).$$
Trivially, since $\psi_0 = 1$, $\psi_1 = -\theta$ and $\psi_k = 0$ for $k > 1$, we have
$$\sum_{k=0}^{l-1}\psi_k^2 = \begin{cases}1, & l = 1,\\ 1 + \theta^2, & l > 1.\end{cases}$$
The prediction intervals are
$$\begin{cases}-\hat\theta\hat w_t \pm z_{\alpha/2}\hat\sigma_w, & l = 1,\\ 0 \pm z_{\alpha/2}\hat\sigma_w\sqrt{1 + \hat\theta^2}, & l > 1.\end{cases}$$
- Students should write out the procedure for an invertible MA(q) model.
- Example 4. ARMA(1,1), stationary and invertible. In general, when there is an MA component we make the approximation $X_0 = w_0 = 0$. The model is $(1 - \phi B)X_t = (1 - \theta B)w_t$, i.e. $X_t = \phi X_{t-1} + w_t - \theta w_{t-1}$, leading to
$$X^t_{t+l} = \phi X^t_{t+l-1} + w^t_{t+l} - \theta w^t_{t+l-1} = \begin{cases}\phi X_t - \theta w_t, & l = 1,\\ \phi X^t_{t+l-1}, & l > 1.\end{cases}$$
To obtain a value for $w_t$ we write
$$w_t = (1 - \phi B)(1 - \theta B)^{-1}X_t = (1 - \phi B)\sum_{k=0}^{\infty}\theta^k B^k\cdot X_t = \left(1 + \sum_{k=1}^{\infty}\theta^{k-1}(\theta - \phi)B^k\right)X_t,$$
with approximation
$$w_t = X_t + \sum_{k=1}^{t-1}\theta^{k-1}(\theta - \phi)X_{t-k}.$$
For the forecast variance we reverse φ and θ in the above:
$$X_t = (1 - \theta B)(1 - \phi B)^{-1}w_t = \sum_{k=0}^{\infty}\psi_k w_{t-k}$$
with $\psi_k = \phi^{k-1}(\phi - \theta)$ (and $\psi_0 = 1$); thus
$$VAR\left[X_{t+l} - X^t_{t+l}\right] = \sigma_w^2\sum_{k=0}^{l-1}\psi_k^2 = \sigma_w^2\left[1 + (\phi - \theta)^2\frac{1 - \phi^{2(l-1)}}{1 - \phi^2}\right].$$
Students should verify that obtaining the residuals from $\hat w_t = X_t - \hat X^{t-1}_t$ results in the same expression as substituting estimates into the expression for $w_t$ given above, i.e.
$$\hat w_t = X_t + \sum_{k=1}^{t-1}\hat\theta^{k-1}\left(\hat\theta - \hat\phi\right)X_{t-k}.$$
Then $\hat\sigma_w^2 = \sum_{t=2}^{T}\hat w_t^2/(T - 3)$.
- In general, in an ARMA(p,q) model, the assumption $X_0 = 0$ is necessary if we are to be able to calculate any of the residuals. The resulting expression - of the form $\hat w_t = \sum_{k=0}^{t-1}\hat\alpha_k X_{t-k}$ - could also be used to calculate the first p residuals. So there is a trade-off: we could get numerical but approximate values for $\hat w_1,\ldots,\hat w_p$, or we could just not bother and calculate the remaining residuals more accurately. Here we take the latter approach. This gives
$$\hat\sigma_w^2 = \sum_{t=p+1}^{T}\hat w_t^2/(T - 2p - q).$$
10. Lecture 10
- Estimation. One method is the method of moments, in which we take expressions relating parameters to expected values, replace the expected values by series averages, then solve for the unknown parameters.
— e.g. $E[X_t] = \mu$ becomes $T^{-1}\sum_{t=1}^{T}x_t = \hat\mu$.
— e.g. For an AR(1) model we could replace γ(k) by the sample autocovariance $\hat\gamma(k)$ in the Yule-Walker equations, then solve them as before to get
$$\hat\gamma(0) = \frac{\hat\sigma_w^2}{1 - \hat\phi^2},\qquad \hat\gamma(1) = \hat\phi\hat\gamma(0),$$
yielding $\hat\phi = \hat\rho(1)$, $\hat\sigma_w^2 = \hat\gamma(0)\left(1 - \hat\phi^2\right)$. Recall we previously used the (adjusted Maximum Likelihood) estimate $\hat\sigma_{w,MLE}^2 = \sum_{t=2}^{T}\hat w_t^2/(T - 2)$. Students should show that the difference between these two estimates is of the order (i.e. a multiple of) 1/T; in this sense the two estimates are asymptotically (i.e. as T → ∞) equivalent.
— The same technique applied to the MA(1) model starts with $\rho(1) = -\theta/\left(1 + \theta^2\right)$ ($|\rho(1)| < 1/2$ by invertibility: $|\theta| < 1$); then we solve $\hat\rho(1) = -\hat\theta/\left(1 + \hat\theta^2\right)$. If $|\hat\rho(1)| < 1/2$ there is a real root $\hat\theta$ with $|\hat\theta| < 1$ and we use it. (Otherwise $|\hat\theta| = 1$ and the estimated model is not invertible.) But even when $|\hat\theta| < 1$ the estimate can be quite inefficient (highly varied) relative to the MLE, which we consider next.
- Maximum Likelihood Estimation. We observe $\mathbf x = (x_1,\ldots,x_T)'$; suppose the joint probability density function (pdf) is $f(\mathbf x|\boldsymbol\alpha)$ for a vector $\boldsymbol\alpha = (\alpha_1,\ldots,\alpha_p)'$ of unknown parameters. E.g. if the $X_t$ are independent $N(\mu,\sigma^2)$ the joint pdf is
$$\prod_{t=1}^{T}\left\{\left(2\pi\sigma^2\right)^{-1/2}e^{-\frac{(x_t-\mu)^2}{2\sigma^2}}\right\} = \left(2\pi\sigma^2\right)^{-T/2}e^{-\frac{\sum_{t=1}^{T}(x_t-\mu)^2}{2\sigma^2}}.$$
When evaluated at the numerical data this is a function of α alone, denoted $L(\boldsymbol\alpha|\mathbf x)$ and known as the Likelihood function. The value $\hat{\boldsymbol\alpha}$ which maximizes $L(\boldsymbol\alpha|\mathbf x)$ is known as the Maximum Likelihood Estimator (MLE). Intuitively, the MLE makes the observed data "most likely to have occurred".
— We put $l(\boldsymbol\alpha) = \ln L(\boldsymbol\alpha|\mathbf x)$, the log-likelihood, and typically maximize it (equivalent to maximizing L) by solving the likelihood equations
$$\dot l(\boldsymbol\alpha) = \frac{\partial}{\partial\boldsymbol\alpha}l(\boldsymbol\alpha) = \mathbf 0,\quad\text{i.e.}\quad \frac{\partial}{\partial\alpha_j}l(\boldsymbol\alpha) = 0,\ j = 1,\ldots,p.$$
The vector $\dot l(\boldsymbol\alpha)$ is called the gradient.
— The MLE has attractive large sample properties. With $\boldsymbol\alpha_0$ denoting the true value, we typically have that $\sqrt T(\hat{\boldsymbol\alpha} - \boldsymbol\alpha_0)$ has a limiting (as T → ∞) normal distribution, with mean 0 and covariance matrix $C = I^{-1}(\boldsymbol\alpha_0)$, where $I(\boldsymbol\alpha_0)$ is the information matrix defined below.
— The role of the covariance matrix is that if a random vector $\mathbf Y = (Y_1,\ldots,Y_p)'$ has covariance matrix C, then $\mathrm{COV}[Y_j, Y_k] = C_{jk}$; in particular $VAR[Y_j] = C_{jj}$.
— The information matrix is given by
$$I(\boldsymbol\alpha_0) = \lim_{T\to\infty}\left\{\frac1T E\left[\dot l(\boldsymbol\alpha_0)\dot l(\boldsymbol\alpha_0)'\right]\right\};$$
a more convenient and equivalent form (for the kinds of models we will be working with) is
$$I(\boldsymbol\alpha_0) = \lim_{T\to\infty}\left\{\frac1T E\left[-\ddot l(\boldsymbol\alpha_0)\right]\right\},$$
where $\ddot l(\boldsymbol\alpha)$ is the Hessian matrix with (j,k)th element $\partial^2 l(\boldsymbol\alpha)/\partial\alpha_j\partial\alpha_k$.
— To apply these results we estimate $I(\boldsymbol\alpha_0)$ by $\hat I = I(\hat{\boldsymbol\alpha})$. Denote the (j,k)th element of $\hat I^{-1}$ by $\hat I^{jk}$. Then the normal approximation is that $\sqrt T\left(\hat\alpha_j - \alpha_j\right)$ is asymptotically normally distributed with mean zero and variance estimated by $\hat I^{jj}$, so that
$$\frac{\hat\alpha_j - \alpha_j}{s_j} \approx N(0,1),\quad\text{where } s_j = \sqrt{\frac{\hat I^{jj}}{T}}.$$
Then, e.g., the p-value for the hypothesis $H_0:\alpha_j = 0$ against a two-sided alternative is
$$p = 2P\left(Z > \left|\frac{\hat\alpha_j}{s_j}\right|\right),$$
supplied on the ASTSA printout.
- Example 1. AR(1) with a constant: $X_t = \phi_0 + \phi_1 X_{t-1} + w_t$. As is commonly done for AR models we will carry out an analysis conditional on $X_1$; i.e. we act as if $X_1$ is not random, but is the constant $x_1$. We can then carry out the following steps:
- 1. Write out the pdf of $X_2,\ldots,X_T$.
- 2. From 1., the log-likelihood is $l(\boldsymbol\alpha) = \ln L(\boldsymbol\alpha|\mathbf x)$, where $\boldsymbol\alpha = \left(\phi_0,\phi_1,\sigma_w^2\right)'$.
- 3. Maximize $l(\boldsymbol\alpha)$ to obtain the MLEs $\left(\hat\phi_0,\hat\phi_1,\hat\sigma_w^2\right)$.
- 4. Obtain the information matrix and its estimated inverse.
- Step 1. (Assuming normally distributed white noise.) Transformation of variables. If the pdf of $w_2,\ldots,w_T$ is $g(\mathbf w|\boldsymbol\alpha)$ and we write the w's in terms of the X's: $w_t = X_t - \phi_0 - \phi_1 X_{t-1}$, then the pdf of $X_2,\ldots,X_T$ is
$$f(\mathbf x|\boldsymbol\alpha) = g(\mathbf w|\boldsymbol\alpha)\left|\left(\frac{\partial\mathbf w}{\partial\mathbf x}\right)\right|_+$$
where $\left(\frac{\partial\mathbf w}{\partial\mathbf x}\right)$ is the matrix of partial derivatives, with $\left(\frac{\partial\mathbf w}{\partial\mathbf x}\right)_{jk} = \frac{\partial w_j}{\partial x_k}$, and $|\cdot|_+$ is the absolute value of the determinant. On the right hand side $g(\mathbf w|\boldsymbol\alpha)$ is evaluated by replacing the w's with their expressions in terms of the x's. In this AR(1) example,
$$g(\mathbf w|\boldsymbol\alpha) = \left(2\pi\sigma_w^2\right)^{-(T-1)/2}e^{-\frac{\sum_{t=2}^{T}w_t^2}{2\sigma_w^2}}$$
and
$$\frac{\partial\mathbf w}{\partial\mathbf x} = \begin{pmatrix}1 & & & \\ -\phi_1 & 1 & & \\ & \ddots & \ddots & \\ & & -\phi_1 & 1\end{pmatrix}$$
with determinant $= 1$ (why?). Thus
$$f(\mathbf x|\boldsymbol\alpha) = \left(2\pi\sigma_w^2\right)^{-(T-1)/2}e^{-\frac{\sum_{t=2}^{T}(x_t - \phi_0 - \phi_1 x_{t-1})^2}{2\sigma_w^2}}.$$
This determinant will always $= 1$ if we can write $w_t$ as $X_t$ plus a function of $X_{t-1},\ldots,X_1$. But (i) this can always be done in AR models, and (ii) in models with MA parts to them we assume invertibility + "$X_0 = 0$", so that this can be done there as well. So for all models considered in this course, $|\partial\mathbf w/\partial\mathbf x|_+ = 1$.
- Step 2.
$$l(\boldsymbol\alpha) = \ln\left\{\left(2\pi\sigma_w^2\right)^{-(T-1)/2}e^{-\frac{\sum_{t=2}^{T}(x_t-\phi_0-\phi_1 x_{t-1})^2}{2\sigma_w^2}}\right\} = -\frac{T-1}{2}\ln\sigma_w^2 - \frac{\sum_{t=2}^{T}(x_t - \phi_0 - \phi_1 x_{t-1})^2}{2\sigma_w^2} + \text{const.} = -\frac{T-1}{2}\ln\sigma_w^2 - \frac{S(\phi_0,\phi_1)}{2\sigma_w^2} + \text{const.},\ \text{say}.$$
11. Lecture 11
- Step 3. To maximize l over $\phi_0,\phi_1$ we minimize $S(\phi_0,\phi_1)$. Consider a regression model
$$x_t = \phi_0 + \phi_1 x_{t-1} + \text{error},\quad t = 2,\ldots,T.\qquad(*)$$
In a regression model $y_i = \phi_0 + \phi_1 x_i + \text{error}$, $i = 1,\ldots,n$, the least squares estimates of the slope and intercept minimize $\sum(y_i - \phi_0 - \phi_1 x_i)^2$ and are
$$\hat\phi_0 = \bar y - \hat\phi_1\bar x,\qquad \hat\phi_1 = \frac{\sum(x_i - \bar x)(y_i - \bar y)}{\sum(x_i - \bar x)^2}\ \left(= \frac{\hat\gamma_{YX}(0)}{\hat\gamma_X(0)}\right).$$
Thus the minimizers of S are the LSEs in the regression model (*); they are obtained numerically by doing the regression, and formulas for them can (and should, by the students) be written out. Students should verify that $\hat\phi_1 \approx \hat\rho(1)$, $\hat\phi_0 \approx \bar x\left(1 - \hat\phi_1\right)$. The likelihood equation for $\sigma_w^2$ is
$$0 = \frac{\partial l}{\partial\sigma_w^2} = -\frac{T-1}{2\sigma_w^2} + \frac{S(\hat\phi_0,\hat\phi_1)}{2\sigma_w^4},$$
satisfied by
$$\hat\sigma_w^2 = \frac{S(\hat\phi_0,\hat\phi_1)}{T-1} = \frac{\sum_{t=2}^{T}\left(x_t - \hat\phi_0 - \hat\phi_1 x_{t-1}\right)^2}{T-1} = \frac{\sum_{t=2}^{T}\hat w_t^2}{T-1}.$$
A usual adjustment for bias is to replace $T-1$ by $T-3 = (T-1) - 2 = \#$ of residuals $-$ number of parameters estimated.
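- Carrying out Step 3 numerically amounts to one linear regression. ASTSA does this internally; the following is a minimal numpy sketch of my own for illustration (x is the observed series):

```python
import numpy as np

def ar1_conditional_mle(x):
    """Conditional-on-x_1 MLE of (phi_0, phi_1, sigma_w^2) in
    X_t = phi_0 + phi_1 X_{t-1} + w_t, via least squares of x_t on x_{t-1}, t = 2..T."""
    x = np.asarray(x, float)
    y, lag = x[1:], x[:-1]
    X = np.column_stack([np.ones_like(lag), lag])        # design: intercept, x_{t-1}
    (phi0_hat, phi1_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - phi0_hat - phi1_hat * lag
    sigma2_hat = np.sum(resid ** 2) / (len(resid) - 2)   # bias-adjusted: T-3 = (T-1) - 2
    return phi0_hat, phi1_hat, sigma2_hat
```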
- Step 4. Information matrix:
$$\dot l = \begin{pmatrix}-\frac{1}{2\sigma_w^2}\frac{\partial S}{\partial\phi_0}\\[4pt] -\frac{1}{2\sigma_w^2}\frac{\partial S}{\partial\phi_1}\\[4pt] -\frac{T-1}{2\sigma_w^2} + \frac{S}{2\sigma_w^4}\end{pmatrix},\qquad
-\ddot l = \begin{pmatrix}\frac{1}{2\sigma_w^2}\frac{\partial^2 S}{\partial\phi_0\partial\phi_0} & \frac{1}{2\sigma_w^2}\frac{\partial^2 S}{\partial\phi_1\partial\phi_0} & -\frac{1}{2\sigma_w^4}\frac{\partial S}{\partial\phi_0}\\ * & \frac{1}{2\sigma_w^2}\frac{\partial^2 S}{\partial\phi_1\partial\phi_1} & -\frac{1}{2\sigma_w^4}\frac{\partial S}{\partial\phi_1}\\ * & * & -\frac{T-1}{2\sigma_w^4} + \frac{S}{\sigma_w^6}\end{pmatrix}.$$
Calculate
$$\frac{\partial S}{\partial\phi_0} = -2\sum_{t=2}^{T}(x_t - \phi_0 - \phi_1 x_{t-1}) = -2\sum_{t=2}^{T}w_t,\ \text{with expectation } 0,$$
$$\frac{\partial S}{\partial\phi_1} = -2\sum_{t=2}^{T}x_{t-1}(x_t - \phi_0 - \phi_1 x_{t-1}) = -2\sum_{t=2}^{T}x_{t-1}w_t,\ \text{with expectation (why?)}\ -2\sum_{t=2}^{T}\mathrm{COV}[X_{t-1}, w_t] = 0,$$
$$\frac{\partial^2 S}{\partial\phi_0\partial\phi_0} = 2(T-1),\ \text{with expectation } 2(T-1),\qquad \frac{\partial^2 S}{\partial\phi_1\partial\phi_0} = 2\sum_{t=2}^{T}x_{t-1},\ \text{with expectation } 2(T-1)\mu,$$
$$\frac{\partial^2 S}{\partial\phi_1\partial\phi_1} = 2\sum_{t=2}^{T}x_{t-1}^2,\ \text{with expectation } 2(T-1)\left(\gamma(0) + \mu^2\right).$$
Then using $E[S] = E\left[\sum_{t=2}^{T}w_t^2\right] = (T-1)\sigma_w^2$, we get
$$\frac1T E[-\ddot l] = \frac{T-1}{T}\begin{pmatrix}\frac{1}{\sigma_w^2} & \frac{\mu}{\sigma_w^2} & 0\\ * & \frac{\gamma(0)+\mu^2}{\sigma_w^2} & 0\\ * & * & -\frac{1}{2\sigma_w^4} + \frac{\sigma_w^2}{\sigma_w^6} = \frac{1}{2\sigma_w^4}\end{pmatrix}
\;\xrightarrow{T\to\infty}\;
\frac{1}{\sigma_w^2}\begin{pmatrix}1 & \mu & 0\\ \mu & \gamma(0)+\mu^2 & 0\\ 0 & 0 & \frac{1}{2\sigma_w^2}\end{pmatrix}
= \frac{1}{\sigma_w^2}\begin{pmatrix}A & \mathbf 0\\ \mathbf 0' & 1/(2\sigma_w^2)\end{pmatrix} = I(\boldsymbol\alpha_0).$$
The inverse is
$$I^{-1}(\boldsymbol\alpha_0) = \sigma_w^2\begin{pmatrix}A^{-1} & \mathbf 0\\ \mathbf 0' & 2\sigma_w^2\end{pmatrix},\quad\text{where}\quad A^{-1} = \frac{1}{\gamma(0)}\begin{pmatrix}\gamma(0)+\mu^2 & -\mu\\ -\mu & 1\end{pmatrix}.$$
Thus, e.g., the normal approximation for $\hat\phi_1$ is that
$$\hat\phi_1 - \phi_1 \approx N\left(0,\ \frac{\sigma_w^2}{T\gamma(0)} = \frac{1-\phi_1^2}{T}\right),$$
with standard error given by $s^2(\hat\phi_1) = \frac{1 - \hat\phi_1^2}{T}$ and $\frac{\hat\phi_1 - \phi_1}{s(\hat\phi_1)} \approx N(0,1)$. A 100(1−α)% confidence interval is $\hat\phi_1 \pm z_{\alpha/2}s(\hat\phi_1)$.
- In general, for an AR(p):
$$X_t = \phi_0 + \sum_{i=1}^{p}\phi_i X_{t-i} + w_t$$
we minimize
$$S(\boldsymbol\phi) = \sum_{t=p+1}^{T}\left(x_t - \phi_0 - \sum_{i=1}^{p}\phi_i x_{t-i}\right)^2$$
by fitting a regression model $x_t = \phi_0 + \sum_{i=1}^{p}\phi_i x_{t-i} + \text{error}$ for $t = p+1,\ldots,T$. The resulting LSEs are $\hat{\boldsymbol\phi}$ and the associated mean square of the residuals is
$$\hat\sigma_w^2 = \frac{S(\hat{\boldsymbol\phi})}{T - 2p - 1}.$$
The large-sample standard errors are obtained by ASTSA and appear on the printout. (Time domain > Autoregression or Time domain > ARIMA; then answer the questions that appear. More on this later.)
- Example 2. ARMA(p,q). The model is
$$X_t - \sum_{j=1}^{p}\phi_j X_{t-j} = w_t - \sum_{k=1}^{q}\theta_k w_{t-k}.$$
Assume that $X_t = w_t = 0$ for $t \le 0$ and solve successively for the $w_t$'s in terms of the $X_t$'s:
$$w_t = X_t - \sum_{j=1}^{p}\phi_j X_{t-j} + \sum_{k=1}^{q}\theta_k w_{t-k};\qquad w_1 = X_1,\ w_2 = X_2 - \phi_1 X_1 + \theta_1 w_1,\ \text{etc.}$$
In this way we write $\left(w_{p+1},\ldots,w_T\right)$ in terms of $\left(x_{p+1},\ldots,x_T\right)$. The Jacobian is again $= 1$ and so
$$f(\mathbf x|\boldsymbol\alpha) = \left(2\pi\sigma_w^2\right)^{-(T-p)/2}e^{-\frac{S(\boldsymbol\phi,\boldsymbol\theta)}{2\sigma_w^2}},$$
where $S(\boldsymbol\phi,\boldsymbol\theta) = \sum_{t=p+1}^{T}w_t^2(\boldsymbol\phi,\boldsymbol\theta)$. (It is usual to omit the first p $w_t$'s, since they are based on so little data and on the assumption mentioned above. Equivalently we are conditioning on them.) Now $S(\boldsymbol\phi,\boldsymbol\theta)$ is minimized numerically to obtain the MLEs $\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta}$. The MLE of $\sigma_w^2$ is obtained from the likelihood equation and is $S(\hat{\boldsymbol\phi},\hat{\boldsymbol\theta})/(T - p)$; it is often adjusted for bias to give
$$\hat\sigma_w^2 = \frac{S(\hat{\boldsymbol\phi},\hat{\boldsymbol\theta})}{T - 2p - q}.$$
- The matrix
$$I(\boldsymbol\alpha_0) = \lim_{T\to\infty}\left\{\frac1T E\left[-\ddot l(\boldsymbol\alpha_0)\right]\right\}$$
is sometimes estimated by the "observed information matrix" $\frac1T\left(-\ddot l(\hat{\boldsymbol\alpha})\right)$, evaluated at the data $\{x_t\}$. This is numerically simpler.
- Students should write out the procedure for an
MA(1) model.
- The numerical calculations also rely on a modification of least squares regression. Gauss-Newton algorithm: We are to minimize
$$S(\boldsymbol\psi) = \sum_t w_t^2(\boldsymbol\psi),$$
where ψ is a p-dimensional vector of parameters. The idea is to set up a series of least squares regressions converging to the solution. First choose an initial value $\boldsymbol\psi_0$. (ASTSA will put all components $= .1$ if ".1 guess" is chosen. Otherwise, it will compute method of moments estimates, by equating the theoretical and sample ACF and PACF values. This takes longer.)
- Now expand $w_t(\boldsymbol\psi)$ around $\boldsymbol\psi_0$ by the Mean Value Theorem:
$$w_t(\boldsymbol\psi) \approx w_t(\boldsymbol\psi_0) + \dot w_t'(\boldsymbol\psi_0)(\boldsymbol\psi - \boldsymbol\psi_0)\qquad(1)$$
$$= \text{``}y_t - \mathbf z_t'\boldsymbol\beta\text{''},$$
where $y_t = w_t(\boldsymbol\psi_0)$, $\mathbf z_t = -\dot w_t(\boldsymbol\psi_0)$, $\boldsymbol\beta = \boldsymbol\psi - \boldsymbol\psi_0$. Now
$$S(\boldsymbol\psi) \approx \sum_t\left(y_t - \mathbf z_t'\boldsymbol\beta\right)^2$$
is minimized by regressing $\{y_t\}$ on $\{\mathbf z_t\}$ to get the LSE $\hat{\boldsymbol\beta}_1$. We now set $\boldsymbol\psi_1 = \hat{\boldsymbol\beta}_1 + \boldsymbol\psi_0$, expand around $\boldsymbol\psi_1$ (i.e. replace $\boldsymbol\psi_0$ by $\boldsymbol\psi_1$ in (1)), and obtain a revised estimate $\hat{\boldsymbol\beta}_2$. Iterate to convergence.
12. Lecture 12
- A class of nonstationary models is obtained by taking differences, and assuming that the differenced series is ARMA(p,q):
$$\nabla X_t = X_t - X_{t-1} = (1 - B)X_t,\qquad \nabla^2 X_t = \nabla(\nabla X_t) = (1 - B)^2 X_t,\ \text{etc.}$$
We say $\{X_t\}$ is ARIMA(p,d,q) ("Integrated ARMA") if $\nabla^d X_t$ is ARMA(p,q). If so,
$$\phi(B)(1 - B)^d X_t = \theta(B)w_t$$
for an AR(p) polynomial $\phi(B)$ and an MA(q) polynomial $\theta(B)$. Since $\phi(B)(1 - B)^d$ has roots on the unit circle, $\{X_t\}$ cannot be stationary. The differenced series $\{\nabla^d X_t\}$ is the one we analyze.
- It may happen that the dependence of a series on its past is strongest at multiples of the sampling unit, e.g. monthly economic data may exhibit strong quarterly or annual trends. To model this, define seasonal AR(P) and MA(Q) characteristic polynomials
$$\Phi(B^s) = 1 - \Phi_1 B^s - \Phi_2 B^{2s} - \cdots - \Phi_P B^{Ps},\qquad \Theta(B^s) = 1 - \Theta_1 B^s - \Theta_2 B^{2s} - \cdots - \Theta_Q B^{Qs}.$$
A seasonal ARMA(P,Q) model, with season s, is defined by $\Phi(B^s)X_t = \Theta(B^s)w_t$. This can be combined with the hierarchy of ordinary ARMA models, and with differencing, to give the full ARIMA(p,d,q)×(P,D,Q)$_s$ model defined by
$$\Phi(B^s)\phi(B)(1 - B^s)^D(1 - B)^d X_t = \Theta(B^s)\theta(B)w_t.$$
— Example: the ARIMA(0,1,1)×(0,1,1)$_{12}$ model has $\Phi(B^s) = 1$, $\phi(B) = 1$, $d = D = 1$, $\Theta(B^s) = 1 - \Theta B^{12}$, $\theta(B) = 1 - \theta B$. Thus
$$(1 - B^{12})(1 - B)X_t = \left(1 - \Theta B^{12}\right)(1 - \theta B)w_t.$$
Expanding:
$$X_t = X_{t-1} + X_{t-12} - X_{t-13} + w_t - \theta w_{t-1} - \Theta w_{t-12} + \Theta\theta w_{t-13}.$$
This model often arises with monthly economic data.
— The analysis of the ACF and PACF proceeds along the same lines as for the previous models (see below).
- Choosing an appropriate model. Some guiding properties of the ACF/PACF:
— Nonstationarity: ACF drops off very slowly (a root of the AR characteristic equation with |B| near 1 will do this too); PACF large (in absolute value) at lag 1 (though if it is large only at lag 1, this could instead indicate an AR(1)). Try taking differences $\nabla^d X_t$, d = 1, 2. Rarely is d > 2. Don't be too hasty to take differences; try ARMA models first.
— Seasonal nonstationarity: ACF zero except at lags s, 2s, ..., where it decays slowly, or PACF very large at s. Try $\nabla_s^d X_t = (1 - B^s)^d X_t$.
— AR(p) behaviour: PACF zero for m > p.
— Seasonal AR(P): PACF zero except at m = s, 2s, ..., Ps.
— MA(q): ACF zero for m > q.
— Seasonal MA(Q): ACF zero except at m = s, 2s, ..., Qs.
- Note:
Sometimes one fits an MA(q) model, or an ARMA(p,q) model with q > 0, and finds that most residuals are of the same sign. This is generally a sign that $\mu_X \ne 0$ (recall an MA(q) has mean zero). A remedy is to fit a constant as well (possible in ASTSA only in the AR option) or to subtract the average $\bar x$ from the series before the analysis is carried out.
- Principle of Parsimony:
We seek the simplest model that is adequate. We can always “improve” the fit by throwing in extra terms, but then the model might only fit these data well.
- See Figures 2.4 - 2.11.
Example: U.S. Federal Reserve Board Production Index - an index of economic productivity. Plots of data and ACF/PACF clearly indicate nonstationarity. ACF of $\nabla X_t$ indicates seasonal (s = 12) nonstationarity. ACF and PACF of $\nabla_{12}\nabla X_t$ show possible models ARIMA(p = 1-2, d = 1, q = 1-4) × (P = 2, D = 1, Q = 1)$_{s=12}$. The ARIMA Search facility, using BIC, picks out ARIMA(1,1,0) × (0,1,1)$_{12}$. Note q = P = 0; the other AR and MA features seem to have accounted for the seeming MA and seasonal AR(2) behaviour.
Figure 2.4. U.S. Federal Reserve Board Production Index (frb.asd in ASTSA). Non-stationarity of mean is obvious.
Figure 2.5. ACF decays slowly, PACF spikes at m = 1. Both indicate nonstationarity.
Figure 2.6. $\nabla X_t$; nonstationary variance exhibited.
Figure 2.7. $\nabla X_t$; ACF shows seasonal nonstationarity.
Figure 2.8. $\nabla_{12}\nabla X_t$. Peak in ACF at m = 12 indicates seasonal (s = 12) MA(1). PACF indicates AR(1) or AR(2) and (possibly) seasonal AR(2).
Figure 2.9. Time domain > ARIMA search applied to $X_t$. 3 × 5 × 3 × 3 = 135 possible models to be fitted.
Figure 2.10. ARIMA search with BIC criterion picks out ARIMA(1,1,0)×(0,1,1)$_{12}$. The (default) AICc criterion picks out ARIMA(0,1,4)×(2,1,1)$_{12}$. One could argue for either one.
Figure 2.11. Output for "best" model. Save the residuals for further analysis.
- There are several "information criteria". All seek to minimize the residual variation while imposing penalties for nonparsimonious models. Let K be the number of AR and MA parameters fitted, and let $\hat\sigma_w^2(K)$ be the estimated variance of the residual noise. Then
$$AIC(K) = \ln\hat\sigma_w^2(K) + \frac{2K}{T},\qquad AICc(K) = \ln\hat\sigma_w^2(K) + \frac{T + K}{T - K - 2}\ \text{(small sample modification)},\qquad BIC(K) = \ln\hat\sigma_w^2(K) + \frac{K\ln T}{T},\ \text{etc.}$$
- See Figures 2.12 - 2.15. Residual analysis. Save the residuals. Plot their ACF/PACF; none of the values should be significant (in principle!). These plots were examined but are not included here; all satisfactory. Also plot residuals against $\nabla_{12}\nabla X_{t-1}$; they should be approximately uncorrelated and not exhibit any patterns. (A consequence of the fact that Y − E[Y|X] is uncorrelated with X is that $X_t - X^{t-1}_t$ is uncorrelated with $X_s$ for s < t, i.e. except for the fact that the residuals use estimated parameters, they are uncorrelated with the predictors. This assumes stationarity, so here we use the differenced series.) Click on Graph > Residual tests.
Figure 2.12. Residuals vs. time.
Figure 2.13. Store the residuals $\hat w_t = \{\hat w_{15},\ldots,\hat w_{372}\}$ and the differenced predictors $\nabla_{12}\nabla X_t = \{\nabla_{12}\nabla X_{15},\ldots,\nabla_{12}\nabla X_{372}\}$ (see Transform > Take a subset). Then use Graph > 2D Scatterplot and "detail view" with the $\hat w_t$ as series 1 and $\nabla_{12}\nabla X_t$ as series 2, and "lag = 1" to get this plot. Use "Full view" - or merely double click on this plot - to get plots at a variety of lags, as described in the ASTSA manual.
Figure 2.14. Graphical output from Graph > Residual tests.
Figure 2.15. Printed output from Graph > Residual tests.
- Cumulative spectrum test. We will see later that if the residuals $\{\hat w_t\}$ are white noise, then their "spectrum" f(ν) is constantly equal to the variance, i.e. is 'flat'. Then the integrated, or cumulative, spectrum is linear with slope = variance; here it is divided by $\hat\sigma_w^2/2$ and so should be linear with a slope of 2. Accounting for random variation, it should at least remain in a band around a straight line through the origin, with a slope of 2. The graphical output confirms this and the printed output gives a non-significant p-value to the hypothesis of a flat spectrum.
13. Lecture 13
- Box-Pierce test. Under the hypothesis of whiteness we expect $\hat\rho_w(m)$ to be small in absolute value for all m; a test can be based on
$$Q = T(T + 2)\sum_{m=1}^{M}\frac{\hat\rho_w^2(m)}{T - m},$$
which is approximately $\sim\chi^2_{M-K}$ under the null hypothesis. The p-value is calculated and reported for $M - K = 1, 20$.
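A direct implementation of this statistic (my own sketch; scipy's chi-square survival function supplies the p-value, but any chi-square tail routine would do):

```python
import numpy as np
from scipy.stats import chi2

def box_pierce_q(residuals, M, K):
    """Q = T(T+2) * sum_{m=1}^{M} rho_hat^2(m)/(T-m), compared to chi^2_{M-K}."""
    w = np.asarray(residuals, float) - np.mean(residuals)
    T = len(w)
    denom = np.sum(w ** 2)
    rho = [np.sum(w[m:] * w[:T - m]) / denom for m in range(1, M + 1)]
    Q = T * (T + 2) * sum(r ** 2 / (T - m) for m, r in zip(range(1, M + 1), rho))
    return Q, chi2.sf(Q, df=M - K)
```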
- Fluctuations test (= Runs test). Too few or too many fluctuations (i.e. changes from an upward trend to a downward trend or vice versa) constitute evidence against randomness. E.g. an AR(1) (ρ(1) = φ) with φ near +1 will have few fluctuations; φ near −1 will result in many.
- Normal scores (Q-Q) test. The quantiles of a distribution $F = F_w$ are the values $F^{-1}(q)$, i.e. the values below which w lies with probability q. The sample versions are the order statistics $w_{(1)} < w_{(2)} < \cdots < w_{(T)}$: the probability of a value w falling at or below $w_{(t)}$ is estimated by t/T, so $w_{(t)}$ can be viewed as the (t/T)th sample quantile. If the residuals are normal then a plot of the sample quantiles against the standard normal quantiles should be linear, with intercept equal to the mean and slope equal to the standard deviation (this follows from $F^{-1}(q) = \mu + \sigma\Phi^{-1}(q)$ if F is the $N(\mu,\sigma^2)$ df; students should verify this identity). The strength of the linearity is measured by the correlation between the two sets of quantiles; values too far below 1 lead to rejection of the hypothesis of normality of the white noise.
- The residuals from the more complex model chosen by AICc looked no better.
- A log or square root transformation of the original data sometimes improves normality; logging the data made no difference in this case.
- In contrast, if we fit an ARIMA(0,1,0)×(0,1,1)$_{12}$, then the ACF/PACF of the residuals clearly shows the need for the AR(1) component - see Figures 2.16, 2.17. Note also that the largest residual is now at t = 324 - adding the AR term has explained the drop at this point to such an extent that it is now less of a problem than the smaller drop at t = 128.
Figure 2.16. ACF and PACF of residuals from ARIMA(0,1,0) × (0,1,1)$_{12}$ fit.
Figure 2.17. Residual tests for ARIMA(0,1,0) × (0,1,1)$_{12}$ fit.
Forecasting in the FRB example. The fitted model is
$$(1 - \phi B)\nabla\nabla_{12}X_t = \left(1 - \Theta B^{12}\right)w_t;\qquad(1)$$
this is expanded as
$$X_t = (1 + \phi)X_{t-1} - \phi X_{t-2} + X_{t-12} - (1 + \phi)X_{t-13} + \phi X_{t-14} + w_t - \Theta w_{t-12}.\qquad(2)$$
Conditioning on $\mathbf X^t$ gives
$$X^t_{t+l} = (1 + \phi)X^t_{t+l-1} - \phi X^t_{t+l-2} + X^t_{t+l-12} - (1 + \phi)X^t_{t+l-13} + \phi X^t_{t+l-14} + w^t_{t+l} - \Theta w^t_{t+l-12}.$$
We consider the case $1 \le l \le 12$ only; $l = 13$ is left as an exercise. Note that the model is not stationary but seems to be invertible (since $|\hat\Theta| = .6962 < 1$). Thus $w_t$ can be expressed in terms of $\mathbf X^t$ and, if we assume $w_0 = X_0 = 0$, we can express $X_t$ in terms of $\mathbf w^t$ by iterating (2). Then conditioning on $\mathbf X^t$ is equivalent to conditioning on $\mathbf w^t$. Thus $X^t_{t+l-k} = X_{t+l-k}$ for $k = 12, 13, 14$ and $w^t_{t+l} = 0$:
$$X^t_{t+l} = (1 + \phi)X^t_{t+l-1} - \phi X^t_{t+l-2} + X_{t+l-12} - (1 + \phi)X_{t+l-13} + \phi X_{t+l-14} - \Theta w^t_{t+l-12}.$$
These become
$$l = 1:\quad X^t_{t+1} = (1 + \phi)X_t - \phi X_{t-1} + X_{t-11} - (1 + \phi)X_{t-12} + \phi X_{t-13} - \Theta w_{t-11},\qquad(3)$$
$$l = 2:\quad X^t_{t+2} = (1 + \phi)X^t_{t+1} - \phi X_t + X_{t-10} - (1 + \phi)X_{t-11} + \phi X_{t-12} - \Theta w_{t-10},$$
$$3 \le l \le 12:\quad\text{from previous forecasts and } w_{t-9},\ldots,w_t.$$
To get $w_t,\ldots,w_{t-11}$ we write (1) as
$$w_t = \Theta w_{t-12} + \left\{X_t - \left[(1 + \phi)X_{t-1} - \phi X_{t-2} + X_{t-12} - (1 + \phi)X_{t-13} + \phi X_{t-14}\right]\right\} = \Theta w_{t-12} + f_t,\ \text{say},\qquad(4)$$
and calculate successively, using the assumption $w_0 = X_0 = 0$,
$$w_1 = f_1,\ w_2 = f_2,\ \ldots,\ w_{12} = f_{12},\ w_{13} = \Theta w_1 + f_{13},\ w_{14} = \Theta w_2 + f_{14},\ \text{etc.}$$
To get the residuals $\hat w_t = X_t - \hat X^{t-1}_t$, replace t by t − 1 in (3):
$$X^{t-1}_t = (1 + \phi)X_{t-1} - \phi X_{t-2} + X_{t-12} - (1 + \phi)X_{t-13} + \phi X_{t-14} - \Theta w_{t-12},$$
so $\hat w_t = X_t - \hat X^{t-1}_t$ = (4) with all parameters estimated, and
$$\hat\sigma_w^2 = \sum_{t=15}^{T}\hat w_t^2/(T - 16).$$
(The first 14 residuals employ the assumption $w_0 = X_0 = 0$ and are not used.) The forecast variances and prediction intervals require us to write $X_t = \psi(B)w_t$, and then
$$PI = \hat X^t_{t+l} \pm z_{\alpha/2}\,\hat\sigma_w\sqrt{\sum_{0\le k<l}\hat\psi_k^2}.$$
The model is not stationary and so the coefficients of ψ(B) will not be absolutely summable; however, under the assumption that $w_0 = 0$, only finitely many of the $\psi_k$ are needed. Then in (2),
$$\left\{(1 - \phi B)(1 - B)\left(1 - B^{12}\right)\cdot\left(1 + \psi_1 B + \psi_2 B^2 + \cdots + \psi_k B^k + \cdots\right)\right\}w_t = \left(1 - \Theta B^{12}\right)w_t,$$
so
$$\left(1 - (1 + \phi)B + \phi B^2 - B^{12} + (1 + \phi)B^{13} - \phi B^{14}\right)\cdot\left(1 + \psi_1 B + \psi_2 B^2 + \cdots + \psi_k B^k + \cdots\right) = 1 - \Theta B^{12}.$$
For $1 \le k < 12$ the coefficient of $B^k$ is $= 0$ on the RHS; on the LHS it is $\psi_1 - (1 + \phi)$ ($k = 1$); $\psi_k - (1 + \phi)\psi_{k-1} + \phi\psi_{k-2}$ ($k > 1$). Thus
$$\psi_0 = 1,\quad \psi_1 = 1 + \phi,\quad \psi_k = (1 + \phi)\psi_{k-1} - \phi\psi_{k-2},\ k = 2,\ldots,11.$$
- See Figures 2.18 - 2.20.
Figure 2.18. Time domain > ARIMA, to fit just one model. (See also Time domain > Autoregression.)
Figure 2.19. Forecasts up to 12 steps ahead.
Figure 2.20. Graphical output, including forecasts.
14. Lecture 14 (Review)

Recall the SOI and Recruits series. We earlier attempted to predict Recruits from SOI by regressing Recruits on lagged SOI (m = 3, ..., 12) - 11 parameters, including the intercept. See Figure 2.21.

Figure 2.21. Recruits - data and predictions from a regression on SOI lagged by m = 3, ..., 12 months. R² = 67%.
127 Let’s look for a better, and perhaps more parsimo- nious, model. Put Xt = ∇12SOIt, Yt = ∇12Recruitst (both re-indexed: t = 1, ..., 441) - this is to remove an annual trend, and to make this example more in- teresting. Both look stationary. (What do we look for in order to make this statement?). Consider the CCF, with Yt as input and Xt as output (Figure 2.22). Figure 2.22. COV [∇12SOI, ∇12RECRUITS] is largest at lags m = −8, −9, −10. Largest (absolute) values of COV [Xt+m, Yt] = COV [Xt, Yt−m]
128 are at m = −8, −9, −10, indicating a linear relation- ship between Xt and Yt+8, Yt+9,Yt+10. Thus we might regress Yt+10
- n Yt+9, Yt+8 and Xt; equiv-
alently Yt on Yt−1, Yt−2, Xt−10: Yt = β1Yt−1+β2Yt−2+β3Xt−10+Zt, (t = 11, ..., 441) for a series {Zt} following a time series model, and being white noise if we’re lucky. See Figure 2.23. Figure 2.23. Data and regression fit, Yt on Yt−1, Yt−2, Xt−10. R2 = 90%.
Multiple regression on Y(t)
AICc = 6.07724   Variance = 158.450   df = 428   R2 = 0.8995

predictor   coef      st. error   t-ratio    p-value
beta(1)      1.3086    .0445       29.4389    .000
beta(2)      -.4199    .0441       -9.5206    .000
beta(3)     -4.1032   1.7181       -2.3883    .017

Figure 2.24. ACF and PACF for residual series {Zt}.
The residual series $\{Z_t\}$ still exhibits some trends - see Figure 2.24 - seasonal ARMA(4,1)$_{12}$? ARIMA Search, with AICc, selects ARMA(1,1)$_{12}$. With BIC, ARMA(1,2)$_{12}$ is selected. Each of these was fitted. Both failed the cumulative spectrum test, and the Box-Pierce test with M − K = 1, although the p-values at M − K = 20 were non-significant. (What are these, and what do they test for? What does "p-value" mean?) This suggested adding an AR(1) term to the simpler of the two previous models. In terms of the backshift operator (what is it? how are these polynomials manipulated?):
$$(1 - \phi B)\left(1 - \Phi B^{12}\right)Z_t = \left(1 - \Theta B^{12}\right)w_t.$$
ARIMA(1,0,0)x(1,0,1)x12 from Z(t)
AICc = 5.45580   variance = 85.0835   d.f. = 415
Start values derived from ACF and PACF

predictor   coef     st. error   t-ratio    p-value
AR(1)       -.0852   .04917      -1.7327    .083
SAR(1)      -.1283   .05030      -2.5516    .011
SMA(1)       .8735   .02635      33.1524    .000

(1 +.09B1) (1 +.13B12) Z(t) = (1 -.87B12) w(t)

See Figure 2.25. All residual tests were passed, with the exception of the Normal Scores test (what does this mean? how does this test work?).
Figure 2.25. ACF and PACF of residuals from an ARIMA(1,0,0) × (1,0,1)$_{12}$ fit to $Z_t$.

Note that:
- $\{Z_t\}$ seems to be linear, hence stationary. (What does linearity mean, and how can we make this claim? How can we then represent $Z_t$ as a linear series $Z_t = \alpha(B)w_t = \sum_{k=0}^{\infty}\alpha_k w_{t-k}$?)
- $\{Z_t\}$ seems to be invertible. (What does invertibility mean, and how can we make this claim? Representation?)
The polynomial $\beta(B) = 1 - \beta_1 B - \beta_2 B^2$ has zeros B = 1.34, 1.77. This yields the additional representation
$$Y_t = \frac{\beta_3 B^{10}}{\beta(B)}X_t + \frac{\alpha(B)}{\beta(B)}w_t = \beta_3\gamma(B)B^{10}X_t + \delta(B)w_t,$$
with $1/\beta(B)$ expanded as a series $\gamma(B)$ and $\delta(B) = \alpha(B)/\beta(B)$. Assume $\{w_t\}$ independent of $\{X_t, Y_t\}$.
Predictions: You should be sufficiently familiar with the prediction theory covered in class that you can follow the following development, although I wouldn't expect you to come up with something like this on your own. (The point of knowing the theory is so that you can still do something sensible when the theory you know doesn't quite apply!) The best forecast of $Y_{t+l}$ is $\hat Y^t_{t+l} = E\left[Y_{t+l}\mid \mathbf Y^t\right]$ (estimated), in the case of a single series $\{Y_t\}$. (Why? - you should be able to apply the Double Expectation Theorem, so as to derive this result. And what does "best" mean, in this context?)
If we think of the data now as consisting of the union of the series $\{X_t, Y_t\}$ then it follows that the predictions are
$$\hat Y^t_{t+l} = E\left[Y_{t+l}\mid \mathbf X^t, \mathbf Y^t\right] = E\left[\beta_1 Y_{t+l-1} + \beta_2 Y_{t+l-2}\mid \mathbf X^t, \mathbf Y^t\right] + E\left[\beta_3 X_{t+l-10}\mid \mathbf X^t, \mathbf Y^t\right] + E\left[Z_{t+l}\mid \mathbf X^t, \mathbf Y^t\right] = \beta_1\hat Y^t_{t+l-1} + \beta_2\hat Y^t_{t+l-2} + \beta_3\hat X^t_{t+l-10} + \sum_{k\ge l}\alpha_k w_{t+l-k}.$$
Upon estimating parameters these are
$$= \begin{cases}\hat\beta_1 Y_t + \hat\beta_2 Y_{t-1} + \hat\beta_3 X_{t-9}, & l = 1,\\ \hat\beta_1\hat Y^t_{t+1} + \hat\beta_2 Y_t + \hat\beta_3 X_{t-8}, & l = 2,\\ \hat\beta_1\hat Y^t_{t+l-1} + \hat\beta_2\hat Y^t_{t+l-2} + \hat\beta_3 X_{t+l-10}, & l = 3,\ldots,10,\end{cases}\quad\text{plus}\quad \sum_{k\ge l}\hat\alpha_k\hat w_{t+l-k}\ \left(= \hat Z^t_{t+l},\ \text{the forecast from the 2nd fit}\right).$$
For $l > 10$ we would need to obtain the forecasts $\hat X^t_{t+l-10}$ as well.
For $l = 1,\ldots,10$, from $Y_t = \beta_3\gamma(B)B^{10}X_t + \delta(B)w_t$, we obtain
$$Y_{t+l} = \beta_3\gamma(B)X_{t+l-10} + \delta(B)w_{t+l},\qquad \hat Y^t_{t+l} = \beta_3\gamma(B)X_{t+l-10} + \sum_{k\ge l}\delta_k B^k w_{t+l},$$
and hence
$$Y_{t+l} - \hat Y^t_{t+l} = \sum_{k<l}\delta_k B^k w_{t+l} = \sum_{k=0}^{l-1}\delta_k w_{t+l-k},\qquad VAR\left[Y_{t+l} - \hat Y^t_{t+l}\right] = \sigma_w^2\sum_{k=0}^{l-1}\delta_k^2.$$
This leads to prediction intervals
$$\hat Y^t_{t+l} \pm z_{\alpha/2}\,\hat\sigma_w\sqrt{\sum_{k=0}^{l-1}\hat\delta_k^2}.$$
How now are the $\hat\delta_k$ computed?
Other theoretical things you should be familiar with:
- Calculations of ACFs (pretty simple for an MA; derive and solve the Yule-Walker equations for an AR). What is the important property of the ACF of an MA?
- Calculation of PACFs - define the PACF, then do the required minimization. What is the important property of the PACF of an AR?
- Maximum likelihood estimation
— The end result of our theoretical discussion was that the MLEs of the AR and MA parameters φ and θ are the minimizers of the sum of squares of the residuals:
$$S(\boldsymbol\phi,\boldsymbol\theta) = \sum w_t^2(\boldsymbol\phi,\boldsymbol\theta).$$
For a pure AR this is accomplished by a linear regression of the series on its own past. If there are MA terms then an iterative procedure (Gauss-Newton) is necessary. As an example, you should be able to show that, under the usual assumption (what is it?), for an MA(1) model you get
$$w_t(\theta) = \sum_{s=0}^{t-1}\theta^s X_{t-s},$$
so that $S(\theta)$ is a polynomial in θ, of degree 2(T − 1), to be minimized. What would you use as the starting estimate in the iterative procedure? How would the iterations then be carried out? Your calculations should yield the iterative procedure
$$\theta_{k+1} = \theta_k - \frac{\sum w_t(\theta_k)\dot w_t(\theta_k)}{\sum\dot w_t^2(\theta_k)}.$$
— The usual estimate of the noise variance, obtained by modifying the MLE, is
$$\hat\sigma_w^2 = \frac{S\left(\hat{\boldsymbol\phi},\hat{\boldsymbol\theta}\right)}{\#\text{ of residuals} - \#\text{ of parameters}} = \frac{\sum(\text{residuals})^2}{\#\text{ of residuals} - \#\text{ of parameters}}.$$
— What is the "Information Matrix", and what role does it play in making inferences about the parameters in an ARMA model?