SLIDE 1

STATISTICS 479/503 TIME SERIES ANALYSIS PART II Doug Wiens April 12, 2005

SLIDE 2

Contents

II Time Domain Analysis ... 38

5 Lecture 5 ... 39
6 Lecture 6 ... 47
7 Lecture 7 ... 56
8 Lecture 8 ... 62
9 Lecture 9 ... 71

SLIDE 3

10 Lecture 10 ... 78
11 Lecture 11 ... 87
12 Lecture 12 ... 97
13 Lecture 13 ... 115
14 Lecture 14 (Review) ... 126

SLIDE 4

Part II: Time Domain Analysis

SLIDE 5

5. Lecture 5

• Suppose $\{X_t\}$ is (weakly) stationary and non-deterministic ($X_t$ is not a non-random function of $X_{t-1}, X_{t-2}, \dots$, i.e. it cannot be predicted exactly from the past). Write $X_t - \mu$ as $\dot X_t$. Then (Wold's Representation Theorem) we can represent $\dot X_t$ as
$$\dot X_t = w_t + \psi_1 w_{t-1} + \psi_2 w_{t-2} + \cdots = \sum_{k=0}^{\infty} \psi_k w_{t-k} \quad (\text{with } \psi_0 = 1) \qquad (1)$$
where $\sum_{k=1}^{\infty} \psi_k^2 < \infty$.

Salient feature: a linear function of past and present (not future) disturbances. Interpretation: convergence in mean square, i.e.
$$E\left[\left(\dot X_t - \sum_{k=0}^{K} \psi_k w_{t-k}\right)^2\right] \to 0 \quad \text{as } K \to \infty.$$

SLIDE 6

— The conditions ensure that we can take term-by-term expectations of series of the form $\sum_{k=0}^{\infty} \psi_k X_{t-k}$, provided the $X_{t-k}$ have expectations:
$$E\left[\sum_{k=0}^{\infty} \psi_k X_{t-k}\right] = \sum_{k=0}^{\infty} \psi_k E[X_{t-k}].$$

• If (1) holds we say $\{X_t\}$ is a linear process (also called causal in the text, i.e. it does not depend on the future). Thus Wold's Representation Theorem can be interpreted as saying that Stationarity ⇒ Linearity.

SLIDE 7

The converse holds as well. Assume $\{X_t\}$ is linear; then

(i) $E[X_t] = \mu + E[\dot X_t] = \mu + \sum_{k=0}^{\infty} \psi_k E[w_{t-k}] = \mu$.

(ii)
$$\mathrm{COV}[X_t, X_{t+m}] = E[\dot X_t \dot X_{t+m}] = E\left[\sum_{k=0}^{\infty} \psi_k w_{t-k} \sum_{l=0}^{\infty} \psi_l w_{t+m-l}\right]
= \sum_{k=0}^{\infty}\sum_{l=0}^{\infty} \psi_k \psi_l E[w_{t-k} w_{t+m-l}]
= \sum_{k=0}^{\infty}\sum_{l=0}^{\infty} \psi_k \psi_l \,\sigma_w^2 I(l = m + k)
= \sigma_w^2 \sum_{k=0}^{\infty} \psi_k \psi_{k+m}.$$
(In particular, $\mathrm{VAR}[X_t] = \sigma_w^2 \sum_{k=0}^{\infty} \psi_k^2 < \infty$.)

Thus Stationarity ⇔ Linearity.
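The autocovariance formula above lends itself to a quick numerical check. Here is a minimal sketch (plain NumPy; the helper name, truncation point, and example weights are my own choices, not from the course) that computes $\gamma(m) = \sigma_w^2\sum_k \psi_k\psi_{k+m}$ from a truncated set of $\psi$-weights and compares it with the sample autocovariance of a simulated linear process.

```python
import numpy as np

def linear_process_acov(psi, sigma2_w, max_lag):
    """gamma(m) = sigma_w^2 * sum_k psi_k psi_{k+m}, with psi truncated to a finite vector."""
    psi = np.asarray(psi, dtype=float)
    return np.array([sigma2_w * np.sum(psi[:len(psi) - m] * psi[m:])
                     for m in range(max_lag + 1)])

# Example: psi_k = 0.6^k (an AR(1)-type weight sequence), truncated at K = 200
K, sigma2_w = 200, 1.0
psi = 0.6 ** np.arange(K + 1)
gamma = linear_process_acov(psi, sigma2_w, max_lag=5)

# Compare with the sample autocovariance of a simulated series X_t = sum_k psi_k w_{t-k}
rng = np.random.default_rng(0)
w = rng.normal(scale=np.sqrt(sigma2_w), size=5000 + K)
x = np.convolve(w, psi, mode="valid")
xc = x - x.mean()
sample_gamma = np.array([np.mean(xc[:len(xc) - m] * xc[m:]) for m in range(6)])
print(np.round(gamma, 3))
print(np.round(sample_gamma, 3))
```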

SLIDE 8

• Backshift operator: $B(X_t) = X_{t-1}$, $B^2(X_t) = B \circ B(X_t) = B(X_{t-1}) = X_{t-2}$, etc. Then $\{X_t\}$ linear ⇒ $\dot X_t = \psi(B) w_t$ for the characteristic polynomial $\psi(B) = 1 + \psi_1 B + \psi_2 B^2 + \cdots$. This is not really a polynomial, but if it is, i.e. $\psi_k = 0$ for $k > q$, we say $\{X_t\}$ is a moving average series of order $q$, written MA(q). We usually write $\psi_k = -\theta_k$. Then
$$\dot X_t = w_t - \theta_1 w_{t-1} - \theta_2 w_{t-2} - \cdots - \theta_q w_{t-q} = \theta(B) w_t$$
for $\theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q$, the MA(q) characteristic polynomial.

• The above convention, with $\theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q$, is as in ASTSA, and is consistent with an earlier version of the text and many other texts.

SLIDE 9

The current version of the text uses $\dot X_t = w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q}$ and $\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$. To be consistent with ASTSA I'll use the former.

• Invertibility: $\{X_t\}$ is invertible if it can be represented as $\dot X_t = \phi_1 \dot X_{t-1} + \phi_2 \dot X_{t-2} + \cdots + w_t$, where $\sum_{k=1}^{\infty} |\phi_k| < \infty$. Thus, apart from some noise, $X_t$ is a function of the past history of the process. Generally, only invertible processes are of practical interest. In terms of the backshift operator,
$$w_t = \dot X_t - \phi_1 \dot X_{t-1} - \phi_2 \dot X_{t-2} - \cdots = \phi(B)\dot X_t,$$
where $\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots$ is the characteristic polynomial. If it is a true polynomial, i.e. if $\phi_j = 0$ for $j > p$, we say $\{X_t\}$ is an autoregressive process of order $p$, i.e. AR(p). Then $\dot X_t = \phi_1 \dot X_{t-1} + \phi_2 \dot X_{t-2} + \cdots + \phi_p \dot X_{t-p} + w_t$.

SLIDE 10

— When $\sum_{k=1}^{\infty} |\phi_k| < \infty$ we say the series is absolutely summable. The importance of absolute summability is that such series can be re-arranged: they can be summed in any order. In contrast, $\sum_{k=1}^{\infty}(-1)^{k+1}/k = \ln 2 \approx .69$, but the series is not absolutely summable: $\sum_{k=1}^{\infty} 1/k = \infty$. The original series can be re-arranged to give just about anything; for instance
$$\left(1 + \tfrac{1}{3} - \tfrac{1}{2}\right) + \left(\tfrac{1}{5} + \tfrac{1}{7} - \tfrac{1}{4}\right) + \cdots,$$
in which two positive terms are always followed by a negative one, converges to something $> .8$.

• When is a stationary (i.e. linear) process invertible? Let $\{X_t\}$ be linear, so $\dot X_t = \psi(B) w_t$ and $\sum_{k=1}^{\infty}\psi_k^2 < \infty$. Suppose it is invertible. Then $\phi(B)\dot X_t = w_t$; thus $\phi(B)\psi(B) w_t = w_t$ and $\phi(B)\psi(B) = 1$. Thus $\phi(B) = 1/\psi(B)$, i.e. $1/\psi(B)$ has a power series expansion with absolutely summable coefficients. This makes $\psi(B)$ quite special.

SLIDE 11

— Example: MA(1); $\psi(B) = 1 - \theta B$ for some $\theta$. Then if invertible we must have
$$1/\psi(B) = 1 + \theta B + \theta^2 B^2 + \cdots = \sum_{j=0}^{\infty}\theta^j B^j, \quad\text{AND}\quad 1 + |\theta| + |\theta^2| + \cdots < \infty;$$
this last point holds iff $|\theta| < 1$. Note that the root of $\theta(B) = 0$ is $B = 1/\theta$, and then $|\theta| < 1 \Leftrightarrow |B| > 1$, i.e. the MA(1) process with $\theta(B) = 1 - \theta B$ is invertible iff the root of $\theta(B) = 0$ satisfies $|B| > 1$.

— In general, a linear process $X_t = \psi(B) w_t$ is invertible iff all roots of the characteristic equation $\psi(B) = 0$ satisfy $|B| > 1$ (complex modulus), i.e. they "lie outside the unit circle in the complex plane."

— The modulus of a complex number $z = a + ib$ is $|z| = \sqrt{a^2 + b^2}$ (like the norm of a vector with coordinates $(a, b)$).

SLIDE 12

— e.g. $X_t = w_t - 2w_{t-1} + 2w_{t-2}$; $\psi(B) = 1 - 2B + 2B^2 = 0$ for $B = .5 \pm .5i$; $|B| = 1/\sqrt{2} \approx .7 < 1$. Non-invertible.

— Similarly, an invertible process is stationary iff all roots of $\phi(B) = 0$ lie outside the unit circle. This is called the stationarity condition. E.g. for an AR(1) the stationarity condition is $|\phi| < 1$.
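These root conditions are easy to check numerically. Below is a minimal sketch (NumPy; the helper name and the second example are mine) that tests whether all roots of a characteristic polynomial $1 - c_1 B - c_2 B^2 - \cdots$ lie outside the unit circle; it reproduces the non-invertible MA(2) example above.

```python
import numpy as np

def roots_outside_unit_circle(coeffs):
    """coeffs = [c1, c2, ...] for the polynomial 1 - c1*B - c2*B^2 - ...
    Returns the roots and whether all have modulus > 1."""
    poly = np.r_[1.0, -np.asarray(coeffs, dtype=float)]   # coefficients in increasing powers of B
    roots = np.roots(poly[::-1])                           # np.roots wants highest degree first
    return roots, bool(np.all(np.abs(roots) > 1))

# MA(2) example from the slide: psi(B) = 1 - 2B + 2B^2, i.e. c1 = 2, c2 = -2
roots, invertible = roots_outside_unit_circle([2.0, -2.0])
print(np.round(roots, 3), "invertible:", invertible)       # moduli ~ 0.707 -> not invertible

# AR(1) with phi = 0.9: phi(B) = 1 - 0.9B, root at B = 1/0.9 > 1 -> stationary
print(roots_outside_unit_circle([0.9]))
```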

SLIDE 13

6. Lecture 6

• Review of previous lecture. Assume now for simplicity that the mean is zero:

— Linear process: $X_t = \psi(B) w_t$, $\psi(B) = 1 + \psi_1 B + \psi_2 B^2 + \cdots$ with $\sum_{k=0}^{\infty}\psi_k^2 < \infty$. Then
$$\gamma(m) = \sigma_w^2\sum_{k=0}^{\infty}\psi_k\psi_{k+m}.$$

— Linear + "$\psi_k = 0$ for $k > q$": MA(q) process, $\gamma(m) = 0$ for $m > q$. Characteristic polynomial written as $\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q$.

— Invertible process: $\phi(B) X_t = w_t$, $\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots$ with $\sum_k |\phi_k| < \infty$. Note this is really $\dot X_t$; a non-zero mean can be accommodated as follows:
$$w_t = \phi(B)\dot X_t = \dot X_t - \phi_1\dot X_{t-1} - \cdots = (X_t - \mu) - \phi_1(X_{t-1} - \mu) - \cdots = \{X_t - \phi_1 X_{t-1} - \cdots\} - \mu\{1 - \phi_1 - \phi_2 - \cdots\} = \phi(B) X_t - \alpha,$$

SLIDE 14

if $\alpha = \mu\,\phi(1)$.

— Invertible + "$\phi_j = 0$ for $j > p$": AR(p) process.

— Wold's Theorem: Stationary ⇔ Linear.

— A stationary process is invertible iff all roots of $\psi(B) = 0$ lie outside the unit circle. Thus an MA(q) is stationary (linear), but not necessarily invertible.

— An invertible process is stationary iff all roots of $\phi(B) = 0$ lie outside the unit circle. Thus an AR(p) is invertible, but not necessarily stationary.

• Example: MA(2). $X_t = w_t - \theta_1 w_{t-1} - \theta_2 w_{t-2}$, $\theta(B) = 1 - \theta_1 B - \theta_2 B^2$. If $\theta_1^2 + 4\theta_2 < 0$ (so both roots are complex), then invertibility requires $|\theta_2| < 1$. Suppose this is

SLIDE 15

so. To invert: we need $\theta(B)\phi(B) = 1$, where $\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots$, so
$$1 = \left[1 - \theta_1 B - \theta_2 B^2\right]\cdot\left[1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_k B^k - \cdots\right]
= 1 - (\phi_1 + \theta_1)B - (\phi_2 - \theta_1\phi_1 + \theta_2)B^2 - \cdots - (\phi_k - \phi_{k-1}\theta_1 - \phi_{k-2}\theta_2)B^k - \cdots.$$
Matching coefficients: $\phi_1 = -\theta_1$, $\phi_2 = \theta_1\phi_1 - \theta_2 = -\theta_1^2 - \theta_2$, and
$$\phi_k = \theta_1\phi_{k-1} + \theta_2\phi_{k-2}, \quad k = 3, 4, \dots$$
(with the bookkeeping convention $\phi_0 = -1$ the same recursion also reproduces the $k = 2$ case).
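A quick numerical illustration of this matching (NumPy; the function name and the example values $\theta_1 = 0.5$, $\theta_2 = -0.6$ are mine): compute the first few $\phi_k$ by the recursion and confirm that $(1 - \theta_1 B - \theta_2 B^2)(1 - \phi_1 B - \phi_2 B^2 - \cdots) \approx 1$.

```python
import numpy as np

def invert_ma2(theta1, theta2, K=20):
    """phi_k coefficients of phi(B) = 1 - phi_1 B - ... such that theta(B) phi(B) = 1."""
    phi = np.zeros(K + 1)
    phi[0] = -1.0                      # bookkeeping convention so the recursion starts cleanly
    phi[1] = -theta1
    for k in range(2, K + 1):
        phi[k] = theta1 * phi[k - 1] + theta2 * phi[k - 2]
    return phi[1:]                     # phi_1, phi_2, ...

theta1, theta2 = 0.5, -0.6             # complex roots and |theta2| < 1, so invertible
phi = invert_ma2(theta1, theta2)

# Check that theta(B) * phi(B) = 1 up to the truncation order
theta_poly = np.array([1.0, -theta1, -theta2])   # increasing powers of B
phi_poly = np.r_[1.0, -phi]
product = np.convolve(theta_poly, phi_poly)
print(np.round(product[:10], 6))                  # ~ [1, 0, 0, ..., 0]
```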

SLIDE 16

• ARMA models are defined in operator notation by $\phi(B) X_t = \theta(B) w_t$; if $\phi(B)$ is an AR(p) characteristic polynomial and $\theta(B)$ an MA(q), we say $\{X_t\}$ is an ARMA(p,q) process. It is stationary (linear, causal) if $X_t = \psi(B) w_t$ for a series $\psi(z) = \sum\psi_k z^k$, $|z| \le 1$, with square summable coefficients. Then the coefficients $\psi_k$ are determined from $\theta(z)/\phi(z) = \psi(z)$. It can be shown that $\psi(z)$ has the required properties only if all zeros of $\phi(z)$ lie outside the unit circle. Similarly an ARMA(p,q) is invertible only if all zeros of $\theta(z)$ lie outside the unit circle. We also require that the polynomials have no common factors.

• Example (Example 2.6 in text):
$$X_t = .4X_{t-1} + .45X_{t-2} + w_t + w_{t-1} + .25w_{t-2}$$
$$\Rightarrow (1 - .4B - .45B^2)X_t = (1 + B + .25B^2)w_t \Rightarrow (1 - .9B)(1 + .5B)X_t = (1 + .5B)(1 + .5B)w_t \Rightarrow (1 - .9B)X_t = (1 + .5B)w_t.$$

SLIDE 17

Thus the series is both stationary and invertible. It is ARMA(1,1), not ARMA(2,2) as it initially appeared. Students should verify that the above can be continued as
$$X_t = \left[\sum_{j=0}^{\infty}(.9)^j B^j\cdot(1 + .5B)\right]w_t = \left[1 + (.9 + .5)B + \cdots + (.9)^{j-1}(.9 + .5)B^j + \cdots\right]w_t = \psi(B)w_t,$$
where $\psi(z) = \sum\psi_k z^k$ and $\psi_0 = 1$, $\psi_k = 1.4(.9)^{k-1}$.
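A small check of Example 2.6 (NumPy; the helper name is mine): recover $\psi(z) = \theta(z)/\phi(z)$ for the reduced ARMA(1,1) by the standard recursion and compare with the closed form $\psi_k = 1.4(0.9)^{k-1}$.

```python
import numpy as np

def arma_psi_weights(phi, theta, K=15):
    """psi weights of phi(B) X_t = theta(B) w_t, with
    phi(B) = 1 - phi_1 B - ..., theta(B) = 1 - theta_1 B - ... (ASTSA sign convention)."""
    psi = np.zeros(K + 1)
    psi[0] = 1.0
    for k in range(1, K + 1):
        acc = -(theta[k - 1] if k - 1 < len(theta) else 0.0)
        for j in range(1, min(k, len(phi)) + 1):
            acc += phi[j - 1] * psi[k - j]
        psi[k] = acc
    return psi

# Reduced model (1 - 0.9B) X_t = (1 + 0.5B) w_t, i.e. phi_1 = 0.9, theta_1 = -0.5
psi = arma_psi_weights(phi=[0.9], theta=[-0.5], K=8)
closed_form = np.r_[1.0, 1.4 * 0.9 ** np.arange(8)]
print(np.round(psi, 4))
print(np.round(closed_form, 4))
```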

• Box-Jenkins methodology:

1. Determine the theoretical ACF (and PACF, to be defined) for these and other classes of time series models. Use the sample ACF/PACF to match the data to a possible model (MA(q), AR(p), etc.).

SLIDE 18

2. Estimate parameters using a method appropriate to the chosen model, assess the fit, study the residuals. The notion of residual will require a special treatment; for now think of them as $X_t - \hat X_t$ where, e.g., in an AR(p) model ($X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + w_t$) we have $\hat X_t = \hat\phi_1 X_{t-1} + \hat\phi_2 X_{t-2} + \cdots + \hat\phi_p X_{t-p}$. The residuals should then "look like" white noise (why?). If the fit is inadequate, revise steps 1 and 2.

3. Finally, use the model to forecast.

• We treat these three steps in detail. Recall that for an MA(q), the autocovariance function is
$$\gamma(m) = \begin{cases}\sigma_w^2\sum_{k=0}^{q-m}\theta_k\theta_{k+m}, & 0 \le m \le q, \\ 0, & m > q.\end{cases}$$
The salient feature is that $\gamma(m) = 0$ for $m > q$; we look for this in the sample ACF. See Figures 2.1, 2.2.

SLIDE 19

Figure 2.1. Sample ACF of simulated MA(3) series. Figure 2.2. Sample ACF of simulated MA(3) series.

SLIDE 20

• ACF of an AR(p) process: Let $j \ge 0$; assume the process is stationary. Then
$$w_t = X_t - \sum_{i=1}^{p}\phi_i X_{t-i} \;\Rightarrow\; \mathrm{COV}\left[X_t - \sum_{i=1}^{p}\phi_i X_{t-i},\; X_{t-j}\right] = \mathrm{COV}[w_t, X_{t-j}] \;\Rightarrow\; \gamma(j) - \sum_{i=1}^{p}\phi_i\gamma(j-i) = \mathrm{COV}[w_t, X_{t-j}].$$
Under the stationarity condition, $X_{t-j}$ is a linear combination $w_{t-j} + \psi_1 w_{t-j-1} + \psi_2 w_{t-j-2} + \cdots$ with
$$\mathrm{COV}[w_t, X_{t-j}] = \mathrm{COV}[w_t,\; w_{t-j} + \psi_1 w_{t-j-1} + \psi_2 w_{t-j-2} + \cdots] = \sigma_w^2 I(j = 0),$$
thus
$$\gamma(j) - \sum_{i=1}^{p}\phi_i\gamma(j-i) = \begin{cases}\sigma_w^2, & j = 0, \\ 0, & j > 0.\end{cases}$$
These are the "Yule-Walker" equations, to be solved to obtain $\gamma(j)$ for $j \ge 0$; then $\gamma(-j) = \gamma(j)$.

SLIDE 21

• Example: AR(1). The Yule-Walker equations are
$$\gamma(j) - \phi\gamma(j-1) = \begin{cases}\sigma_w^2, & j = 0, \\ 0, & j > 0.\end{cases}$$
We get $\gamma(0) = \phi\gamma(1) + \sigma_w^2$ and $\gamma(j) = \phi\gamma(j-1)$ for $j > 0$. In particular $\gamma(0) = \phi\gamma(1) + \sigma_w^2 = \phi(\phi\gamma(0)) + \sigma_w^2$, so
$$\gamma(0) = \frac{\sigma_w^2}{1 - \phi^2}.$$
Note that $0 < \gamma(0) = \mathrm{VAR}[X_t] < \infty$ by the stationarity condition $|\phi| < 1$. Iterating $\gamma(j) = \phi\gamma(j-1)$ gives $\gamma(j) = \phi^j\gamma(0)$, $j = 1, 2, 3, \dots$. Thus $\rho(j) = \gamma(j)/\gamma(0) = \phi^{|j|}$.
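For higher-order AR models the Yule-Walker equations are usually solved numerically. A minimal sketch (NumPy; the helper name and the AR(2) coefficients are arbitrary choices of mine): set up the $p+1$ linear equations in $\gamma(0),\dots,\gamma(p)$, solve, then extend by the recursion $\gamma(j) = \sum_i\phi_i\gamma(j-i)$.

```python
import numpy as np

def ar_autocovariances(phi, sigma2_w, max_lag):
    """Solve the Yule-Walker equations for a stationary AR(p) with coefficients phi."""
    p = len(phi)
    # Unknowns gamma(0..p); equation j: gamma(j) - sum_i phi_i gamma(|j-i|) = sigma2_w * I(j=0)
    A = np.zeros((p + 1, p + 1))
    b = np.zeros(p + 1)
    b[0] = sigma2_w
    for j in range(p + 1):
        A[j, j] += 1.0
        for i in range(1, p + 1):
            A[j, abs(j - i)] -= phi[i - 1]
    gamma = list(np.linalg.solve(A, b))
    for j in range(p + 1, max_lag + 1):            # extend by the recursion
        gamma.append(sum(phi[i - 1] * gamma[j - i] for i in range(1, p + 1)))
    return np.array(gamma)

# AR(1) check: rho(j) = phi^j
g = ar_autocovariances([0.7], sigma2_w=1.0, max_lag=5)
print(np.round(g / g[0], 4))            # [1, 0.7, 0.49, ...]

# An AR(2) example
g2 = ar_autocovariances([0.5, 0.3], sigma2_w=1.0, max_lag=5)
print(np.round(g2 / g2[0], 4))
```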

SLIDE 22

7. Lecture 7

• It is difficult to identify an AR(p) from its ACF.

• Suppose that a series is AR(1), and consider forecasting $X_t$ from two previous values $X_{t-1}, X_{t-2}$: $X_t = \phi X_{t-1} + w_t$, $\hat X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2}$. One suspects that the "best" $\alpha$'s will be $\alpha_1 = \phi$, $\alpha_2 = 0$. This is in fact true, and is a property of the "Partial Autocorrelation Function (PACF)".

• Assume $\mu_X = 0$; consider the problem of minimizing the function
$$f_m(\alpha_{1,m},\dots,\alpha_{m,m}) = E\left[\{X_t - \alpha_{1,m}X_{t-1} - \cdots - \alpha_{m,m}X_{t-m}\}^2\right],$$
which is the MSE when $X_t$ is forecast by $\hat X_t = \alpha_{1,m}X_{t-1} + \cdots + \alpha_{m,m}X_{t-m}$.

SLIDE 23

Let the minimizers be $\alpha^*_{1,m},\dots,\alpha^*_{m,m}$. The lag-$m$ PACF value, written $\phi_{mm}$, is defined to be $\alpha^*_{m,m}$.

• It can also be shown that
$$\phi_{mm} = \mathrm{CORR}\left[X_t - \hat X_t,\; X_{t-m} - \hat X_{t-m}\right],$$
where each $\hat X$ denotes the best (i.e. minimum MSE) predictor which is a linear function of $X_{t-1},\dots,X_{t-m+1}$.

• To compute: solve the $m$ equations in $m$ unknowns
$$0 = -\tfrac{1}{2}\frac{\partial f_m}{\partial\alpha_{j,m}} = E\left[X_{t-j}\cdot\{X_t - \alpha_{1,m}X_{t-1} - \cdots - \alpha_{m,m}X_{t-m}\}\right] = \gamma(j) - \alpha_{1,m}\gamma(j-1) - \cdots - \alpha_{m,m}\gamma(j-m),$$
for $j = 1,\dots,m$; i.e.
$$\sum_{i=1}^{m}\alpha_{i,m}\gamma(j-i) = \gamma(j).$$
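These normal equations are linear in the $\alpha_{\cdot,m}$, with a Toeplitz coefficient matrix built from the autocovariances. A minimal sketch (NumPy; the helper name is mine) that computes $\phi_{mm}$ for $m = 1, 2, \dots$ from a given autocovariance function; for an AR(1) it matches the closed forms given on the next slide.

```python
import numpy as np

def pacf_from_acov(gamma, max_m):
    """phi_mm for m = 1..max_m, solving sum_i alpha_{i,m} gamma(j-i) = gamma(j), j = 1..m."""
    gamma = np.asarray(gamma, dtype=float)
    phis = []
    for m in range(1, max_m + 1):
        G = np.array([[gamma[abs(j - i)] for i in range(m)] for j in range(m)])
        rhs = gamma[1:m + 1]
        alpha = np.linalg.solve(G, rhs)
        phis.append(alpha[-1])                   # phi_mm is the last coefficient
    return np.array(phis)

# AR(1) with phi = 0.7: gamma(j) proportional to 0.7^j, so phi_11 = 0.7 and phi_mm = 0 for m > 1
gamma = 0.7 ** np.arange(10)
print(np.round(pacf_from_acov(gamma, 4), 6))     # [0.7, 0, 0, 0]
```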

SLIDE 24

Then
$$m = 1:\; \phi_{11} = \rho(1), \qquad m = 2:\; \phi_{22} = \frac{\rho(2) - \rho^2(1)}{1 - \rho^2(1)}, \quad\text{etc.}$$

• Note that for an AR(1), $\rho(j) = \phi^j$ and so $\phi_{11} = \phi$, $\phi_{22} = 0$. See Figure 2.3. Figure 2.3. Sample PACF from simulated AR(1) series.

SLIDE 25

• In general, if $\{X_t\}$ is AR(p) and stationary, then $\phi_{pp} = \phi_p$ and $\phi_{mm} = 0$ for $m > p$.

Proof: Write $X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + w_t$, so for $m \ge p$
$$f_m(\alpha_{1,m},\dots,\alpha_{m,m}) = E\left[\left\{w_t + \sum_{j=1}^{p}(\phi_j - \alpha_{j,m})X_{t-j} - \sum_{j=p+1}^{m}\alpha_{j,m}X_{t-j}\right\}^2\right] = E\left[\{w_t + Z\}^2\right], \text{ say}$$
(where $Z$ is uncorrelated with $w_t$ - why?), $= \sigma_w^2 + E[Z^2]$.

This is minimized if $Z = 0$ with probability 1, i.e. if $\alpha_{j,m} = \phi_j$ for $j \le p$ and $= 0$ for $j > p$.

SLIDE 26

• Forecasting. Given r.v.s $X_t, X_{t-1}, \dots$ (into the infinite past, in principle) we wish to forecast a future value $X_{t+l}$. Let the forecast be $X^t_{t+l}$. We will later show that the "best" forecast is
$$X^t_{t+l} = E\left[X_{t+l}\mid X_t, X_{t-1}, \dots\right],$$
the conditional expected value of $X_{t+l}$ given $\mathbf{X}^t = \{X_s\}_{s=-\infty}^{t}$. Our general model identification approach, following Box/Jenkins, is then:

1. Tentatively identify a model, generally by looking at its sample ACF/PACF.

2. Estimate the parameters (generally by the method of Maximum Likelihood, to be covered later). This allows us to estimate the forecasts $X^t_{t+l}$, which depend on unknown parameters, by substituting estimates to obtain $\hat X^t_{t+l}$. We define the residuals by $\hat w_t = X_t - \hat X^{t-1}_t$.

SLIDE 27

The (adjusted) MLE of $\sigma_w^2$ is typically (in ARMA(p,q) models)
$$\hat\sigma_w^2 = \frac{1}{T - 2p - q}\sum_{t=p+1}^{T}\hat w_t^2.$$
The $T - 2p - q$ in the denominator is the number of residuals used ($T - p$) minus the number of ARMA parameters estimated ($p + q$).

3. The residuals should "look like" white noise. We study them, and apply various tests of whiteness. To the extent that they are not white, we look for possible alternate models.

4. Iterate; finally use the model to forecast.
SLIDE 28

8. Lecture 8

• Conditional expectation. Example: Randomly choose a stock from a listing. $Y$ = price in one week, $X$ = price in the previous week. To predict $Y$, if we have no information about $X$ then the best (minimum MSE) constant predictor of $Y$ is $E[Y]$. (Why? What mathematical problem is being formulated and solved here?) However, suppose we also know that $X = x$. Then we can improve our forecast, by using the forecast $\hat Y = E[Y\mid X = x]$, the mean price of all stocks whose price in the previous week was $x$.

• In general, if $(X, Y)$ are any r.v.s, then $E[Y\mid X = x]$ is the expected value of $Y$ when the population is restricted to those pairs with $X = x$. We use the following facts about conditional expectation: $(X, Y)$ independent ⇒ $E[Y\mid X = x] = E[Y]$; $E[X\mid X = x] = x$; $E[g(X)\mid X = x] = g(x)$; and more generally $E[g(X, Y)\mid X = x] = E[g(x, Y)\mid X = x]$.

SLIDE 29

In particular, $E[f(X)g(Y)\mid X = x] = f(x)E[g(Y)\mid X = x]$.

• Assume from now on that the white noise is normally distributed. The important consequence is that these terms are now independent, not merely uncorrelated, and so
$$E[w_s\mid w_t, w_{t-1}, \dots] = \begin{cases}w_s, & s \le t, \\ 0, & s > t.\end{cases}$$

• Put $h(x) = E[Y\mid X = x]$. This is a function of $x$; when it is evaluated at the r.v. $X$ we call it $h(X) = E[Y\mid X]$. We have the Double Expectation Theorem: $E\{E[Y\mid X]\} = E[Y]$. The inner expectation is with respect to the conditional distribution and the outer is with respect to the distribution of $X$; the theorem can be stated as $E[h(X)] = E[Y]$, where $h(x)$ is as above.

SLIDE 30

— Example: $Y$ = house values, $X$ = location (neighbourhood) of a house. $E[Y]$ can be obtained by averaging within neighbourhoods, then averaging over neighbourhoods.

• Similarly, $E_X\{E_{Y\mid X}[g(X, Y)\mid X]\} = E[g(X, Y)]$.

• Minimum MSE forecasting. Consider forecasting a r.v. $Y$ (unobservable) using another r.v. $X$ (or set of r.v.s); e.g. $Y = X_{t+l}$, $X = \mathbf{X}^t$. We seek the function $g(X)$ which minimizes the MSE
$$\mathrm{MSE}(g) = E\left[\{Y - g(X)\}^2\right].$$
The required function is $g(X) = E[Y\mid X]$ $(= h(X))$.

SLIDE 31

• Proof: We have to show that for any function $g$, $\mathrm{MSE}(g) \ge \mathrm{MSE}(h)$. Write
$$\mathrm{MSE}(g) = E\left[\{(Y - h(X)) + (h(X) - g(X))\}^2\right] = E\left[\{Y - h(X)\}^2\right] + E\left[\{h(X) - g(X)\}^2\right] + 2E\left[(h(X) - g(X))\cdot(Y - h(X))\right].$$
We will show that the last term $= 0$; then we have $\mathrm{MSE}(g) = \mathrm{MSE}(h) + E[\{h(X) - g(X)\}^2]$, which exceeds $\mathrm{MSE}(h)$, with equality iff $g(X) = h(X)$ with probability 1. To establish the claim we evaluate the expected value in stages:
$$E\left[(h(X) - g(X))\cdot(Y - h(X))\right] = E_X\left\{E_{Y\mid X}\left[(h(X) - g(X))\cdot(Y - h(X))\mid X\right]\right\}.$$
The inner expectation is (why?)
$$(h(X) - g(X))\cdot E_{Y\mid X}\left[(Y - h(X))\mid X\right] = (h(X) - g(X))\cdot\left\{E_{Y\mid X}[Y\mid X] - h(X)\right\} = 0.$$

SLIDE 32

• The minimum MSE is
$$\mathrm{MSE}_{\min} = E\left[\{Y - h(X)\}^2\right] = E_X\left\{E_{Y\mid X}\left[\{Y - h(X)\}^2\mid X\right]\right\} = E_X\{\mathrm{VAR}[Y\mid X]\}.$$
We will show that
$$\mathrm{VAR}[Y] = E_X\{\mathrm{VAR}[Y\mid X]\} + \mathrm{VAR}[E\{Y\mid X\}], \quad\text{i.e.}\quad \mathrm{VAR}[Y] = \mathrm{MSE}_{\min} + \mathrm{VAR}[h(X)];$$
thus $\mathrm{MSE}_{\min} \le \mathrm{VAR}[Y]$. $\mathrm{VAR}[Y]$ is the MSE when $Y$ is forecast by its mean and $X$ is ignored; our result is then that using the information in $X$ never increases the MSE, and results in a strict decrease as long as $\mathrm{VAR}[h(X)] > 0$, i.e. $h(x)$ is non-constant. This is analogous to the within and between breakdown in ANOVA (e.g. variation in house prices within and between neighbourhoods).

SLIDE 33

• Proof of claim:
$$\mathrm{VAR}[Y] = E\left[\{Y - E[Y]\}^2\right] = E\left[\{Y - h(X) + (h(X) - E[Y])\}^2\right] = E\left[\{Y - h(X)\}^2\right] + E\left[\{h(X) - E[Y]\}^2\right] + 2E\left[\{Y - h(X)\}\{h(X) - E[Y]\}\right]$$
$$= \mathrm{MSE}_{\min} + \mathrm{VAR}[h(X)] + 2E_X\left\{E_{Y\mid X}\left[\{Y - h(X)\}\{h(X) - E[Y]\}\mid X\right]\right\}.$$
The inner expectation is $\{h(X) - E[Y]\}\cdot E_{Y\mid X}[Y - h(X)\mid X] = 0$.

• Students should verify that $Y - E[Y\mid X]$ is uncorrelated with $X$.

SLIDE 34

• Assume $\{X_t\}$ is stationary and invertible. We forecast $X_{t+l}$ by $X^t_{t+l} = E[X_{t+l}\mid\mathbf{X}^t]$, where $\mathbf{X}^t = \{X_s\}_{s=-\infty}^{t}$. Note that this forecast is 'unbiased' in that $E[X^t_{t+l}] = E[X_{t+l}]$. By linearity, $X_{t+l}$ can be represented as
$$X_{t+l} = \sum_{k=0}^{\infty}\psi_k w_{t+l-k} \quad (\psi_0 = 1),$$
so that
$$X^t_{t+l} = \sum_{k=0}^{\infty}\psi_k E[w_{t+l-k}\mid\mathbf{X}^t].$$
We have $X_t = \psi(B)w_t$ and (by invertibility) $w_t = \phi(B)X_t$, where $\phi(B)\psi(B) = 1$ determines $\phi(B)$. Thus conditioning on $\mathbf{X}^t$ is equivalent to conditioning on $\mathbf{w}^t = \{w_s\}_{s=-\infty}^{t}$:
$$X^t_{t+l} = \sum_{k=0}^{\infty}\psi_k E[w_{t+l-k}\mid\mathbf{w}^t], \quad\text{where}\quad E[w_{t+l-k}\mid\mathbf{w}^t] = \begin{cases}w_{t+l-k}, & \text{if } l \le k, \\ 0, & \text{otherwise.}\end{cases}$$
(Note that $E[X_{t+l-k}\mid\mathbf{X}^t] = X_{t+l-k}$ if $l \le k$.)

SLIDE 35

Thus the forecast is
$$X^t_{t+l} = \sum_{k=l}^{\infty}\psi_k w_{t+l-k},$$
with forecast error and variance
$$X_{t+l} - X^t_{t+l} = \sum_{k=0}^{l-1}\psi_k w_{t+l-k}, \qquad \mathrm{VAR}[X_{t+l} - X^t_{t+l}] = \sigma_w^2\sum_{k=0}^{l-1}\psi_k^2.$$
Since $\{w_t\}$ is normal, we have
$$X_{t+l} - X^t_{t+l} \sim N\!\left(0,\; \sigma_w^2\sum_{k=0}^{l-1}\psi_k^2\right),$$
and so a $100(1-\alpha)\%$ prediction (forecast) interval on $X_{t+l}$ is
$$X^t_{t+l} \pm z_{\alpha/2}\,\sigma_w\sqrt{\sum_{k=0}^{l-1}\psi_k^2}.$$
Interpretation: the probability that $X_{t+l}$ will lie in this interval is $1 - \alpha$.

SLIDE 36

• In practice we must solve for the $\psi_k$ in terms of the AR and MA parameters of $\{X_t\}$, then substitute estimates of these parameters to obtain estimates $\hat\psi_k$. Substituting these estimates into the expressions above results in the forecast $\hat X^t_{t+l}$; we also must use an estimate $\hat\sigma_w^2$. The residuals, or innovations, are
$$\hat w_t = X_t - \hat X^{t-1}_t,$$
and typically $\hat\sigma_w^2 = \sum\hat w_t^2/(\#\text{ of residuals} - \#\text{ of parameters estimated})$.

SLIDE 37

9. Lecture 9

• Example 1: AR(1) (and stationary). $X_t = \phi X_{t-1} + w_t$.
$$X^t_{t+l} = E[X_{t+l}\mid\mathbf{X}^t] = E[\phi X_{t+l-1} + w_{t+l}\mid\mathbf{X}^t] = \phi E[X_{t+l-1}\mid\mathbf{X}^t] + E[w_{t+l}\mid\mathbf{w}^t] = \begin{cases}\phi X_t, & l = 1, \\ \phi X^t_{t+l-1}, & l > 1.\end{cases}$$
Iterating: $X^t_{t+l} = \phi^l X_t$ for $l \ge 1$.

The calculation of the forecast was easy (it always is for an AR model); determining the forecast variance requires us to determine the $\psi_k$'s (since $\mathrm{VAR}[X_{t+l} - X^t_{t+l}] = \sigma_w^2\sum_{k=0}^{l-1}\psi_k^2$). Usually this is done numerically; in the present case it can be done explicitly: $(1 - \phi B)X_t = w_t$, so
$$X_t = (1 - \phi B)^{-1}w_t = \sum_{k=0}^{\infty}\psi_k w_{t-k},$$

SLIDE 38

for $\psi_k = \phi^k$. Then
$$\sum_{k=0}^{l-1}\psi_k^2 = \frac{1 - \phi^{2l}}{1 - \phi^2},$$
leading to the forecast interval
$$\phi^l X_t \pm z_{\alpha/2}\,\sigma_w\sqrt{\frac{1 - \phi^{2l}}{1 - \phi^2}}.$$
Numerically we replace $\phi$ by its estimate $\hat\phi$; then $\hat X^t_{t+l} = \hat\phi^l X_t$ and the residuals are $\hat w_t = X_t - \hat X^{t-1}_t = X_t - \hat\phi X_{t-1}$ ($t > 1$). Note the similarity with $w_t = X_t - \phi X_{t-1}$. This illustrates the fact that the residual can also be obtained by writing $w_t$ in terms of the data and parameters, and then replacing the parameters with estimates. The estimate of the variance of the noise is
$$\hat\sigma_w^2 = \sum_{t=2}^{T}\hat w_t^2/(T - 2).$$
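A minimal end-to-end sketch of this example (NumPy; the data are simulated and all names are mine): estimate $\phi$ by the lag-1 sample autocorrelation, form the $l$-step forecasts $\hat\phi^l X_T$, the residual-based $\hat\sigma_w^2$, and 95% prediction intervals using $(1 - \hat\phi^{2l})/(1 - \hat\phi^2)$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, phi_true = 500, 0.8
x = np.zeros(T)
for t in range(1, T):                      # simulate a mean-zero stationary AR(1)
    x[t] = phi_true * x[t - 1] + rng.normal()

# Estimate phi by the lag-1 sample autocorrelation
xc = x - x.mean()
phi_hat = np.sum(xc[1:] * xc[:-1]) / np.sum(xc ** 2)

# Residuals and noise-variance estimate (T-2 in the denominator, as on the slide)
w_hat = x[1:] - phi_hat * x[:-1]
sigma2_hat = np.sum(w_hat ** 2) / (T - 2)

# Forecasts and 95% prediction intervals for l = 1..5
z = 1.96
for l in range(1, 6):
    fc = phi_hat ** l * x[-1]
    se = np.sqrt(sigma2_hat * (1 - phi_hat ** (2 * l)) / (1 - phi_hat ** 2))
    print(f"l={l}: forecast {fc:6.3f}, 95% PI ({fc - z * se:6.3f}, {fc + z * se:6.3f})")
```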

SLIDE 39

• Example 2: AR(p). Similar to Example 1, $X_t = \sum_{i=1}^{p}\phi_i X_{t-i} + w_t$ results in
$$\hat X^t_{t+l} = \sum_{i=1}^{p}\hat\phi_i X^t_{t+l-i}, \quad\text{where } X^t_{t+l-i} = X_{t+l-i} \text{ if } l \le i.$$
Now solve (numerically) $X_t = (1/\phi(B))w_t = \psi(B)w_t$ to get the $\psi_k$, then the $\hat\psi_k$ and the standard errors of the forecasts. The innovations are obtained from
$$\hat X^{t-1}_t = \sum_{i=1}^{p}\hat\phi_i X^{t-1}_{t-i} = \sum_{i=1}^{p}\hat\phi_i X_{t-i},$$
giving
$$\hat w_t = X_t - \sum_{i=1}^{p}\hat\phi_i X_{t-i}, \quad\text{with}\quad \hat\sigma_w^2 = \sum_{t=p+1}^{T}\hat w_t^2/(T - 2p).$$

SLIDE 40

• Example 3: MA(1) (and invertible). $X_t = w_t - \theta w_{t-1} = (1 - \theta B)w_t$, so
$$w_t = \sum_{k=0}^{\infty}\theta^k X_{t-k}.$$
We make the approximation $X_0 = w_0 = 0$, and then $w_t = \sum_{k=0}^{t-1}\theta^k X_{t-k}$. Now
$$X^t_{t+l} = E\left[w_{t+l} - \theta w_{t+l-1}\mid\mathbf{w}^t\right] = \begin{cases}-\theta w_t, & l = 1, \\ 0, & l > 1,\end{cases}$$
with $w_t$ obtained from the preceding equation. This gives residuals
$$\hat w_t = \sum_{k=0}^{t-1}\hat\theta^k X_{t-k}, \quad\text{and}\quad \hat\sigma_w^2 = \sum_{t=1}^{T}\hat w_t^2/(T - 1).$$

SLIDE 41

Trivially, since $\psi_0 = 1$, $\psi_1 = -\theta$ and $\psi_k = 0$ for $k > 1$, we have
$$\sum_{k=0}^{l-1}\psi_k^2 = \begin{cases}1, & l = 1, \\ 1 + \theta^2, & l > 1.\end{cases}$$
The prediction intervals are
$$\begin{cases}-\hat\theta\hat w_t \pm z_{\alpha/2}\hat\sigma_w, & l = 1, \\ 0 \pm z_{\alpha/2}\hat\sigma_w\sqrt{1 + \hat\theta^2}, & l > 1.\end{cases}$$

• Students should write out the procedure for an invertible MA(q) model.

• Example 4: ARMA(1,1), stationary and invertible. In general, when there is an MA component we make the approximation $X_0 = w_0 = 0$. The model is $(1 - \phi B)X_t = (1 - \theta B)w_t$, i.e. $X_t = \phi X_{t-1} + w_t - \theta w_{t-1}$, leading to
$$X^t_{t+l} = \phi X^t_{t+l-1} + w^t_{t+l} - \theta w^t_{t+l-1} = \begin{cases}\phi X_t - \theta w_t, & l = 1, \\ \phi X^t_{t+l-1}, & l > 1.\end{cases}$$

SLIDE 42

To obtain a value for $w_t$ we write
$$w_t = (1 - \phi B)(1 - \theta B)^{-1}X_t = (1 - \phi B)\sum_{k=0}^{\infty}\theta^k B^k\cdot X_t = \left(1 + \sum_{k=1}^{\infty}\theta^{k-1}(\theta - \phi)B^k\right)X_t,$$
with approximation
$$w_t = X_t + \sum_{k=1}^{t-1}\theta^{k-1}(\theta - \phi)X_{t-k}.$$
For the forecast variance we reverse $\phi$ and $\theta$ in the above:
$$X_t = (1 - \theta B)(1 - \phi B)^{-1}w_t = \sum_{k=0}^{\infty}\psi_k w_{t-k}, \quad\text{with } \psi_k = \phi^{k-1}(\phi - \theta) \text{ (and } \psi_0 = 1);$$
thus
$$\mathrm{VAR}\left[X_{t+l} - X^t_{t+l}\right] = \sigma_w^2\sum_{k=0}^{l-1}\psi_k^2 = \sigma_w^2\left[1 + (\phi - \theta)^2\,\frac{1 - \phi^{2(l-1)}}{1 - \phi^2}\right].$$

SLIDE 43

Students should verify that obtaining the residuals from $\hat w_t = X_t - \hat X^{t-1}_t$ results in the same expression as substituting estimates into the expression for $w_t$ given above, i.e.
$$\hat w_t = X_t + \sum_{k=1}^{t-1}\hat\theta^{k-1}\left(\hat\theta - \hat\phi\right)X_{t-k}.$$
Then $\hat\sigma_w^2 = \sum_{t=2}^{T}\hat w_t^2/(T - 3)$.

• In general, in an ARMA(p,q) model, the assumption $X_0 = 0$ is necessary if we are to be able to calculate any of the residuals. The resulting expression - of the form $\hat w_t = \sum_{k=0}^{t-1}\hat\alpha_k X_{t-k}$ - could also be used to calculate the first $p$ residuals. So there is a trade-off: we could get numerical but approximate values for $\hat w_1,\dots,\hat w_p$, or we could just not bother and calculate the remaining residuals more accurately. Here we take the latter approach. This gives
$$\hat\sigma_w^2 = \sum_{t=p+1}^{T}\hat w_t^2/(T - 2p - q).$$

SLIDE 44

10. Lecture 10

• Estimation. One method is the method of moments, in which we take expressions relating parameters to expected values, replace the expected values by series averages, then solve for the unknown parameters.

— e.g. $E[X_t] = \mu$ becomes $T^{-1}\sum_{t=1}^{T}x_t = \hat\mu$.

— e.g. For an AR(1) model we could replace $\gamma(k)$ by the sample autocovariance $\hat\gamma(k)$ in the Yule-Walker equations, then solve them as before to get
$$\hat\gamma(0) = \frac{\hat\sigma_w^2}{1 - \hat\phi^2}, \quad \hat\gamma(1) = \hat\phi\hat\gamma(0), \quad\text{yielding}\quad \hat\phi = \hat\rho(1), \quad \hat\sigma_w^2 = \hat\gamma(0)\left(1 - \hat\phi^2\right).$$

SLIDE 45

Recall we previously used the (adjusted Maximum Likelihood) estimate
$$\hat\sigma^2_{w,\mathrm{MLE}} = \sum_{t=2}^{T}\hat w_t^2/(T - 2).$$
Students should show that the difference between these two estimates is of the order (i.e. a multiple of) $1/T$; in this sense the two estimates are asymptotically (i.e. as $T \to \infty$) equivalent.

— The same technique applied to the MA(1) model starts with $\rho(1) = -\theta/(1 + \theta^2)$ (so $|\rho(1)| < 1/2$ by invertibility: $|\theta| < 1$), then we solve $\hat\rho(1) = -\hat\theta/(1 + \hat\theta^2)$. If $|\hat\rho(1)| < 1/2$ there is a real root $\hat\theta$ with $|\hat\theta| < 1$ and we use it. (Otherwise $|\hat\theta| = 1$ and the estimated model is not invertible.) But even when $|\hat\theta| < 1$ the estimate can be quite inefficient (highly varied) relative to the MLE, which we consider next.
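A minimal sketch of both moment estimators (NumPy; the data are simulated and all names are mine): for the AR(1), $\hat\phi = \hat\rho(1)$; for the MA(1), solve $\hat\rho(1) = -\hat\theta/(1 + \hat\theta^2)$, i.e. the quadratic $\hat\rho(1)\hat\theta^2 + \hat\theta + \hat\rho(1) = 0$, taking the root with $|\hat\theta| < 1$ when $|\hat\rho(1)| < 1/2$.

```python
import numpy as np

def sample_acf1(x):
    xc = x - x.mean()
    return np.sum(xc[1:] * xc[:-1]) / np.sum(xc ** 2)

rng = np.random.default_rng(2)

# AR(1): phi_hat = rho_hat(1)
phi_true, T = 0.6, 2000
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + rng.normal()
print("AR(1) phi_hat =", round(sample_acf1(x), 3))

# MA(1): solve rho_hat(1) = -theta/(1 + theta^2)
theta_true = 0.5
w = rng.normal(size=T + 1)
y = w[1:] - theta_true * w[:-1]
r1 = sample_acf1(y)
if abs(r1) < 0.5:
    roots = np.roots([r1, 1.0, r1])          # r1*theta^2 + theta + r1 = 0
    theta_hat = roots[np.abs(roots) < 1][0].real
    print("MA(1) theta_hat =", round(theta_hat, 3))
else:
    print("No invertible moment estimate: |rho_hat(1)| >= 1/2")
```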

SLIDE 46

• Maximum Likelihood Estimation. We observe $\mathbf{x} = (x_1,\dots,x_T)'$; suppose the joint probability density function (pdf) is $f(\mathbf{x}\mid\boldsymbol\alpha)$ for a vector $\boldsymbol\alpha = (\alpha_1,\dots,\alpha_p)'$ of unknown parameters. E.g. if the $X_t$ are independent $N(\mu,\sigma^2)$ the joint pdf is
$$\prod_{t=1}^{T}\left\{\left(2\pi\sigma^2\right)^{-1/2}e^{-\frac{(x_t-\mu)^2}{2\sigma^2}}\right\} = \left(2\pi\sigma^2\right)^{-T/2}e^{-\frac{\sum_{t=1}^{T}(x_t-\mu)^2}{2\sigma^2}}.$$
When evaluated at the numerical data this is a function of $\boldsymbol\alpha$ alone, denoted $L(\boldsymbol\alpha\mid\mathbf{x})$ and known as the Likelihood function. The value $\hat{\boldsymbol\alpha}$ which maximizes $L(\boldsymbol\alpha\mid\mathbf{x})$ is known as the Maximum Likelihood Estimator (MLE). Intuitively, the MLE makes the observed data "most likely to have occurred".

— We put $l(\boldsymbol\alpha) = \ln L(\boldsymbol\alpha\mid\mathbf{x})$, the log-likelihood, and typically maximize it (equivalent to maximizing $L$)

SLIDE 47

by solving the likelihood equations
$$\dot l(\boldsymbol\alpha) = \frac{\partial}{\partial\boldsymbol\alpha}l(\boldsymbol\alpha) = \mathbf{0}, \quad\text{i.e.}\quad \frac{\partial}{\partial\alpha_j}l(\boldsymbol\alpha) = 0, \; j = 1,\dots,p.$$
The vector $\dot l(\boldsymbol\alpha)$ is called the gradient.

— The MLE has attractive large sample properties. With $\boldsymbol\alpha_0$ denoting the true value, we typically have that $\sqrt{T}(\hat{\boldsymbol\alpha} - \boldsymbol\alpha_0)$ has a limiting (as $T \to \infty$) normal distribution, with mean $\mathbf{0}$ and covariance matrix $C = I^{-1}(\boldsymbol\alpha_0)$, where $I(\boldsymbol\alpha_0)$ is the information matrix defined below.

— The role of the covariance matrix is that if a random vector $\mathbf{Y} = (Y_1,\dots,Y_p)'$ has covariance matrix $C$, then $\mathrm{COV}[Y_j, Y_k] = C_{jk}$; in particular $\mathrm{VAR}[Y_j] = C_{jj}$.

SLIDE 48

— The information matrix is given by
$$I(\boldsymbol\alpha_0) = \lim_{T\to\infty}\left\{\frac{1}{T}E\left[\dot l(\boldsymbol\alpha_0)\dot l(\boldsymbol\alpha_0)'\right]\right\};$$
a more convenient and equivalent form (for the kinds of models we will be working with) is
$$I(\boldsymbol\alpha_0) = \lim_{T\to\infty}\left\{\frac{1}{T}E\left[-\ddot l(\boldsymbol\alpha_0)\right]\right\},$$
where $\ddot l(\boldsymbol\alpha)$ is the Hessian matrix with $(j,k)$th element $\partial^2 l(\boldsymbol\alpha)/\partial\alpha_j\partial\alpha_k$.

— To apply these results we estimate $I(\boldsymbol\alpha_0)$ by $\hat I = I(\hat{\boldsymbol\alpha})$. Denote the $(j,k)$th element of $\hat I^{-1}$ by $\hat I^{jk}$. Then the normal approximation is that $\sqrt{T}\left(\hat\alpha_j - \alpha_j\right)$ is asymptotically normally distributed with mean zero and variance estimated by $\hat I^{jj}$, so that
$$\frac{\hat\alpha_j - \alpha_j}{s_j} \approx N(0, 1), \quad\text{where}\quad s_j = \sqrt{\frac{\hat I^{jj}}{T}}.$$

SLIDE 49

Then, e.g., the p-value for the hypothesis $H_0: \alpha_j = 0$ against a two-sided alternative is
$$p = 2P\left(Z > \left|\frac{\hat\alpha_j}{s_j}\right|\right),$$
supplied on the ASTSA printout.

• Example 1. AR(1) with a constant: $X_t = \phi_0 + \phi_1 X_{t-1} + w_t$. As is commonly done for AR models we will carry out an analysis conditional on $X_1$; i.e. we act as if $X_1$ is not random, but is the constant $x_1$. We can then carry out the following steps:

1. Write out the pdf of $X_2,\dots,X_T$.

2. From 1, the log-likelihood is $l(\boldsymbol\alpha) = \ln L(\boldsymbol\alpha\mid\mathbf{x})$, where $\boldsymbol\alpha = (\phi_0, \phi_1, \sigma_w^2)'$.

3. Maximize $l(\boldsymbol\alpha)$ to obtain the MLEs $(\hat\phi_0, \hat\phi_1, \hat\sigma_w^2)$.

4. Obtain the information matrix and its estimated inverse.

SLIDE 50

• Step 1. (Assuming normally distributed white noise.) Transformation of variables. If the pdf of $w_2,\dots,w_T$ is $g(\mathbf{w}\mid\boldsymbol\alpha)$ and we write the $w$'s in terms of the $X$'s, $w_t = X_t - \phi_0 - \phi_1 X_{t-1}$, then the pdf of $X_2,\dots,X_T$ is
$$f(\mathbf{x}\mid\boldsymbol\alpha) = g(\mathbf{w}\mid\boldsymbol\alpha)\,\left|\frac{\partial\mathbf{w}}{\partial\mathbf{x}}\right|_+,$$
where $\left(\frac{\partial\mathbf{w}}{\partial\mathbf{x}}\right)$ is the matrix of partial derivatives, with $\left(\frac{\partial\mathbf{w}}{\partial\mathbf{x}}\right)_{jk} = \frac{\partial w_j}{\partial x_k}$, and $|\cdot|_+$ is the absolute value of the determinant. On the right hand side $g(\mathbf{w}\mid\boldsymbol\alpha)$ is evaluated by replacing the $w$'s with their expressions in terms of the $x$'s. In this AR(1) example,
$$g(\mathbf{w}\mid\boldsymbol\alpha) = \left(2\pi\sigma_w^2\right)^{-(T-1)/2}e^{-\frac{\sum_{t=2}^{T}w_t^2}{2\sigma_w^2}}$$

SLIDE 51

and
$$\frac{\partial\mathbf{w}}{\partial\mathbf{x}} = \begin{pmatrix} 1 & & & \\ -\phi_1 & 1 & & \\ & \ddots & \ddots & \\ & & -\phi_1 & 1 \end{pmatrix},$$
with determinant $= 1$ (why?). Thus
$$f(\mathbf{x}\mid\boldsymbol\alpha) = \left(2\pi\sigma_w^2\right)^{-(T-1)/2}e^{-\frac{\sum_{t=2}^{T}(x_t - \phi_0 - \phi_1 x_{t-1})^2}{2\sigma_w^2}}.$$
This determinant will always $= 1$ if we can write $w_t$ as $X_t$ plus a function of $X_{t-1},\dots,X_1$. But (i) this can always be done in AR models, and (ii) in models with MA parts we assume invertibility plus "$X_0 = 0$", so that this can be done there as well. So for all models considered in this course, $|\partial\mathbf{w}/\partial\mathbf{x}|_+ = 1$.

SLIDE 52

• Step 2.
$$l(\boldsymbol\alpha) = \ln\left\{\left(2\pi\sigma_w^2\right)^{-(T-1)/2}e^{-\frac{\sum_{t=2}^{T}(x_t - \phi_0 - \phi_1 x_{t-1})^2}{2\sigma_w^2}}\right\}
= -\frac{T-1}{2}\ln\sigma_w^2 - \frac{\sum_{t=2}^{T}(x_t - \phi_0 - \phi_1 x_{t-1})^2}{2\sigma_w^2} + \text{const.}
= -\frac{T-1}{2}\ln\sigma_w^2 - \frac{S(\phi_0,\phi_1)}{2\sigma_w^2} + \text{const., say.}$$

SLIDE 53

11. Lecture 11

• Step 3. To maximize $l$ over $\phi_0, \phi_1$ we minimize $S(\phi_0,\phi_1)$. Consider a regression model
$$x_t = \phi_0 + \phi_1 x_{t-1} + \text{error}, \quad t = 2,\dots,T. \qquad (*)$$
In a regression model $y_i = \phi_0 + \phi_1 x_i + \text{error}$, $i = 1,\dots,n$, the least squares estimates of the slope and intercept minimize $\sum(y_i - \phi_0 - \phi_1 x_i)^2$ and are
$$\hat\phi_0 = \bar y - \hat\phi_1\bar x, \qquad \hat\phi_1 = \frac{\sum(x_i - \bar x)(y_i - \bar y)}{\sum(x_i - \bar x)^2}\;\left(= \frac{\hat\gamma_{YX}(0)}{\hat\gamma_X(0)}\right).$$
Thus the minimizers of $S$ are the LSEs in the regression model (*); they are obtained numerically by doing the regression, and formulas for them can (and should, by the students) be written out. Students should verify that $\hat\phi_1 \approx \hat\rho(1)$ and $\hat\phi_0 \approx \bar x\left(1 - \hat\phi_1\right)$.
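A minimal sketch of this conditional-likelihood fit (NumPy least squares; the data are simulated and all names are mine): regress $x_t$ on $(1, x_{t-1})$ for $t = 2,\dots,T$ and compare $\hat\phi_1$ with $\hat\rho(1)$ and $\hat\phi_0$ with $\bar x(1 - \hat\phi_1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
T, phi0, phi1 = 1000, 2.0, 0.7
x = np.zeros(T)
x[0] = phi0 / (1 - phi1)                        # start near the process mean
for t in range(1, T):
    x[t] = phi0 + phi1 * x[t - 1] + rng.normal()

# Conditional MLE = least squares regression of x_t on (1, x_{t-1}), t = 2..T
X = np.column_stack([np.ones(T - 1), x[:-1]])
y = x[1:]
(phi0_hat, phi1_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ np.array([phi0_hat, phi1_hat])
sigma2_hat = np.sum(resid ** 2) / (T - 3)        # (T-1) residuals minus 2 parameters

xc = x - x.mean()
rho1 = np.sum(xc[1:] * xc[:-1]) / np.sum(xc ** 2)
print(round(phi1_hat, 3), "vs rho_hat(1) =", round(rho1, 3))
print(round(phi0_hat, 3), "vs xbar*(1 - phi1_hat) =", round(x.mean() * (1 - phi1_hat), 3))
```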

SLIDE 54

The likelihood equation for $\sigma_w^2$ is
$$0 = \frac{\partial l}{\partial\sigma_w^2} = -\frac{T-1}{2\sigma_w^2} + \frac{S(\hat\phi_0,\hat\phi_1)}{2\sigma_w^4},$$
satisfied by
$$\hat\sigma_w^2 = \frac{S(\hat\phi_0,\hat\phi_1)}{T-1} = \frac{\sum_{t=2}^{T}\left(x_t - \hat\phi_0 - \hat\phi_1 x_{t-1}\right)^2}{T-1} = \frac{\sum_{t=2}^{T}\hat w_t^2}{T-1}.$$
A usual adjustment for bias is to replace $T-1$ by $T-3 = (T-1) - 2$ = # of residuals minus the number of parameters estimated.
SLIDE 55

• Step 4. Information matrix:
$$\dot l = \begin{pmatrix} -\frac{1}{2\sigma_w^2}\frac{\partial S}{\partial\phi_0} \\[4pt] -\frac{1}{2\sigma_w^2}\frac{\partial S}{\partial\phi_1} \\[4pt] -\frac{T-1}{2\sigma_w^2} + \frac{S}{2\sigma_w^4} \end{pmatrix}, \qquad
-\ddot l = \begin{pmatrix} \frac{1}{2\sigma_w^2}\frac{\partial^2 S}{\partial\phi_0\partial\phi_0} & \frac{1}{2\sigma_w^2}\frac{\partial^2 S}{\partial\phi_1\partial\phi_0} & -\frac{1}{2\sigma_w^4}\frac{\partial S}{\partial\phi_0} \\[4pt] * & \frac{1}{2\sigma_w^2}\frac{\partial^2 S}{\partial\phi_1\partial\phi_1} & -\frac{1}{2\sigma_w^4}\frac{\partial S}{\partial\phi_1} \\[4pt] * & * & -\frac{T-1}{2\sigma_w^4} + \frac{S}{\sigma_w^6} \end{pmatrix}.$$
Calculate
$$\frac{\partial S}{\partial\phi_0} = -2\sum_{t=2}^{T}(x_t - \phi_0 - \phi_1 x_{t-1}) = -2\sum_{t=2}^{T}w_t, \quad\text{with expectation } 0,$$
$$\frac{\partial S}{\partial\phi_1} = -2\sum_{t=2}^{T}x_{t-1}(x_t - \phi_0 - \phi_1 x_{t-1}) = -2\sum_{t=2}^{T}x_{t-1}w_t,$$

SLIDE 56

with expectation (why?) $-2\sum_{t=2}^{T}\mathrm{COV}[X_{t-1}, w_t] = 0$,
$$\frac{\partial^2 S}{\partial\phi_0\partial\phi_0} = 2(T-1), \quad\text{with expectation } 2(T-1),$$
$$\frac{\partial^2 S}{\partial\phi_1\partial\phi_0} = 2\sum_{t=2}^{T}x_{t-1}, \quad\text{with expectation } 2(T-1)\mu,$$
$$\frac{\partial^2 S}{\partial\phi_1\partial\phi_1} = 2\sum_{t=2}^{T}x_{t-1}^2, \quad\text{with expectation } 2(T-1)\left(\gamma(0) + \mu^2\right).$$
Then using $E[S] = E\left[\sum_{t=2}^{T}w_t^2\right] = (T-1)\sigma_w^2$,

SLIDE 57

we get
$$\frac{1}{T}E[-\ddot l] = \frac{T-1}{T}\begin{pmatrix} \frac{1}{\sigma_w^2} & \frac{\mu}{\sigma_w^2} & 0 \\[2pt] * & \frac{\gamma(0)+\mu^2}{\sigma_w^2} & 0 \\[2pt] * & * & -\frac{1}{2\sigma_w^4} + \frac{\sigma_w^2}{\sigma_w^6} = \frac{1}{2\sigma_w^4} \end{pmatrix}
\;\xrightarrow{\,T\to\infty\,}\;
\frac{1}{\sigma_w^2}\begin{pmatrix} 1 & \mu & 0 \\ \mu & \gamma(0)+\mu^2 & 0 \\ 0 & 0 & \frac{1}{2\sigma_w^2} \end{pmatrix}
= \frac{1}{\sigma_w^2}\begin{pmatrix} A & \mathbf{0} \\ \mathbf{0}' & 1/(2\sigma_w^2) \end{pmatrix} = I(\boldsymbol\alpha_0).$$
The inverse is
$$I^{-1}(\boldsymbol\alpha_0) = \sigma_w^2\begin{pmatrix} A^{-1} & \mathbf{0} \\ \mathbf{0}' & 2\sigma_w^2 \end{pmatrix}, \quad\text{where}\quad A^{-1} = \frac{1}{\gamma(0)}\begin{pmatrix} \gamma(0)+\mu^2 & -\mu \\ -\mu & 1 \end{pmatrix}.$$
Thus, e.g., the normal approximation for $\hat\phi_1$ is
$$\hat\phi_1 - \phi_1 \approx N\!\left(0,\; \frac{\sigma_w^2}{T\gamma(0)} = \frac{1-\phi_1^2}{T}\right),$$

SLIDE 58

with (squared) standard error $s^2(\hat\phi_1) = \dfrac{1-\hat\phi_1^2}{T}$ and $\dfrac{\hat\phi_1 - \phi_1}{s(\hat\phi_1)} \approx N(0, 1)$. A $100(1-\alpha)\%$ confidence interval is $\hat\phi_1 \pm z_{\alpha/2}\,s(\hat\phi_1)$.

• In general, for an AR(p): $X_t = \phi_0 + \sum_{i=1}^{p}\phi_i X_{t-i} + w_t$ we minimize
$$S(\boldsymbol\phi) = \sum_{t=p+1}^{T}\left(x_t - \phi_0 - \sum_{i=1}^{p}\phi_i x_{t-i}\right)^2$$
by fitting a regression model $x_t = \phi_0 + \sum_{i=1}^{p}\phi_i x_{t-i} + \text{error}$ for $t = p+1,\dots,T$. The resulting LSEs are $\hat{\boldsymbol\phi}$ and the associated mean square of the residuals
slide-59
SLIDE 59

93 is ˆ σ2

w =

S(ˆ

φ)

T − 2p − 1. The large-sample standard errors are obtained by ASTSA and appear on the printout. (Time do- main > Autoregression or Time domain > ARIMA; then answer the question that appear. More on this later.)

  • Example 2.

ARMA(p,q). Model is Xt −

p

X

j=1

φjXt−j = wt −

q

X

k=1

θkwt−k. Assume that Xt = wt = 0 and solve successively for the wt’s in terms of the Xt’s: wt = Xt −

p

X

j=1

φjXt−j +

q

X

k=1

θkwt−k; w1 = X1, w2 = X2 − φ1X1 + θ1w1, etc.

SLIDE 60

In this way we write $(w_{p+1},\dots,w_T)$ in terms of $(x_{p+1},\dots,x_T)$. The Jacobian is again $= 1$ and so
$$f(\mathbf{x}\mid\boldsymbol\alpha) = \left(2\pi\sigma_w^2\right)^{-(T-p)/2}e^{-\frac{S(\boldsymbol\phi,\boldsymbol\theta)}{2\sigma_w^2}},$$
where $S(\boldsymbol\phi,\boldsymbol\theta) = \sum_{t=p+1}^{T}w_t^2(\boldsymbol\phi,\boldsymbol\theta)$. (It is usual to omit the first $p$ $w_t$'s, since they are based on so little data and on the assumption mentioned above. Equivalently, we are conditioning on them.) Now $S(\boldsymbol\phi,\boldsymbol\theta)$ is minimized numerically to obtain the MLEs $\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta}$. The MLE of $\sigma_w^2$ is obtained from the likelihood equation and is $S(\hat{\boldsymbol\phi},\hat{\boldsymbol\theta})/(T-p)$; it is often adjusted for bias to give
$$\hat\sigma_w^2 = \frac{S(\hat{\boldsymbol\phi},\hat{\boldsymbol\theta})}{T - 2p - q}.$$

SLIDE 61

• The matrix
$$I(\boldsymbol\alpha_0) = \lim_{T\to\infty}\left\{\frac{1}{T}E\left[-\ddot l(\boldsymbol\alpha_0)\right]\right\}$$
is sometimes estimated by the "observed information matrix" $\frac{1}{T}\left(-\ddot l(\hat{\boldsymbol\alpha})\right)$ evaluated at the data $\{x_t\}$. This is numerically simpler.

• Students should write out the procedure for an MA(1) model.

• The numerical calculations also rely on a modification of least squares regression. Gauss-Newton algorithm: we are to minimize
$$S(\boldsymbol\psi) = \sum_t w_t^2(\boldsymbol\psi),$$
where $\boldsymbol\psi$ is a $p$-dimensional vector of parameters. The idea is to set up a series of least squares regressions converging to the solution. First choose an initial value $\boldsymbol\psi_0$. (ASTSA will put all components $= .1$ if ".1 guess" is chosen. Otherwise, it will compute method of moments estimates, by equating the theoretical and sample ACF and PACF values. This takes longer.)

SLIDE 62

• Now expand $w_t(\boldsymbol\psi)$ around $\boldsymbol\psi_0$ by the Mean Value Theorem:
$$w_t(\boldsymbol\psi) \approx w_t(\boldsymbol\psi_0) + \dot w_t'(\boldsymbol\psi_0)(\boldsymbol\psi - \boldsymbol\psi_0) \qquad (1)$$
$$= \text{"}y_t - \mathbf{z}_t'\boldsymbol\beta\text{"},$$
where $y_t = w_t(\boldsymbol\psi_0)$, $\mathbf{z}_t = -\dot w_t(\boldsymbol\psi_0)$, $\boldsymbol\beta = \boldsymbol\psi - \boldsymbol\psi_0$. Now
$$S(\boldsymbol\psi) \approx \sum_t\left(y_t - \mathbf{z}_t'\boldsymbol\beta\right)^2$$
is minimized by regressing $\{y_t\}$ on $\{\mathbf{z}_t\}$ to get the LSE $\hat{\boldsymbol\beta}_1$. We now set
$$\boldsymbol\psi_1 = \hat{\boldsymbol\beta}_1 + \boldsymbol\psi_0,$$
expand around $\boldsymbol\psi_1$ (i.e. replace $\boldsymbol\psi_0$ by $\boldsymbol\psi_1$ in (1)), and obtain a revised estimate $\hat{\boldsymbol\beta}_2$. Iterate to convergence.

SLIDE 63

12. Lecture 12

• A class of nonstationary models is obtained by taking differences, and assuming that the differenced series is ARMA(p,q): $\nabla X_t = X_t - X_{t-1} = (1 - B)X_t$, $\nabla^2 X_t = \nabla(\nabla X_t) = (1 - B)^2 X_t$, etc. We say $\{X_t\}$ is ARIMA(p,d,q) ("Integrated ARMA") if $\nabla^d X_t$ is ARMA(p,q). If so,
$$\phi(B)(1 - B)^d X_t = \theta(B)w_t$$
for an AR(p) polynomial $\phi(B)$ and an MA(q) polynomial $\theta(B)$. Since $\phi(B)(1 - B)^d$ has roots on the unit circle, $\{X_t\}$ cannot be stationary. The differenced series $\{\nabla^d X_t\}$ is the one we analyze.

SLIDE 64

• It may happen that the dependence of a series on its past is strongest at multiples of the sampling unit, e.g. monthly economic data may exhibit strong quarterly or annual trends. To model this, define seasonal AR(P) and MA(Q) characteristic polynomials
$$\Phi(B^s) = 1 - \Phi_1 B^s - \Phi_2 B^{2s} - \cdots - \Phi_P B^{Ps}, \qquad \Theta(B^s) = 1 - \Theta_1 B^s - \Theta_2 B^{2s} - \cdots - \Theta_Q B^{Qs}.$$
A seasonal ARMA(P,Q) model, with season $s$, is defined by $\Phi(B^s)X_t = \Theta(B^s)w_t$. This can be combined with the hierarchy of ordinary ARMA models, and with differencing, to give the full ARIMA(p,d,q)×(P,D,Q)$_s$ model defined by
$$\Phi(B^s)\phi(B)(1 - B^s)^D(1 - B)^d X_t = \Theta(B^s)\theta(B)w_t.$$
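The course fits these models in ASTSA; for readers working in Python, a rough modern stand-in is the statsmodels ARIMA class, which accepts the same (p,d,q)×(P,D,Q,s) specification (note that its MA and seasonal-MA coefficients are reported with the opposite sign convention, $\theta(B) = 1 + \theta_1 B + \cdots$). A minimal sketch on simulated monthly data; the data and all names are my own, not from the course.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Simulated monthly series with trend and an annual pattern (stand-in for a real data set)
rng = np.random.default_rng(4)
t = np.arange(240)
y = pd.Series(10 + 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=240))

# ARIMA(0,1,1) x (0,1,1)_12, the "airline" model of the example on the next slide
model = ARIMA(y, order=(0, 1, 1), seasonal_order=(0, 1, 1, 12))
fit = model.fit()
print(fit.summary())

# Forecasts and 95% prediction intervals, 12 steps ahead
fc = fit.get_forecast(steps=12)
print(fc.predicted_mean)
print(fc.conf_int(alpha=0.05))
```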

SLIDE 65

— Example: the ARIMA(0,1,1)×(0,1,1)$_{12}$ model has $\Phi(B^s) = 1$, $\phi(B) = 1$, $d = D = 1$, $\Theta(B^s) = 1 - \Theta B^{12}$, $\theta(B) = 1 - \theta B$. Thus
$$(1 - B^{12})(1 - B)X_t = \left(1 - \Theta B^{12}\right)(1 - \theta B)w_t.$$
Expanding:
$$X_t = X_{t-1} + X_{t-12} - X_{t-13} + w_t - \theta w_{t-1} - \Theta w_{t-12} + \Theta\theta w_{t-13}.$$
This model often arises with monthly economic data.

— The analysis of the ACF and PACF proceeds along the same lines as for the previous models (see below).

SLIDE 66

• Choosing an appropriate model. Some guiding properties of the ACF/PACF:

— Nonstationarity: ACF drops off very slowly (a root of the AR characteristic equation with $|B|$ near 1 will do this too); PACF large (in absolute value) at 1 (but "only at 1" could indicate AR(1)). Try taking differences $\nabla^d X_t$, $d = 1, 2$. Rarely is $d > 2$. Don't be too hasty to take differences; try ARMA models first.

— Seasonal nonstationarity: ACF zero except at lags $s, 2s, \dots$ and decays slowly, or PACF very large at $s$. Try $\nabla_s^d X_t = (1 - B^s)^d X_t$.

— AR(p) behaviour: PACF zero for $m > p$.

— Seasonal AR(P): PACF zero except at $m = s, 2s, \dots, Ps$.

— MA(q): ACF zero for $m > q$.

— Seasonal MA(Q): ACF zero except at $m = s, 2s, \dots, Qs$.

SLIDE 67

• Note: Sometimes one fits an MA(q) model, or an ARMA(p,q) model with $q > 0$, and finds that most residuals are of the same sign. This is generally a sign that $\mu_X \ne 0$ (recall an MA(q) has mean zero). A remedy is to fit a constant as well (possible in ASTSA only in the AR option) or to subtract the average $\bar x$ from the series before the analysis is carried out.

• Principle of Parsimony: We seek the simplest model that is adequate. We can always "improve" the fit by throwing in extra terms, but then the model might only fit these data well.

SLIDE 68

• See Figures 2.4 - 2.11. Example: U.S. Federal Reserve Board Production Index, an index of economic productivity. Plots of the data and ACF/PACF clearly indicate nonstationarity. The ACF of $\nabla X_t$ indicates seasonal ($s = 12$) nonstationarity. The ACF and PACF of $\nabla_{12}\nabla X_t$ show possible models ARIMA(p = 1-2, d = 1, q = 1-4)×(P = 2, D = 1, Q = 1)$_{s=12}$. The ARIMA Search facility, using BIC, picks out ARIMA(1,1,0)×(0,1,1)$_{12}$. Note $q = P = 0$; the other AR and MA features seem to have accounted for the seeming MA and seasonal AR(2) behaviour.

SLIDE 69

Figure 2.4. U.S. Federal Reserve Board Production Index (frb.asd in ASTSA). Non-stationarity of mean is obvious.

SLIDE 70

Figure 2.5. ACF decays slowly, PACF spikes at m = 1. Both indicate nonstationarity. Figure 2.6. ∇Xt; nonstationary variance exhibited.

SLIDE 71

Figure 2.7. ∇Xt; ACF shows seasonal nonstationarity. Figure 2.8. ∇12∇Xt. Peak in ACF at m = 12 indicates seasonal (s = 12) MA(1). PACF indicates AR(1) or AR(2) and (possibly) seasonal AR(2).

SLIDE 72

Figure 2.9. Time domain > ARIMA search applied to Xt. 3 × 5 × 3 × 3 = 135 possible models to be fitted.

SLIDE 73

Figure 2.10. ARIMA search with BIC criterion picks out ARIMA(1,1,0)×(0,1,1)12. The (default) AICc criterion picks out ARIMA(0,1,4)×(2,1,1)12. One could argue for either one.

SLIDE 74

Figure 2.11. Output for "best" model. Save the residuals for further analysis.

SLIDE 75

• There are several "information criteria". All seek to minimize the residual variation while imposing penalties for nonparsimonious models. Let $K$ be the number of AR and MA parameters fitted, and let $\hat\sigma_w^2(K)$ be the estimated variance of the residual noise. Then
$$\mathrm{AIC}(K) = \ln\hat\sigma_w^2(K) + \frac{2K}{T},$$
$$\mathrm{AICc}(K) = \ln\hat\sigma_w^2(K) + \frac{T + K}{T - K - 2} \quad\text{(small sample modification)},$$
$$\mathrm{BIC}(K) = \ln\hat\sigma_w^2(K) + \frac{K\ln T}{T}, \quad\text{etc.}$$
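A small numerical sketch of these criteria (pure Python/NumPy; the residual variances below are made-up illustrative values, not results from the course data): compute AIC, AICc, and BIC for a few candidate values of K and pick the minimizer.

```python
import numpy as np

def aic(sigma2, K, T):  return np.log(sigma2) + 2 * K / T
def aicc(sigma2, K, T): return np.log(sigma2) + (T + K) / (T - K - 2)
def bic(sigma2, K, T):  return np.log(sigma2) + K * np.log(T) / T

T = 372
# Hypothetical residual variances for candidate models with K = 1..4 ARMA parameters
candidates = {1: 1.95, 2: 1.80, 3: 1.78, 4: 1.77}
for K, s2 in candidates.items():
    print(K, round(aic(s2, K, T), 4), round(aicc(s2, K, T), 4), round(bic(s2, K, T), 4))
```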

• See Figures 2.12 - 2.15. Residual analysis. Save the residuals. Plot their ACF/PACF; none of the values should be significant (in principle!). These plots were examined but are not included here; all satisfactory. Also plot residuals against $\nabla_{12}\nabla X_{t-1}$; they should be approximately uncorrelated and not exhibit any patterns. (A consequence of the fact that $Y - E[Y\mid X]$ is uncorrelated with $X$ is that $X_t - X^{t-1}_t$ is uncorrelated

SLIDE 76

with $X_s$ for $s < t$; i.e., except for the fact that the residuals use estimated parameters, they are uncorrelated with the predictors. This assumes stationarity, so here we use the differenced series.) Click on Graph > Residual tests. Figure 2.12. Residuals vs. time.

SLIDE 77

Figure 2.13. Store the residuals ŵt = {ŵ15, ..., ŵ372} and the differenced predictors ∇12∇Xt = {∇12∇X15, ..., ∇12∇X372} (see Transform > Take a subset). Then use Graph > 2D Scatterplot and "detail view" with the ŵt as series 1 and ∇12∇Xt as series 2, and "lag = 1" to get this plot. Use "Full view" - or merely double click on this plot - to get plots at a variety of lags, as described in the ASTSA manual.

SLIDE 78

Figure 2.14. Graphical output from Graph > Residual tests.

SLIDE 79

Figure 2.15. Printed output from Graph > Residual tests.

SLIDE 80

• Cumulative spectrum test. We will see later that if the residuals $\{\hat w_t\}$ are white noise, then their "spectrum" $f(\nu)$ is constantly equal to the variance, i.e. is 'flat'. Then the integrated, or cumulative, spectrum is linear with slope = variance; here it is divided by $\hat\sigma_w^2/2$ and so should be linear with a slope of 2. Accounting for random variation, it should at least remain in a band around a straight line through the origin, with a slope of 2. The graphical output confirms this and the printed output gives a non-significant p-value to the hypothesis of a flat spectrum.

SLIDE 81

13. Lecture 13

• Box-Pierce test. Under the hypothesis of whiteness we expect $\hat\rho_w(m)$ to be small in absolute value for all $m$; a test can be based on
$$Q = T(T + 2)\sum_{m=1}^{M}\frac{\hat\rho_w^2(m)}{T - m},$$
which is approximately $\sim\chi^2_{M-K}$ under the null hypothesis. The p-value is calculated and reported for $M - K = 1, 20$.

• Fluctuations test (= Runs test). Too few or too many fluctuations (i.e. changes from an upward trend to a downward trend or vice versa) constitute evidence against randomness. E.g. an AR(1) ($\rho(1) = \phi$) with $\phi$ near $+1$ will have few fluctuations; $\phi$ near $-1$ will result in many.

SLIDE 82

• Normal scores (Q-Q) test. The quantiles of a distribution $F = F_w$ are the values $F^{-1}(q)$, i.e. the values below which $w$ lies with probability $q$. The sample versions are the order statistics $w_{(1)} < w_{(2)} < \cdots < w_{(T)}$: the probability of a value $w$ falling at or below $w_{(t)}$ is estimated by $t/T$, so $w_{(t)}$ can be viewed as the $(t/T)$th sample quantile. If the residuals are normal then a plot of the sample quantiles against the standard normal quantiles should be linear, with intercept equal to the mean and slope equal to the standard deviation (this follows from $F^{-1}(q) = \mu + \sigma\Phi^{-1}(q)$ if $F$ is the $N(\mu,\sigma^2)$ df; students should verify this identity). The strength of the linearity is measured by the correlation between the two sets of quantiles; values too far below 1 lead to rejection of the hypothesis of normality of the white noise.
SLIDE 83

• The residuals from the more complex model chosen by AICc looked no better.

• A log or square root transformation of the original data sometimes improves normality; logging the data made no difference in this case.

• In contrast, if we fit an ARIMA(0,1,0)×(0,1,1)$_{12}$, then the ACF/PACF of the residuals clearly shows the need for the AR(1) component - see Figures 2.16, 2.17. Note also that the largest residual is now at t = 324 - adding the AR term has explained the drop at this point to such an extent that it is now less of a problem than the smaller drop at t = 128.

SLIDE 84

Figure 2.16. ACF and PACF of residuals from the ARIMA(0,1,0)×(0,1,1)12 fit.

SLIDE 85

Figure 2.17. Residual tests for the ARIMA(0,1,0)×(0,1,1)12 fit.

SLIDE 86

Forecasting in the FRB example. The fitted model is
$$(1 - \phi B)\nabla\nabla_{12}X_t = \left(1 - \Theta B^{12}\right)w_t; \qquad (1)$$
this is expanded as
$$X_t = (1 + \phi)X_{t-1} - \phi X_{t-2} + X_{t-12} - (1 + \phi)X_{t-13} + \phi X_{t-14} + w_t - \Theta w_{t-12}. \qquad (2)$$
Conditioning on $\mathbf{X}^t$ gives
$$X^t_{t+l} = (1 + \phi)X^t_{t+l-1} - \phi X^t_{t+l-2} + X^t_{t+l-12} - (1 + \phi)X^t_{t+l-13} + \phi X^t_{t+l-14} + w^t_{t+l} - \Theta w^t_{t+l-12}.$$
We consider the case $1 \le l \le 12$ only; $l = 13$ is left as an exercise. Note that the model is not stationary but seems to be invertible (since $|\hat\Theta| = .6962 < 1$). Thus $w_t$ can be expressed in terms of $\mathbf{X}^t$ and, if we assume $w_0 = X_0 = 0$, we can express $X_t$ in terms of $\mathbf{w}^t$ by iterating (2). Then conditioning on $\mathbf{X}^t$ is equivalent to conditioning on $\mathbf{w}^t$. Thus $X^t_{t+l-k} = X_{t+l-k}$ for $k = 12, 13, 14$ and $w^t_{t+l} = 0$:
$$X^t_{t+l} = (1 + \phi)X^t_{t+l-1} - \phi X^t_{t+l-2} + X_{t+l-12} - (1 + \phi)X_{t+l-13} + \phi X_{t+l-14} - \Theta w^t_{t+l-12}.$$

SLIDE 87

These become
$$l = 1:\quad X^t_{t+1} = (1 + \phi)X_t - \phi X_{t-1} + X_{t-11} - (1 + \phi)X_{t-12} + \phi X_{t-13} - \Theta w_{t-11}, \qquad (3)$$
$$l = 2:\quad X^t_{t+2} = (1 + \phi)X^t_{t+1} - \phi X_t + X_{t-10} - (1 + \phi)X_{t-11} + \phi X_{t-12} - \Theta w_{t-10},$$
$$3 \le l \le 12:\quad \text{from the previous forecasts and } w_{t-9}, \dots, w_t.$$
To get $w_t, \dots, w_{t-11}$ we write (1) as
$$w_t = \Theta w_{t-12} + \left\{X_t - \left[(1 + \phi)X_{t-1} - \phi X_{t-2} + X_{t-12} - (1 + \phi)X_{t-13} + \phi X_{t-14}\right]\right\} = \Theta w_{t-12} + f_t, \text{ say,} \qquad (4)$$
and calculate successively, using the assumption $w_0 = X_0 = 0$: $w_1 = f_1$, $w_2 = f_2$, ..., $w_{12} = f_{12}$, $w_{13} = \Theta w_1 + f_{13}$, $w_{14} = \Theta w_2 + f_{14}$, etc.

SLIDE 88

To get the residuals $\hat w_t = X_t - \hat X^{t-1}_t$, put $t - 1$ in place of $t$ in (3):
$$X^{t-1}_t = (1 + \phi)X_{t-1} - \phi X_{t-2} + X_{t-12} - (1 + \phi)X_{t-13} + \phi X_{t-14} - \Theta w_{t-12},$$
so $\hat w_t = X_t - \hat X^{t-1}_t$ is just (4) with all parameters replaced by estimates, and
$$\hat\sigma_w^2 = \sum_{t=15}^{T}\hat w_t^2/(T - 16).$$
(The first 14 residuals employ the assumption $w_0 = X_0 = 0$ and are not used.) The forecast variances and prediction intervals require us to write $X_t = \psi(B)w_t$, and then
$$\mathrm{PI} = \hat X^t_{t+l} \pm z_{\alpha/2}\,\hat\sigma_w\sqrt{\sum_{0 \le k < l}\hat\psi_k^2}.$$
The model is not stationary and so the coefficients of $\psi(B)$ will not be absolutely summable; however, under the assumption that $w_0 = 0$, only finitely many of the $\psi_k$ are needed. Then in (2),
$$\left\{(1 - \phi B)(1 - B)\left(1 - B^{12}\right)\cdot\left(1 + \psi_1 B + \psi_2 B^2 + \cdots + \psi_k B^k + \cdots\right)\right\}w_t = \left(1 - \Theta B^{12}\right)w_t,$$
SLIDE 89

so
$$\left(1 - (1 + \phi)B + \phi B^2 - B^{12} + (1 + \phi)B^{13} - \phi B^{14}\right)\cdot\left(1 + \psi_1 B + \psi_2 B^2 + \cdots + \psi_k B^k + \cdots\right) = 1 - \Theta B^{12}.$$
For $1 \le k < 12$ the coefficient of $B^k$ is 0 on the RHS; on the LHS it is $\psi_1 - (1 + \phi)$ ($k = 1$) and $\psi_k - (1 + \phi)\psi_{k-1} + \phi\psi_{k-2}$ ($k > 1$). Thus
$$\psi_0 = 1, \quad \psi_1 = 1 + \phi, \quad \psi_k = (1 + \phi)\psi_{k-1} - \phi\psi_{k-2}, \quad k = 2, \dots, 11.$$

• See Figures 2.18 - 2.20.
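A small sketch of this recursion (NumPy; the value phi = 0.38 and sigma_w = 1 are arbitrary stand-ins for the estimates from the fit, not the actual fitted values): generate psi_1, ..., psi_11 and the resulting one- to twelve-step forecast standard errors.

```python
import numpy as np

def frb_psi_weights(phi, K=11):
    """psi_0..psi_K for (1 - phi B)(1 - B)(1 - B^12) X_t = (1 - Theta B^12) w_t, valid for k < 12."""
    psi = np.zeros(K + 1)
    psi[0] = 1.0
    psi[1] = 1.0 + phi
    for k in range(2, K + 1):
        psi[k] = (1.0 + phi) * psi[k - 1] - phi * psi[k - 2]
    return psi

phi_hat, sigma_w = 0.38, 1.0                  # illustrative values only
psi = frb_psi_weights(phi_hat)
se = sigma_w * np.sqrt(np.cumsum(psi ** 2))    # l-step forecast standard errors, l = 1..12
print(np.round(psi, 4))
print(np.round(se, 4))
```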
SLIDE 90

Figure 2.18. Time domain > ARIMA, to fit just one model. (See also Time domain > Autoregression.) Figure 2.19. Forecasts up to 12 steps ahead.

SLIDE 91

Figure 2.20. Graphical output, including forecasts.

SLIDE 92

14. Lecture 14 (Review)

Recall the SOI and Recruits series. We earlier attempted to predict Recruits from SOI by regressing Recruits on lagged SOI ($m = 3, \dots, 12$): 11 parameters, including the intercept. See Figure 2.21. Figure 2.21. Recruits: data and predictions from a regression on SOI lagged by $m = 3, \dots, 12$ months. $R^2 = 67\%$.

SLIDE 93

Let's look for a better, and perhaps more parsimonious, model. Put $X_t = \nabla_{12}\mathrm{SOI}_t$, $Y_t = \nabla_{12}\mathrm{Recruits}_t$ (both re-indexed: $t = 1, \dots, 441$); this is to remove an annual trend, and to make this example more interesting. Both look stationary. (What do we look for in order to make this statement?) Consider the CCF, with $Y_t$ as input and $X_t$ as output (Figure 2.22). Figure 2.22. $\mathrm{COV}[\nabla_{12}\mathrm{SOI}, \nabla_{12}\mathrm{RECRUITS}]$ is largest at lags $m = -8, -9, -10$. The largest (absolute) values of $\mathrm{COV}[X_{t+m}, Y_t] = \mathrm{COV}[X_t, Y_{t-m}]$

SLIDE 94

are at $m = -8, -9, -10$, indicating a linear relationship between $X_t$ and $Y_{t+8}, Y_{t+9}, Y_{t+10}$. Thus we might regress $Y_{t+10}$ on $Y_{t+9}$, $Y_{t+8}$ and $X_t$; equivalently $Y_t$ on $Y_{t-1}$, $Y_{t-2}$, $X_{t-10}$:
$$Y_t = \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 X_{t-10} + Z_t, \quad (t = 11, \dots, 441)$$
for a series $\{Z_t\}$ following a time series model, and being white noise if we're lucky. See Figure 2.23. Figure 2.23. Data and regression fit, $Y_t$ on $Y_{t-1}, Y_{t-2}, X_{t-10}$. $R^2 = 90\%$.

SLIDE 95

Multiple regression on Y(t)
AICc = 6.07724   Variance = 158.450   df = 428   R2 = 0.8995

predictor   coef      st. error   t-ratio   p-value
beta(1)      1.3086    .0445       29.4389    .000
beta(2)      -.4199    .0441       -9.5206    .000
beta(3)     -4.1032   1.7181       -2.3883    .017

Figure 2.24. ACF and PACF for residual series {Zt}.
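A minimal sketch of this lagged regression (NumPy least squares). The SOI/Recruits data are not included here, so the sketch runs on simulated stand-in series whose coefficients are made-up values of mine, not the fitted ones; only the regression mechanics mirror the slide.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 441
x = rng.normal(size=n)                        # stand-in for X_t (seasonally differenced SOI)
y = np.zeros(n)
for t in range(2, n):                         # stand-in for Y_t with assumed dynamics
    lag_x = x[t - 10] if t >= 10 else 0.0
    y[t] = 1.3 * y[t - 1] - 0.4 * y[t - 2] - 4.0 * lag_x + rng.normal(scale=12.0)

# Regress Y_t on Y_{t-1}, Y_{t-2}, X_{t-10} for t = 11, ..., 441 (0-based: 10, ..., n-1)
t_idx = np.arange(10, n)
D = np.column_stack([y[t_idx - 1], y[t_idx - 2], x[t_idx - 10]])
beta, *_ = np.linalg.lstsq(D, y[t_idx], rcond=None)
z = y[t_idx] - D @ beta                       # residual series Z_t, to be modelled next
print(np.round(beta, 4))
print("residual variance:", round(z.var(ddof=3), 2))
```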

SLIDE 96

The residual series $\{Z_t\}$ still exhibits some trends - see Figure 2.24 - seasonal ARMA(4,1)$_{12}$? ARIMA Search, with AICc, selects ARMA(1,1)$_{12}$. With BIC, ARMA(1,2)$_{12}$ is selected. Each of these was fitted. Both failed the cumulative spectrum test, and the Box-Pierce test with $M - K = 1$, although the p-values at $M - K = 20$ were non-significant. (What are these, and what do they test for? What does "p-value" mean?) This suggested adding an AR(1) term to the simpler of the two previous models. In terms of the backshift operator (what is it? how are these polynomials manipulated?):
$$(1 - \phi B)\left(1 - \Phi B^{12}\right)Z_t = \left(1 - \Theta B^{12}\right)w_t.$$

SLIDE 97

ARIMA(1,0,0)x(1,0,1)x12 from Z(t)
AICc = 5.45580   variance = 85.0835   d.f. = 415
Start values derived from ACF and PACF

predictor   coef      st. error   t-ratio    p-value
AR(1)        -.0852    .04917      -1.7327     .083
SAR(1)       -.1283    .05030      -2.5516     .011
SMA(1)        .8735    .02635      33.1524     .000

(1 + .09B) (1 + .13B12) Z(t) = (1 - .87B12) w(t)

See Figure 2.25. All residual tests were passed, with the exception of the Normal Scores test (what does this mean? how does this test work?).

SLIDE 98

Figure 2.25. ACF and PACF of residuals from an ARIMA(1,0,0)×(1,0,1)$_{12}$ fit to $Z_t$.

Note that:

• $\{Z_t\}$ seems to be linear, hence stationary. (What does linearity mean, and how can we make this claim? How can we then represent $Z_t$ as a linear series $Z_t = \alpha(B)w_t = \sum_{k=0}^{\infty}\alpha_k w_{t-k}$?)

• $\{Z_t\}$ seems to be invertible. (What does invertibility mean, and how can we make this claim? Representation?)

SLIDE 99

The polynomial $\beta(B) = 1 - \beta_1 B - \beta_2 B^2$ has zeros $B = 1.34, 1.77$. This yields the additional representation
$$Y_t = \frac{\beta_3 B^{10}}{\beta(B)}X_t + \frac{\alpha(B)}{\beta(B)}w_t = \beta_3\gamma(B)B^{10}X_t + \delta(B)w_t,$$
with $1/\beta(B)$ expanded as a series $\gamma(B)$ and $\delta(B) = \alpha(B)/\beta(B)$. Assume $\{w_t\}$ independent of $\{X_t, Y_t\}$.

SLIDE 100

Predictions: You should be sufficiently familiar with the prediction theory covered in class that you can follow the following development, although I wouldn't expect you to come up with something like this on your own. (The point of knowing the theory is so that you can still do something sensible when the theory you know doesn't quite apply!) The best forecast of $Y_{t+l}$ is $\hat Y^t_{t+l} = E[Y_{t+l}\mid\mathbf{Y}^t]$ (estimated), in the case of a single series $\{Y_t\}$. (Why? You should be able to apply the Double Expectation Theorem so as to derive this result. And what does "best" mean, in this context?)

SLIDE 101

If we think of the data now as consisting of the union of the series $\{X_t, Y_t\}$, then it follows that the predictions are
$$\hat Y^t_{t+l} = E\left[Y_{t+l}\mid\mathbf{X}^t, \mathbf{Y}^t\right] = E\left[\beta_1 Y_{t+l-1} + \beta_2 Y_{t+l-2}\mid\mathbf{X}^t, \mathbf{Y}^t\right] + E\left[\beta_3 X_{t+l-10}\mid\mathbf{X}^t, \mathbf{Y}^t\right] + E\left[Z_{t+l}\mid\mathbf{X}^t, \mathbf{Y}^t\right]$$
$$= \beta_1\hat Y^t_{t+l-1} + \beta_2\hat Y^t_{t+l-2} + \beta_3\hat X^t_{t+l-10} + \sum_{k\ge l}\alpha_k w_{t+l-k}.$$
Upon estimating parameters these are
$$\hat Y^t_{t+l} = \begin{cases}\hat\beta_1 Y_t + \hat\beta_2 Y_{t-1} + \hat\beta_3 X_{t-9}, & l = 1, \\ \hat\beta_1\hat Y^t_{t+1} + \hat\beta_2 Y_t + \hat\beta_3 X_{t-8}, & l = 2, \\ \hat\beta_1\hat Y^t_{t+l-1} + \hat\beta_2\hat Y^t_{t+l-2} + \hat\beta_3 X_{t+l-10}, & l = 3, \dots, 10,\end{cases}$$
in each case plus $\sum_{k\ge l}\hat\alpha_k\hat w_{t+l-k}$ ($= \hat Z^t_{t+l}$, the forecast from the second fit).

For $l > 10$ we would need to obtain the forecasts $\hat X^t_{t+l-10}$ as well.

SLIDE 102

For $l = 1, \dots, 10$, from $Y_t = \beta_3\gamma(B)B^{10}X_t + \delta(B)w_t$ we obtain
$$Y_{t+l} = \beta_3\gamma(B)X_{t+l-10} + \delta(B)w_{t+l}, \qquad \hat Y^t_{t+l} = \beta_3\gamma(B)X_{t+l-10} + \sum_{k\ge l}\delta_k B^k w_{t+l},$$
and hence
$$Y_{t+l} - \hat Y^t_{t+l} = \sum_{k<l}\delta_k B^k w_{t+l} = \sum_{k=0}^{l-1}\delta_k w_{t+l-k}, \quad\text{with}\quad \mathrm{VAR}\left[Y_{t+l} - \hat Y^t_{t+l}\right] = \sigma_w^2\sum_{k=0}^{l-1}\delta_k^2.$$
This leads to prediction intervals
$$\hat Y^t_{t+l} \pm z_{\alpha/2}\,\hat\sigma_w\sqrt{\sum_{k=0}^{l-1}\hat\delta_k^2}.$$
How now are the $\hat\delta_k$ computed?

SLIDE 103

Other theoretical things you should be familiar with:

• Calculations of ACFs (pretty simple for an MA; derive and solve the Yule-Walker equations for an AR). What is the important property of the ACF of an MA?

• Calculation of PACFs - define the PACF, then do the required minimization. What is the important property of the PACF of an AR?

• Maximum likelihood estimation

— The end result of our theoretical discussion was that the MLEs of the AR and MA parameters $\boldsymbol\phi$ and $\boldsymbol\theta$ are the minimizers of the sum of squares of the residuals:
$$S(\boldsymbol\phi, \boldsymbol\theta) = \sum w_t^2(\boldsymbol\phi, \boldsymbol\theta).$$
For a pure AR this is accomplished by a linear regression of the series on its own past.

SLIDE 104

If there are MA terms then an iterative procedure (Gauss-Newton) is necessary. As an example, you should be able to show that, under the usual assumption (what is it?), for an MA(1) model you get
$$w_t(\theta) = \sum_{s=0}^{t-1}\theta^s X_{t-s},$$
so that $S(\theta)$ is a polynomial in $\theta$, of degree $2(t-1)$, to be minimized. What would you use as the starting estimate in the iterative procedure? How would the iterations then be carried out? Your calculations should yield the iterative procedure
$$\theta_{k+1} = \theta_k - \frac{\sum w_t(\theta_k)\,\dot w_t(\theta_k)}{\sum\dot w_t^2(\theta_k)}.$$
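A minimal sketch of that iteration (NumPy; the data are simulated and all names are mine). Under the assumption $X_0 = w_0 = 0$, both $w_t(\theta)$ and its derivative $\dot w_t(\theta)$ can be computed recursively ($w_t = X_t + \theta w_{t-1}$, $\dot w_t = w_{t-1} + \theta\dot w_{t-1}$); the starting value used here is the method-of-moments estimate, which keeps the iteration well behaved.

```python
import numpy as np

def ma1_resid_and_deriv(theta, x):
    """w_t(theta) and its derivative, computed recursively under w_0 = X_0 = 0."""
    w = np.zeros(len(x))
    dw = np.zeros(len(x))
    for t in range(len(x)):
        w_prev = w[t - 1] if t > 0 else 0.0
        dw_prev = dw[t - 1] if t > 0 else 0.0
        w[t] = x[t] + theta * w_prev
        dw[t] = w_prev + theta * dw_prev
    return w, dw

def gauss_newton_ma1(x, theta0, n_iter=20):
    theta = theta0
    for _ in range(n_iter):
        w, dw = ma1_resid_and_deriv(theta, x)
        theta = theta - np.sum(w * dw) / np.sum(dw ** 2)
    return theta

rng = np.random.default_rng(7)
theta_true, T = 0.6, 1000
noise = rng.normal(size=T + 1)
x = noise[1:] - theta_true * noise[:-1]          # simulated MA(1), ASTSA sign convention

# Method-of-moments start (rho(1) = -theta/(1 + theta^2)), then Gauss-Newton
xc = x - x.mean()
r1 = np.sum(xc[1:] * xc[:-1]) / np.sum(xc ** 2)
if abs(r1) < 0.5:
    roots = np.roots([r1, 1.0, r1])
    theta0 = roots[np.abs(roots) < 1][0].real
else:
    theta0 = 0.1
print("start:", round(theta0, 3), "Gauss-Newton:", round(gauss_newton_ma1(x, theta0), 3))
```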

SLIDE 105

— The usual estimate of the noise variance, obtained by modifying the MLE, is
$$\hat\sigma_w^2 = \frac{S\left(\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta}\right)}{\#\text{ of residuals} - \#\text{ of parameters}} = \frac{\sum(\text{residuals})^2}{\#\text{ of residuals} - \#\text{ of parameters}}.$$

— What is the "Information Matrix", and what role does it play in making inferences about the parameters in an ARMA model?