SLIDE 1

Analysis of Big Dependent Data in Economics and Finance

Ruey S. Tsay, Booth School of Business, University of Chicago. September 2016

Ruey S. Tsay Big Dependent Data 1 / 72

SLIDE 2

Outline

1. Big data? Machine learning? Data science? What is in it for economics and finance?
2. Real-world data are often dynamically dependent
3. A simple example: methods for independent data may fail
4. Trade-off between simplicity and reality
5. Some methods useful for analyzing big dependent data in economics and finance
6. Examples
7. Concluding remarks

SLIDE 3

Big dependent data

1. Accurate information is the key to success in the competitive global economy: the information age.
2. What is big data? High dimension (many variables)? Large sample size? Both?
3. Not all big data sets are useful: confounding and noise.
4. Need to develop methods to extract useful information from big data efficiently.
5. Know the limitations of big data.
6. Issues emerging from big data: privacy? ethics?
7. Focus here: methods for analyzing big dependent data in economics and finance.

SLIDE 4

What are available?

Statistical methods:

1. Focus on sparsity (simplicity)
2. Various penalized regressions, e.g. Lasso and its extensions
3. Various dimension-reduction methods and models
4. Common framework used: independent observations, with limited extensions to stationary data. Real data are often dynamically dependent!

Some useful concepts in analyzing big data:

1. Parsimony vs sparsity: parsimony does not imply sparsity
2. Simplicity vs reality: a trade-off between feasibility and sophistication

SLIDE 5

Parsimonious, not sparse

A simple example:

y_t = c + Σ_{i=1}^k β x_{it} + ε_t = c + β Σ_{i=1}^k x_{it} + ε_t,

where k is large, the x_{it} are not perfectly correlated, and the ε_t are iid N(0, σ²). The model has three parameters, so it is parsimonious, but it is not sparse because y depends on all k explanatory variables. In some applications, Σ_{i=1}^k x_{it} is a close approximation to the first principal component; for example, the level of interest rates is important to an economy. Fused Lasso can handle this difficulty in some situations.
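A quick numerical check of this point: with positively equicorrelated regressors, the equal-weight sum Σ_i x_{it} is essentially the first principal component. The sketch below uses a hypothetical design (k = 20, pairwise correlation 0.5, chosen for illustration only), numpy only.

```python
import numpy as np

rng = np.random.default_rng(42)
T, k, rho = 500, 20, 0.5

# Equicorrelated design: x's are positively, but not perfectly, correlated
cov = np.full((k, k), rho) + (1.0 - rho) * np.eye(k)
x = rng.multivariate_normal(np.zeros(k), cov, size=T)
xc = x - x.mean(axis=0)

# First principal component score of the centered x's
eigvals, eigvecs = np.linalg.eigh(np.cov(xc, rowvar=False))
pc1 = xc @ eigvecs[:, -1]          # eigh sorts eigenvalues in ascending order

# Equal-weight sum of the regressors, as in the parsimonious model above
row_sum = xc.sum(axis=1)
corr = np.corrcoef(pc1, row_sum)[0, 1]
print(abs(corr))                   # essentially 1
```

The sign of a principal component is arbitrary, hence the absolute value in the check.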

SLIDE 6

What is LASSO regression?

Model (assume mean-adjusted data): y_i = Σ_{j=1}^p β_j X_{j,i} + ε_i. In matrix form, with X the design matrix, Y = Xβ + ε. The objective function (usable, in particular, when p > T) is

β̂(λ) = arg min_β { ||Y − Xβ||_2² / T + λ ||β||_1 },

where λ ≥ 0 is a penalty parameter, ||β||_1 = Σ_{j=1}^p |β_j|, and ||Y − Xβ||_2² = Σ_{i=1}^T (y_i − X_i′β)².
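The minimizer can be computed with any proximal-gradient routine; the slides use R's lars/glmnet, but the same objective can be solved in a Python sketch with a hand-rolled ISTA loop (the simulated design and the penalty value are illustrative assumptions, not from the slides).

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize ||y - X b||_2^2 / T + lam * ||b||_1 by proximal gradient (ISTA)."""
    T, p = X.shape
    b = np.zeros(p)
    step = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X / T).max())  # 1 / Lipschitz
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ b - y) / T
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(0)
T, p = 100, 50
X = rng.standard_normal((T, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]               # sparse truth (illustrative)
y = X @ beta + 0.1 * rng.standard_normal(T)

b_hat = lasso_ista(X, y, lam=0.1)
print(np.flatnonzero(np.abs(b_hat) > 1e-6))   # indices of selected variables
```

The soft-threshold step is what produces exact zeros, i.e. the sparsity discussed on the next slide.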

SLIDE 7

What is the big deal?

Sparsity: using convexity, LASSO is equivalent to the constrained problem

β̂_opt(R) = arg min_{β: ||β||_1 ≤ R} ||Y − Xβ||_2² / T.

Old friend: ridge regression,

β̂_Ridge(λ) = arg min_β { ||Y − Xβ||_2² / T + λ||β||_2² }, or
β̂(R) = arg min_{β: ||β||_2² ≤ R} ||Y − Xβ||_2² / T.

Special case p = 2: ||Y − Xβ||_2² / T is quadratic in β; the constraint region ||β||_1 ≤ R is diamond-shaped, whereas ||β||_2² ≤ R is a circle. The corners of the diamond are why LASSO leads to sparsity.

SLIDE 8

Computation and extensions

1. Optimization: least angle regression (lars) by Efron et al. (2004) makes the computation very efficient.
2. Extensions:
   - Group lasso: Yuan and Lin (2006). Subsets of X have specific meaning, e.g. treatments.
   - Elastic net: Zou and Hastie (2005). Uses a combination of L1 and L2 penalties.
   - SCAD (smoothly clipped absolute deviation): Fan and Li (2001). Nonconcave penalized likelihood.
   - Various Bayesian methods: the penalty function is the prior.
3. Packages available in R: lars, glmnet, gamlr, gbm, and many others.

SLIDE 9

A simulated example

p = 300, T = 150, X iid N(0, 1), ε_i iid N(0, 0.25), and

y_i = x_{3i} + 2(x_{4i} + x_{5i} + x_{7i}) − 2(x_{11,i} + x_{12,i} + x_{13,i} + x_{21,i} + x_{22,i} + x_{30,i}) + ε_i.

1. How? R demonstration
2. Selection of λ? Cross-validation (10-fold), measuring prediction accuracy
3. The commands lars and cv.lars of the package lars
4. The commands glmnet and cv.glmnet of the package glmnet
5. Relationship between the two packages (alpha = 0)
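The same experiment can be replicated outside R; the sketch below reproduces the design in Python, with a hand-rolled coordinate-descent Lasso standing in for lars/glmnet and a fixed penalty in place of 10-fold cross-validation (both substitutions are illustrative).

```python
import numpy as np

def soft(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for ||y - X b||^2 / (2T) + lam * ||b||_1."""
    T, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / T
    r = y.copy()                          # current residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * b[j]           # partial residual without variable j
            b[j] = soft(X[:, j] @ r / T, lam) / col_ss[j]
            r -= X[:, j] * b[j]
    return b

rng = np.random.default_rng(1)
T, p = 150, 300
X = rng.standard_normal((T, p))
beta = np.zeros(p)
beta[2] = 1.0                             # x3 (0-based index 2)
beta[[3, 4, 6]] = 2.0                     # x4, x5, x7
beta[[10, 11, 12, 20, 21, 29]] = -2.0     # x11, x12, x13, x21, x22, x30
y = X @ beta + rng.normal(0.0, 0.5, T)    # eps ~ N(0, 0.25), i.e. sd 0.5

sel = np.flatnonzero(np.abs(lasso_cd(X, y, lam=0.1)) > 1e-8)
print(sel)                                # should contain the ten true variables
```

With independent observations and strong signals, the true support is recovered (possibly along with a few spurious small coefficients); the next slide shows why dependence changes this picture.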

SLIDE 10

Lasso may fail for dependent data

1

Data generating model: scalar Gaussian autoregressive, AR(3), model xt = 1.9xt−1 − 0.8xt−2 − 0.1xt−3 + at, at ∼ N(0, 1). Generate 2000 observations. See Figure 1.

2

Big data setup

Dependent xt: t = 11, . . . , 2000 Regressors: Xt = [xt−1, xt−2, . . . , xt−10, z1t, . . . , z10,t], where zit are iid N(0, 1). Dimension = 20, sample size 1990.

3

Run the Lasso regression via the lars package of R. See Figure 2 for results. Lag 3, xt−3 was not selected. Lasso fails in this case.

SLIDE 11

Figure: Time plot of the simulated AR(3) time series with 2000 observations.

SLIDE 12

Figure: Results of the Lasso regression (lars coefficient paths) for the AR(3) series.

SLIDE 13

OLS works if we entertain AR models

Run the linear regression using the first three variables of X_t. The fitted model is x_t = 1.902x_{t−1} − 0.807x_{t−2} − 0.095x_{t−3} + ε_t, with σ_ε = 1.01. All estimates are statistically significant, with p-values less than 2.22 × 10⁻⁵. The residuals are well behaved, e.g. Q(10) = 12.23 with p-value 0.20 (after adjusting the degrees of freedom). Simple time series methods work for dependent data.
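The fit is easy to reproduce; the sketch below simulates a fresh AR(3) sample (so the estimates differ slightly from the numbers above) and runs the least-squares regression on the first three lags.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
x = np.zeros(n + 3)
for t in range(3, n + 3):
    x[t] = 1.9 * x[t-1] - 0.8 * x[t-2] - 0.1 * x[t-3] + rng.standard_normal()
x = x[3:]

# OLS of x_t on its first three lags
Y = x[3:]
X = np.column_stack([x[2:-1], x[1:-2], x[:-3]])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(b, 3))                 # close to (1.9, -0.8, -0.1)
```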

SLIDE 14

Why does lasso fail?

Two possibilities:

1. Scaling effect: Lasso standardizes each variable in X_t. For unit-root nonstationary time series, standardization might wash out the dependence in the stationary part.
2. Multicollinearity: unit-root time series have strong serial correlations (the ACFs approach 1 at all lags).

This artificial example highlights the difference between independent and dependent data. We need to develop methods for big dependent data!

SLIDE 15

Possible solutions

1. Re-parameterization using time series properties
2. Use different penalties for different parameters

The first approach is easier. For this particular time series, define ∆x_t = (1 − B)x_t and ∆²x_t = (1 − B)²x_t. Then

x_t = 1.9x_{t−1} − 0.8x_{t−2} − 0.1x_{t−3} + a_t = x_{t−1} + ∆x_{t−1} − 0.1∆²x_{t−1} + a_t = double unit root + single unit root + stationary + a_t.

The coefficients of x_{t−1}, ∆x_{t−1}, and ∆²x_{t−1} are 1, 1, and −0.1, respectively.
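The identity holds because the AR polynomial factors into a double unit root times a stationary factor, 1 − 1.9B + 0.8B² + 0.1B³ = (1 − B)²(1 + 0.1B), which is easy to verify numerically:

```python
import numpy as np

# AR polynomial of x_t = 1.9 x_{t-1} - 0.8 x_{t-2} - 0.1 x_{t-3} + a_t,
# written as phi(B) x_t = a_t with phi(B) = 1 - 1.9B + 0.8B^2 + 0.1B^3
phi = np.array([1.0, -1.9, 0.8, 0.1])

# (1 - B)^2 (1 + 0.1 B): two unit roots times a stationary factor
factored = np.polymul(np.polymul([1.0, -1.0], [1.0, -1.0]), [1.0, 0.1])
print(factored)                       # coefficients of 1, B, B^2, B^3
```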

SLIDE 16

Different frameworks for LASSO

The X-matrix of the conventional LASSO consists of (x_{t−1}, x_{t−2}, …, x_{t−10}, z_{1t}, …, z_{10,t}), where the z_{it} are iid N(0, 1). Under the re-parameterization, the X-matrix becomes (x_{t−1}, ∆x_{t−1}, ∆²x_{t−1}, …, ∆²x_{t−8}, z_{1t}, …, z_{10,t}). The two X-matrices provide theoretically the same information; however, the first has high multicollinearity, whereas the second does not, especially after standardization.

SLIDE 17

Figure: Comparison of β-estimates from the lars results under the two parameterizations.

SLIDE 18

Theoretical justification

Focus on the particular series x_t used. Some properties of the series are:

1. T^{−4} Σ_{t=1}^T x_t² ⇒ ∫_0^1 W̄(s)² ds, where W̄(s) = ∫_0^s W(u) du and W(s) is standard Brownian motion
2. T^{−5/2} Σ_{t=1}^T x_t ⇒ ∫_0^1 W̄(s) ds
3. T^{−3} Σ_{t=1}^T x_t ∆x_t ⇒ ∫_0^1 W̄(s) W(s) ds
4. T^{−2} Σ_{t=1}^T (∆x_t)² ⇒ ∫_0^1 W(s)² ds

Standardization may wash out the ∆x_{t−1} and ∆²x_{t−1} parts.

SLIDE 19

Examples of big dependent data

1. Daily returns of U.S. stocks
2. Demand for electricity at 30-minute intervals
3. Daily spreads of CDS (credit default swaps) of selected companies
4. Monthly unemployment rates of the 50 U.S. states
5. Interest rates of an economy
6. Air pollution measurements at multiple locations and health risk; complex spatio-temporal data in general

SLIDE 20

Figure: Sample sizes of U.S. daily stock returns in 2012 and 2013: mean 6681, range (6593, 6774).

SLIDE 21

Figure: Densities of daily log returns of U.S. stocks in 2012 and 2013.

SLIDE 22

Figure: Empirical densities of electricity demand at 30-minute intervals in Adelaide, Australia, July 6, 1997 to March 31, 2007, by day of the week (Monday through Sunday).

SLIDE 23

Figure: Time plots of monthly state unemployment rates of the U.S. from 1976.1 to 2015.9.

SLIDE 24

Some statistical methods

Goal: extract useful information, including pooling.

1. Classification and cluster analysis: K-means, tree-based classification, model-based classification
2. Factor models and extensions: orthogonal factor model, approximate factor model, dynamic factor model, constrained factor models (column and row constraints), X_t = R f_t C + e_t
3. Generalizations of Lasso methods to dependent data, e.g. LASSO for nowcasting vs MIDAS

SLIDE 25

Constrained factor models

Column (variable) constraint only: Tsai and Tsay (2010). Let z_t be a k-dimensional time series,

z_t = H ω f_t + ε_t,  t = 1, …, T,

where H is a known k × r matrix, f_t is an m-dimensional common factor, and ω is an r × m matrix of unknown loading parameters. For the observed data in matrix form, Z = F ω′ H′ + ε.

SLIDE 26

A simple illustration

Monthly log returns of 10 stocks from 2001 to 2011:

1. Semiconductor: TXN, MU, INTC, TSM
2. Pharmaceutical: PFE, MRK, LLY
3. Investment bank: JPM, MS, GS

The constraint matrix is H = [h_1, h_2, h_3], where
h_1 = (1, 1, 1, 1, 0, 0, 0, 0, 0, 0)′
h_2 = (0, 0, 0, 0, 1, 1, 1, 0, 0, 0)′
h_3 = (0, 0, 0, 0, 0, 0, 0, 1, 1, 1)′

SLIDE 27

Table: Estimation results of constrained and orthogonal factor models.

        Constrained model: L = Hω         Orthogonal model: PCA
Stock   L1     L2     L3     Σε,i         L1     L2     L3     Σε,i
TXN     0.76   0.26   0.27   0.28         0.79   0.20   0.32   0.24
MU      0.76   0.26   0.27   0.28         0.67   0.36   0.29   0.34
INTC    0.76   0.26   0.27   0.28         0.79   0.18   0.33   0.23
TSM     0.76   0.26   0.27   0.28         0.80   0.27   0.16   0.26
PFE     0.44  -0.68   0.10   0.34         0.49  -0.64  -0.03   0.35
MRK     0.44  -0.68   0.10   0.34         0.40  -0.69   0.23   0.31
LLY     0.44  -0.68   0.10   0.34         0.45  -0.70   0.06   0.31
JPM     0.74   0.06  -0.43   0.27         0.72   0.02  -0.35   0.36
MS      0.74   0.06  -0.43   0.27         0.76   0.05  -0.43   0.25
GS      0.74   0.06  -0.43   0.27         0.75   0.12  -0.50   0.18
e.v.    4.58   1.65   0.88                4.63   1.68   0.93

Variability explained: 70.6% (constrained) vs 72.4% (PCA).

SLIDE 28

Both row and column constraints

Tsai et al. (2016). T observations and k variables. In data-matrix form,

Z = F_1 ω_1′ H′ + G F_2 ω_2′ + G F_3 ω_3′ H′ + E,

where G denotes a known T × m row-constraint matrix.

SLIDE 30

Figure: Time plots of monthly housing starts (in logarithms) of 9 U.S. census divisions, 1997-2006: New England, Middle Atlantic, East North Central, West North Central, South Atlantic, East South Central, West South Central, Mountain, Pacific.

SLIDE 31

Figure: Time series plots of the common factors (two columns each of F1, F2, F3) for a DCF model of order (r, p, q) = (2, 2, 2), estimated by maximum likelihood.

SLIDE 32

Figure: Time series plots of the G F_2 ω_2′ term of a fitted DCF model of order (2,2,2), estimated by maximum likelihood, for the nine U.S. census divisions.

SLIDE 33

Figure: Time series plots of the F_1 ω_1′ H′ term of a fitted DCF model of order (2,2,2), estimated by maximum likelihood, for the nine U.S. census divisions.

SLIDE 34

Figure: Time series plots of the G F_3 ω_3′ H′ term of a fitted DCF model of order (2,2,2), estimated by maximum likelihood, for the nine U.S. census divisions.

SLIDE 35

Matrix-valued variables

Consider simultaneously n macroeconomic variables in k countries:

        U.S.     Italy    Spain    ...   Canada
GDP     X11,t    X12,t    X13,t    ...   X1k,t
Unem    X21,t    X22,t    X23,t    ...   X2k,t
CPI     X31,t    X32,t    X33,t    ...   X3k,t
...
M1      Xn1,t    Xn2,t    Xn3,t    ...   Xnk,t

On-going work: only preliminary results are available. See Chen et al. (2016).

SLIDE 36

Classification

A possible approach: use a two-step procedure.

1. Transform the dependent big data into functions, e.g. probability densities
2. Apply classification methods to the functional data

The density functions of daily log returns of U.S. stocks serve as an example: we can classify the density functions and then draw statistical inference.

SLIDE 37

Illustration of classification

Cluster analysis of density functions. Consider the time series of density functions {f_t(x)}. For simplicity, assume the densities are evaluated at equally spaced grid points {x_1 < x_2 < … < x_N} ⊂ D with increment ∆x, so that the data become {f_t(x_i) | t = 1, …, T; i = 1, …, N}. Using the Hellinger distance (HD), we consider two methods: K-means and tree-based classification.

SLIDE 38

Hellinger distance of two density functions

Let f(x) and g(x) be two density functions on a common domain D ⊂ R, both absolutely continuous w.r.t. the Lebesgue measure. The Hellinger distance (HD) between f and g is defined as

H(f, g)² = (1/2) ∫_D (√f(x) − √g(x))² dx = 1 − ∫_D √(f(x) g(x)) dx.

Basic properties:

1. H(f, g) ≥ 0
2. H(f, g) = 0 if and only if f(x) = g(x) almost everywhere.
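On a grid, both expressions for H(f, g)² can be computed directly; the sketch below uses two illustrative normal densities (grid range and spacing are arbitrary choices) and checks them against the closed form 1 − exp(−(μ₁ − μ₂)²/(8σ²)) for equal-variance normals.

```python
import numpy as np

def hellinger2(f, g, dx):
    """Squared Hellinger distance between densities sampled on a common grid."""
    return 0.5 * np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
norm = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
f, g = norm(0.0, 1.0), norm(1.0, 1.0)

h2 = hellinger2(f, g, dx)
h2_alt = 1.0 - np.sum(np.sqrt(f * g)) * dx    # the equivalent second form
print(h2, h2_alt, hellinger2(f, f, dx))
```

The two forms agree up to discretization error, and H(f, f) = 0 exactly.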

SLIDE 39

K-means method

For a given K, the K-means method seeks a partition of the densities, say C_1, …, C_K, such that

1. ∪_{k=1}^K C_k = {f_t(x)}
2. C_i ∩ C_j = ∅ for i ≠ j
3. the sum of within-cluster variations V = Σ_{k=1}^K V(C_k) is minimized, where the within-cluster variation is V(C_k) = Σ_{t_1, t_2 ∈ C_k} H(f_{t_1}, f_{t_2})².

It turns out this can easily be done by applying the K-means method with squared Euclidean distance to the square-root densities {√f_t(x)}.
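A minimal sketch of that equivalence: ordinary K-means (plain Lloyd iterations with a deterministic initialization, sketch only) applied to square-root densities separates two illustrative groups of normal densities.

```python
import numpy as np

def kmeans(Z, K, n_iter=50):
    """Plain Lloyd's algorithm with squared Euclidean distance (sketch only)."""
    centers = Z[np.linspace(0, len(Z) - 1, K).astype(int)].copy()
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([Z[labels == k].mean(axis=0) for k in range(K)])
    return labels

# Toy densities on a grid: ten centered near 0, ten centered near 4
x = np.linspace(-10.0, 10.0, 801)
norm = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
means = list(np.linspace(0.0, 0.5, 10)) + list(np.linspace(4.0, 4.5, 10))
dens = np.array([norm(m) for m in means])

# Hellinger K-means = Euclidean K-means on the square-root densities
labels = kmeans(np.sqrt(dens), K=2)
print(labels)
```

The first ten and last ten densities land in different clusters.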

SLIDE 40

Example of K-means

Consider the 48 density functions of half-hour electricity demand on Monday in Adelaide, Australia. With K = 4 clusters, we have:

k   Elements (time index)          Calendar hours
1   17 to 44                       8:00 AM to 10:00 PM
2   15, 16, 45 to 48, 1, 2, 3      7:00-8:00 AM; 10:00 PM-1:30 AM
3   4, 5, 13, 14                   1:30-2:30 AM; 6:00-7:00 AM
4   6 to 12                        2:30-6:00 AM

Result: the clusters capture daily activities, namely (1) an active period, (2) a transition period, (3) a light-sleeping period, and (4) a sound-sleeping period.

SLIDE 41

Figure: Density functions of half-hour electricity demand on Monday in Adelaide, Australia. The sample period is from July 6, 1997 to March 31, 2007.

SLIDE 42

Figure: Results of K-means cluster analysis based on squared Hellinger distance for electricity demand on Monday. Different colors denote different clusters.

SLIDE 43

Tree-based classification

Let Z_t = (z_{1t}, …, z_{pt})′ denote p covariates. We use an iterative procedure to build a binary tree, starting with the root C_0 = {f_t(x)}.

1. For each covariate z_{it}, let z_{i(j)} be its jth order statistic.
   (a) Divide C_0 into two sub-clusters C_{i,j,1} = {f_t(x) | z_{it} ≤ z_{i(j)}} and C_{i,j,2} = {f_t(x) | z_{it} > z_{i(j)}}.
   (b) Compute the sum of within-cluster variations H(i, j) = V(C_{i,j,1}) + V(C_{i,j,2}).
   (c) Find the smallest j, say v_i, such that H(i, v_i) = min_j H(i, j).
2. Select i ∈ {1, …, p}, say I, such that H(I, v_I) = min_i H(i, v_i).
3. Use covariate z_{It} with threshold v_I to grow two new leaves, i.e. C_{1,1} = C_{I,v_I,1} and C_{1,2} = C_{I,v_I,2}.
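The first split of the procedure can be sketched as follows, with a single covariate and toy densities whose shape changes at a known cutoff (all choices illustrative): the threshold minimizing the sum of within-cluster Hellinger variations recovers that cutoff.

```python
import numpy as np

def hell2(f, g, dx):
    return 0.5 * np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx

def within_var(cluster, dx):
    """Sum of pairwise squared Hellinger distances within a cluster."""
    return sum(hell2(cluster[a], cluster[b], dx)
               for a in range(len(cluster)) for b in range(a + 1, len(cluster)))

x = np.linspace(-8.0, 12.0, 1001)
dx = x[1] - x[0]
norm = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
z = np.arange(20.0)                                  # single covariate
dens = np.array([norm(0.0 if zi < 10 else 4.0) for zi in z])

# Scan the order statistics of z; pick the split minimizing H(i, j)
best = min((within_var(dens[z <= c], dx) + within_var(dens[z > c], dx), c)
           for c in z[:-1])
print(best[1])                                       # chosen threshold
```

The minimizing threshold is the last covariate value of the first group, i.e. the split is exactly at the cutoff.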

SLIDE 44

Tree-based procedure continued

Next, consider C_{1,1} and C_{1,2} as roots of branches and apply the same procedure, with their associated covariates, to find candidates for growth. The only modification is as follows: when considering C_{1,1} for further division, we treat C_{1,2} as a leaf in computing the sum of within-cluster variations, and similarly, when considering C_{1,2}, we treat C_{1,1} as a leaf. This growth procedure is iterated until the desired number of clusters K is reached.

SLIDE 45

Example of tree-based classification

Consider the density functions of U.S. daily log stock returns in 2012 and 2013. Using the first-differenced VIX index as the explanatory variable and K = 4, we obtain four clusters: (−∞, −0.73], (−0.73, 0.39], (0.39, 1.19], and (1.19, ∞). The cluster sizes are 104, 259, 86, and 53, respectively. Note that a positive z_t signifies an increase in market volatility (uncertainty).

SLIDE 46

What drove the U.S. financial market?

Figure: Time plots of the market fear factor (the VIX index) and its change series, 2012-2013.

SLIDE 47

Figure: Results of tree-based cluster analysis for the daily densities of log returns of U.S. stocks in 2012 and 2013. The first-differenced series of the VIX index (dvix) is used as the explanatory variable. The numbers of elements in the clusters are 53, 86, 259, and 104 for dvix > 1.19, 1.19 ≥ dvix > 0.39, 0.39 ≥ dvix > −0.73, and dvix ≤ −0.73, respectively.

SLIDE 48

Model-based classification

Work directly on the observed multiple time series.

1. Postulate a general univariate model for all time series, e.g. an AR(p) model
2. Time series in a cluster follow the same model: pool the data to estimate common parameters
3. Time series in different clusters follow different models
4. May be estimated by Markov chain Monte Carlo methods
5. May employ scale mixtures of normal innovations to handle outliers

Such models have been widely studied, e.g. by Wang et al. (2013) and Fruehwirth-Schnatter (2011), among others.

SLIDE 49

Application

1. Apply to the monthly unemployment rates of the 50 U.S. states
2. Use out-of-sample predictions to compare with other methods, including Lasso
3. For 1-step to 5-step-ahead predictions, the model-based method works well in comparison; see Wang et al. (2013, JoF).

SLIDE 50

Table: Root mean squared errors (RMSE) and mean absolute errors (MAE), both ×10⁴, of m-step-ahead out-of-sample forecasts.

           RMSE × 10⁴               MAE × 10⁴
Method    m=1   m=2   m=3   m=4    m=1   m=2   m=3   m=4
UAR       1616  1492  1791  2073   879   994   1268  1386
VAR       2676  2095  2129  2759   1349  1353  1506  1624
Lasso25   1798  1833  2063  2504   1245  1250  1332  1401
Lasso15   1714  1798  1855  2028   1186  1228  1296  1399
G-Lasso   1877  1865  1882  1905   1291  1290  1306  1327
LVAR      1550  1716  1806  1904   1065  1298  1210  1355
Pls10     1239  1531  1679  1873   909   1028  1263  1226
Pls30     1395  1651  1835  1890   933   1092  1281  1320
Pls50     1685  1871  2006  1967   940   1158  1304  1377
Pls70     1914  2040  2182  1953   996   1222  1362  1432
Pls100    2187  2279  2313  2123   1099  1342  1480  1552
Pcr10     1276  1829  2077  2108   890   1073  1247  1415
Pcr30     1577  1837  2049  1769   888   1093  1261  1321
Pcr50     1546  1805  2017  1759   880   1035  1209  1260
Pcr70     1594  1837  2049  1769   886   1042  1221  1283
Pcr100    1649  2117  2202  2163   1068  1243  1324  1421
MBC       1607  1703  1809  1961   885   1035  1225  1361
rMBC      1225  1481  1691  1839   873   1027  1193  1295

SLIDE 51

Functional PCA: Singular value decomposition

1. A tool to study the time evolution of the return distributions.
2. Data set: in this particular instance, each density function is evaluated at 512 points, so Y = [Y_it = f_t(x_i) | i = 1, …, N; t = 1, …, T] is 512 × 502.
3. Perform the singular value decomposition Ỹ = UDV′, where Ỹ denotes the mean-adjusted data matrix, U is an N × N unitary matrix, D is an N × T rectangular diagonal matrix, and V is a T × T unitary matrix.
4. This is a simple form of functional PCA. (With large samples, smoothing of the PCs is not needed.)
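The SVD step can be sketched as follows; the matrix Y here is a random stand-in for the actual density data (same 512 × 502 shape), so only the mechanics carry over.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 512, 502
Y = rng.standard_normal((N, T))          # stand-in for [f_t(x_i)]

# Subtract the mean density, then decompose
Yc = Y - Y.mean(axis=1, keepdims=True)
U, d, Vt = np.linalg.svd(Yc, full_matrices=False)

# Columns of U are the PC functions; d * Vt gives the day-by-day loadings
var_share = d ** 2 / (d ** 2).sum()      # proportion of variability per PC
print(np.allclose((U * d) @ Vt, Yc))     # exact reconstruction
```

The squared singular values, suitably normalized, are the variances plotted in the scree plot below.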

SLIDE 52

Scree plot


Figure: Scree plot of PCA for daily return densities in 2012 and 2013.

SLIDE 53

The first 6 PC functions


Figure: The first 6 PC functions for daily log return densities in 2012 and 2013.

SLIDE 54

The next 6 PC functions


Figure: The 7th-12th PC functions for daily log return densities in 2012 and 2013.

SLIDE 55

Meaning of PC functions? 1st

Figure: Mean density ± first PC. Peak and tails: mean + standardized first PC shown in red.

SLIDE 56

Meaning of PC functions? 2nd

Figure: Mean density ± second PC: midrange returns.

SLIDE 57

Meaning of PC functions? 3rd

Figure: Mean density ± third PC: curvature.

SLIDE 58

Approximate factor models

The model is

f_t(x) = Σ_{i=1}^p λ_{t,i} g_i(x) + ε_t(x),

where g_i(x) denotes the ith common factor and ε_t(x) is the noise function.

1. A generalization of the orthogonal factor model that allows the error functions to be correlated.
2. Only asymptotically identified, under some regularity conditions.
3. FPCA provides a way to estimate approximate factor models.

SLIDE 59

Loadings of the first PC function

Figure: Scatter plot of first-PC loadings vs changes in the VIX index. The red line denotes a lowess fit.

SLIDE 60

Functional PC via Thresholding

1. Zero appears to be a reasonable and natural threshold
2. Regime 1: dvix ≥ 0, with 244 days (the volatile, "bad" state)
3. Regime 2: dvix < 0, with 258 days (the calm, "good" state)
4. Perform PCA of the density functions for each regime
5. The differences between the regimes are clearly seen
6. This leads to different approximate factor models for the density functions

SLIDE 61

Scree plots

Figure: Scree plots of the PCA for each regime (dvix ≥ 0 and dvix < 0).

SLIDE 62

The first 6 PC functions

Figure: The first 6 PC functions of daily log return densities for each regime; the red line is for the calm state, Regime 2.

SLIDE 63

Approximate factor models

1. Use approximate factor models with the first 12 principal component functions
2. Compare the overall fits with and without thresholding
3. For Regime 1 (positive dvix): randomly select day 17
4. For Regime 2 (negative dvix): randomly select day 420
5. Check (a) observed vs fitted values and (b) residuals with and without thresholding
6. With 12 components both approaches fare well, but thresholding provides improvements

SLIDE 64

Comparison: day 17 (in Regime 1)

Figure: Top: the observed density (black) and its fits using all data (red) and thresholding (blue). Bottom: approximation errors using all data (black) and thresholding (red).

SLIDE 65

Comparison: day 420 (in Regime 2)

Figure: Top: the observed density (black) and its fits using all data (red) and thresholding (blue). Bottom: approximation errors using all data (black) and thresholding (red).

SLIDE 66

Lasso and beyond

1. Need to exploit parsimony, beyond sparsity
2. Need to take prior knowledge into account. We have accumulated a lot of knowledge in diverse scientific areas; how do we take advantage of it?
3. Variable selection is not sufficient. More importantly, what are the proper measurements to take? What questions can a given big data set answer?

SLIDE 67

An illustration

Every country has many interest rate series, which

1. have different maturities
2. serve different financial purposes

What is the information embedded in those interest rate series? Consider U.S. weekly constant maturity interest rates:

1. from January 8, 1982 to October 30, 2015
2. maturities: 3m, 6m, 1y, 2y, 3y, 5y, 7y, 10y, and 30y*

SLIDE 68

Figure: Time plots of U.S. weekly interest rates with different maturities, 1/8/1982 to 10/30/2015.

SLIDE 69

Figure: Scree plot of U.S. weekly interest rates.

SLIDE 70

Figure: Time plots of the first four principal components of U.S. weekly interest rates.

SLIDE 71

Implication?

In Lasso-type analysis,

1. should we use the interest rate series directly, even with the group lasso? This leads to sparsity.
2. should we apply PCA first and then use the PCs? This leads to parsimony.
3. should we develop other possibilities? Fused lasso? Factor models?

SLIDE 72

Concluding Remarks

1. Big dependent data appear in many applications
2. Methods developed for independent big data may fail
3. Statistical methods for big dependent data are relatively under-developed
4. Some new challenges emerge, and new opportunities exist
5. Simple modifications of the traditional methods might work well
6. Both theory and methods require further research
