SLIDE 1

Big data and machine learning in macroeconomics: Some challenges and prospects

Eleni Kalamara, George Kapetanios, Felix Kempf
King’s College London

SLIDE 2

Motivation

  • Macroeconomic forecasts have been, to put it mildly, receiving bad press...
  • Are the criticisms fair? Yes and no.
  • Economic forecasters have been compared to weather forecasters.
  • But it is akin to asking a weather forecaster to predict new kinds of weather phenomena all the time.
  • And not knowing exactly what to measure and how (e.g. intangibles)...
  • Structural change of a variety of forms is a big problem.
SLIDE 3

Motivation

  • Recent surge in data collection (the Big Data era).
  • Different types of Big Data: textual data, financial transactions, selected internet searches, surveys.
  • Big Data may be able to aid in improving economic forecasts.
  • Traditional forecasting tools cannot handle the size and complexity inherent in Big Data.
  • Econometricians have refined numerous techniques from different disciplines to digest the ever-growing amount of data, avoid overfitting and improve forecast accuracy, e.g. factor models.
  • But many issues remain.
SLIDE 4

Motivation

  • We explore three important challenges in the use of machine learning and big data for macroeconomics, and provide some proposals on ways forward.
  • The first challenge is whether and how to incorporate big datasets in models that account for the time series nature of macroeconomic data.
  • The second challenge is to allow machine learning models to accommodate structural change. Most cannot do so as they stand, since they seem best suited to stationary data; this is certainly true of neural networks.
  • The third challenge is to understand the black boxes that machine learning models are. Approaches exist, but we need one tailored to macroeconomics.
SLIDE 5

A Time Series Model for Unstructured Data

SLIDE 6

A simple model

  • Recently, researchers have used big unstructured datasets to improve inference on unobserved variables and to improve forecasting.
  • E.g. payroll data to improve unemployment analysis.
  • But they construct summaries of the big dataset rather than use all of it.
  • Let
    $$X_i = F + \epsilon_i, \quad i = 1, \ldots, N,$$
    where the $X_i$ are observed, $F$ and $\epsilon_i$ are unobserved, $F \sim niid(0, \sigma_f^2)$ and $\epsilon_i \sim niid(0, \sigma_i^2)$. We are interested in $\mathrm{Var}(F \mid X_1, \ldots, X_N)$. Is $\mathrm{Var}(F \mid \bar{X})$, with $\bar{X} = \frac{1}{N} \sum_i X_i$, a good enough alternative?
  • Yes, but only under the restrictive assumption that $\sigma_i^2 = \sigma^2$ for all $i$ (see the appendix for details).
  • We suggest a state space model that uses the full big dataset.
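As a quick numerical illustration of the gap between the two conditional variances, the closed-form expressions from the appendix can be evaluated directly. This is a minimal sketch, not the authors' code: the $\sigma_i^2$ draws are hypothetical, $\sigma_f^2$ is normalised to 1, and the level of the ratio depends entirely on how heterogeneous the idiosyncratic variances are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heteroskedastic idiosyncratic variances sigma_i^2 ~ U(1, 3)
for N in [200, 400, 600, 800, 1000]:
    sigma2 = rng.uniform(1, 3, size=N)
    s = np.sum(1.0 / sigma2)
    # Var(F | X_1, ..., X_N) via the Sherman-Morrison expression: equals 1/(1 + s)
    var_full = 1.0 - (s - s**2 / (1.0 + s))
    # Var(F | Xbar), using Var(Xbar) = 1 + sum(sigma_i^2) / N^2
    var_mean = 1.0 - 1.0 / (1.0 + sigma2.sum() / N**2)
    print(N, var_full / var_mean)  # ratio < 1: the full dataset is more informative
```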

SLIDE 7

Ratio $\mathrm{Var}(F \mid X_1, \ldots, X_N) \,/\, \mathrm{Var}(F \mid \bar{X})$

[Figure: the ratio plotted against $N$ from 200 to 1000; the vertical axis runs from 0.15 to 0.55.]
SLIDE 8

The model

The $N$-dimensional balanced dataset:
$$X_t = \Lambda F_t + \xi_t \tag{1}$$
The unstructured dataset:
$$Z_t = B_t F_t + \epsilon_t \tag{2}$$

  • $Z_t$ is $k_t \times 1$, $z_t = (z_{1t}, \ldots, z_{k_t t})'$, where $k_t \gg T$ (see the appendix for an example).
  • Importantly, $k_t$ has a time-varying dimension. There can be a different number of events at every period, and each event can be represented by a vector of different dimension (e.g. the number of newspaper articles, or of employees in a payroll, at each $t$).

The factor transition:
$$F_t = C F_{t-1} + \eta_t \tag{3}$$
SLIDE 9

State Space Form

Define:
$$Y_t = \begin{pmatrix} X_t \\ Z_t \end{pmatrix} = \begin{pmatrix} \Lambda \\ B_t \end{pmatrix} F_t + \begin{pmatrix} \xi_t \\ \epsilon_t \end{pmatrix}$$

And re-write:
$$Y_t = \Lambda_{0,t} F_t + \zeta_t \quad \text{(measurement eq.)}$$
$$F_t = C F_{t-1} + \eta_t \quad \text{(transition eq.)}$$
where $\Lambda_{0,t} = (\Lambda, B_t)'$.
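A minimal sketch of the filtering recursion this state space form implies (illustrative NumPy code, not the authors' implementation): the only departure from a textbook Kalman filter is that the stacked loading matrix and measurement noise covariance are rebuilt each period to match the period-specific dimension of $Y_t$.

```python
import numpy as np

def tv_kalman_filter(Y, Lam0, Sigma_zeta, C, Sigma_eta, f0, P0):
    """Kalman filter with a time-varying measurement dimension.

    Y, Lam0 and Sigma_zeta are lists over t: at each period the stacked
    observation Y_t = (X_t', Z_t')' has its own length, because the
    unstructured block Z_t is k_t x 1 with k_t changing over time.
    """
    f, P, filtered = f0, P0, []
    for y, L, R in zip(Y, Lam0, Sigma_zeta):
        # Prediction: F_t|t-1 = C F_{t-1|t-1}
        f_pred = C @ f
        P_pred = C @ P @ C.T + Sigma_eta
        # Update with the period-t measurement Y_t = Lam0_t F_t + zeta_t
        S = L @ P_pred @ L.T + R
        K = P_pred @ L.T @ np.linalg.inv(S)
        f = f_pred + K @ (y - L @ f_pred)
        P = P_pred - K @ L @ P_pred
        filtered.append(f)
    return filtered
```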

SLIDE 10

Model characteristics - extensions

  • Deals with missing values in the unstructured dataset.
  • Model estimation: Kalman filter and maximum likelihood.
  • The model can represent a variety of features of the unstructured data: squares and other higher moments.
  • $X_t$ with ragged edges.
  • The model can accommodate mixed frequencies: $X_t$ can follow a lower frequency and $Z_t$ a higher one.
  • This enables nowcasting and forecasting at both high and low frequency, extracting a high frequency factor.
SLIDE 11

A Monte Carlo Simulation

Different specifications of the DGP for $Z_t = B_t F_t + \epsilon_t$ (as simulated in the sketch below this list):

  • Idiosyncratic components $\xi_t$, $\epsilon_t$: both cross-sectionally and temporally independent (exact model).
  • $\epsilon_t \sim N(0, \Sigma_{\epsilon_t})$, $\Sigma_{\epsilon_t} = \sigma_{it}^2 \, I_{\max(k_t)}$, $\sigma_{it}^2 \in U(1, 3)$.
  • $u_t \sim N(0, I_n)$.
  • Assume $r = 1$, $n = 1$, $\Lambda_0 = \{1, \ldots, 1\}$.
  • Factor DGP: $C = \beta \, I_r$, $\beta = 0.5$.
  • $T = 50, 100, 200$.
  • $\max k_t = 10, 50, 100, 500, 1000$.
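A sketch of one Monte Carlo draw under this DGP, with $r = n = 1$ as assumed above (illustrative code; the rule for drawing $k_t$ each period is my own arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
T, max_kt, beta = 100, 50, 0.5

# Factor DGP: F_t = beta * F_{t-1} + eta_t  (r = 1)
F = np.zeros(T)
for t in range(1, T):
    F[t] = beta * F[t - 1] + rng.standard_normal()

# Unstructured block: Z_t = B_t F_t + eps_t, with time-varying dimension k_t
Z = []
for t in range(T):
    k_t = int(rng.integers(1, max_kt + 1))   # number of "events" at period t
    B_t = np.ones(k_t)                       # unit loadings, as assumed
    sigma2_t = rng.uniform(1, 3)             # idiosyncratic variance in U(1, 3)
    Z.append(B_t * F[t] + np.sqrt(sigma2_t) * rng.standard_normal(k_t))
```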

SLIDE 12

Comparators

Model 1: do not include $Z_t$ (standard factor model), i.e. $X_t = \Lambda F_t + \xi_t$.

Model 2:
$$Y_t^* = \begin{pmatrix} X_t \\ Z_t^* \end{pmatrix} = \begin{pmatrix} \Lambda \\ B_t^* \end{pmatrix} F_t + \begin{pmatrix} \xi_t \\ \epsilon_t^* \end{pmatrix}$$
where:

  • $Z_t^* = \sum_{k=1}^{k_t} Z_{kt} / k_t$ is the average of the unstructured dataset at each point in time $t$.
  • $\mathrm{Var}(\epsilon_t^*) = \bar{\sigma}^2_{i,t} / \max k_t$.

Keep the same factor structure.
SLIDE 13

Average of Relative RMSEs over 200 Monte Carlo simulations

True parameters: $\beta = 0.5$, $\sigma_i^2 \in U(1, 3)$

                 Model 1                               Model 2
T \ max(kt)   10     50     100    500    1000     10     50     100    500    1000
50            0.666  0.215  0.190  0.076  0.096    0.339  0.307  0.306  0.193  0.103
100           0.662  0.222  0.266  0.168  0.123    0.995  0.362  0.421  0.229  0.167
200           0.280  0.243  0.261  0.181  0.102    0.461  0.386  0.409  0.246  0.139

Table: Average relative RMSE of the high dimensional state space model (HSS) over Model 1 and Model 2 respectively. Model 1 does not include the unstructured dataset (Zt); Model 2 includes the average of Zt.
SLIDE 14

Empirical application - Forecasting economic variables using newspaper article scores

  • There are very many papers on forecasting with factor models.
  • The unstructured dataset $Z_t$: let $M$ be the maximum number of articles that appeared in any month, and let $z_t^k$ be a $k_t \times 1$ vector of sentiment scores, where $k$ is the number of articles at each period $t$ and $k_t = 1, \ldots, M$. This implies that for $s$ with $k < s < M$, the observations of $z_t^k$ are missing (see the appendix for an example).
  • We estimate the high dimensional state space model to extract a factor using sentiment/uncertainty scores extracted from newspaper articles (see the appendix for the text methods).¹
  • Benchmark: an FADL-type model of the form
$$\hat{x}_{t+h} = \hat{\alpha} + \hat{\beta} x_t + \sum_j \hat{\gamma}_j \chi_{jt},$$
    where the $\chi_{jt}$ are macro/financial factors (Redl, 2017).

¹The sentiment of each article is measured using a dictionary based method.
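The benchmark is straightforward to sketch as a direct h-step forecast (illustrative code under the FADL form above; variable names and inputs are placeholders, not the authors' exact specification):

```python
import numpy as np

def fadl_direct_forecast(x, chi, h):
    """Direct h-step FADL forecast: regress x_{t+h} on x_t and factors chi_t,
    then apply the fitted coefficients to the latest observation."""
    T = len(x) - h
    X = np.column_stack([np.ones(T), x[:T], chi[:T]])   # [1, x_t, chi_t]
    y = x[h:]                                           # x_{t+h}
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.concatenate([[1.0], [x[-1]], chi[-1]]) @ coef
```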

SLIDE 15

A selection of Forecast Results for Inflation relative to FADL

Text model              h = 3      h = 6      h = 9
Loughran sentiment      0.828**    0.823***   0.850***
Harvard sentiment       0.831***   0.813***   0.850***
vader sentiment         0.853***   0.856***   0.864***
stability sentiment     0.874      0.824***   0.845***
opinion sentiment       0.885***   0.932      0.930
tf idf econom           0.889*     0.865***   0.906*
Nyman sentiment         0.933***   0.964***   0.972
economcounts            0.938**    0.933**    0.965**
tf idf uncert           0.939      0.951      0.934
alexopoulos 09          0.964      0.953      0.973***
uncertaincounts         0.973      0.951      0.971
Afinn sentiment         0.985      1.004      1.001
baker bloom davis       1.001      0.967      0.973
husted                  1.001      0.979      0.983

Table: relative RMSEs, based on the factor estimated with each text method. * denotes rejection at the 10% level, ** at the 5% level and *** at the 1% level (Diebold-Mariano test).
SLIDE 16

A selection of Forecast Results for GDP growth relative to FADL

Text model              h = 3     h = 6     h = 9
Harvard sentiment       0.861**   0.745     0.688
Loughran sentiment      0.881     0.803     0.754
economcounts            0.905     0.852     0.829
opinion sentiment       0.910     0.855     0.824
stability sentiment     0.922     0.865     0.835
uncertaincounts         0.925     0.893     0.883
tf idf econom           0.926     0.871     0.845
vader sentiment         0.929*    0.863     0.809
alexopoulos 09          0.943     0.919     0.913
tf idf uncert           0.957     0.922     0.919
Nyman sentiment         0.960     0.923     0.893
Afinn sentiment         0.975     0.970     0.967
husted                  0.979     0.972     0.986
baker bloom davis       0.983     0.978     0.980

Table: relative RMSEs, based on the factor estimated with each text method. * denotes rejection at the 10% level, ** at the 5% level and *** at the 1% level (Diebold-Mariano test).
SLIDE 17

Time Variation in Machine Learning Models

SLIDE 18

Idea

  • Extend machine learning models to the applied time series setting and account for structural breaks using the kernel based approach of Giraitis et al. (2014).
  • Focus on regression-like tools because they are the most natural for macroeconomic applications. In particular, examine the support vector regressor (SVR) (Vapnik, 1998) and neural nets (Friedman et al., 2001), and propose a theoretical framework that allows for structural breaks.
SLIDE 19

Time-varying neural nets

A general definition of a multi-layer (deep) neural network follows. Let $x = (x_1, \ldots, x_p)'$ be the input vector. Let $h_1, \ldots, h_L$ be vectors of activation functions (see the appendix for popular choices), one for each of the $L$ (hidden) layers of the network, representing non-linear transformations of the data. Denote by $g_l$ the $l$-th layer, a vector of functions of length equal to the number $J_l$ of nodes in that layer, with $g_0 = x$. The overall structure of the network is
$$G = g_L(g_{L-1}(\cdots(g_0(\cdot))))$$
where
$$g_l(x) = W_{1,l} h_l (W_{2,l} x + b_l) \quad \forall\, 1 \le l \le L$$
and $W_{1,l}$, $W_{2,l}$ and $b_l$ are matrices and vectors of weight parameters.
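A literal reading of this layer structure as code (a minimal sketch; the layer sizes and random weights are arbitrary):

```python
import numpy as np

def layer(x, W1, W2, b, h):
    """g_l(x) = W1 h(W2 x + b), with the activation h applied elementwise."""
    return W1 @ h(W2 @ x + b)

rng = np.random.default_rng(0)
p, J = 4, 8                                   # input dimension, nodes in the layer
x = rng.standard_normal(p)
W2_1, b1 = rng.standard_normal((J, p)), rng.standard_normal(J)
W1_1 = rng.standard_normal((1, J))
G = layer(x, W1_1, W2_1, b1, np.tanh)         # output of a one-layer network
```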
SLIDE 20

Time-varying neural nets

The model can then be written as (Friedman et al., 2001):
$$y_t = G(x_t, \beta^0) + \varepsilon_t, \quad t = 1, \ldots, T, \tag{2}$$
where $x_t$ is $p \times 1$, $\beta^0$ is $k \times 1$ and contains all model parameters, and $G$ denotes the overall nonlinear mapping. We estimate this model by penalised least squares, i.e.
$$\hat{\beta} = \arg\min_{\beta} \frac{\| y - G(X, \beta) \|_2^2}{T} + \lambda \| \beta \|_1,$$
where $y = (y_1, \ldots, y_T)'$ and $G(X, \beta) = (G(x_1, \beta), \ldots, G(x_T, \beta))'$.
SLIDE 21

Time-varying neural nets

Let the model now be extended to the case
$$y_t = G(x_t, \beta_t^0) + \varepsilon_t \tag{3}$$
where $\beta_t^0$ is a persistent, bounded, possibly stochastic process and
$$\left\| \beta_t^0 - \beta_s^0 \right\|_2 \le C \, \frac{|t - s|}{\min(t, s)} \sup_{s \le h \le t} \left\| \beta_h^0 \right\|, \quad \text{for some } C > 0. \tag{4}$$
We estimate this model by time varying penalised least squares, i.e.
$$\hat{\beta}_t = \arg\min_{\beta} \frac{\| y - G(X, \beta) \|_{w_t,2}^2}{T} + \lambda \| \beta \|_1,$$
where
$$\| y - G(X, \beta) \|_{w_t,2}^2 = \sum_{j=1}^{T} w_{t,j} \left( y_j - G(x_j, \beta) \right)^2, \qquad w_{t,j} = K\!\left( \frac{t - j}{H} \right),$$
for some kernel function $K$ and bandwidth $H = o(T)$, $H \to \infty$.
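The kernel weighting amounts to multiplying each observation's squared residual by $w_{t,j}$ before penalising. A sketch of the objective, under my own choice of a Gaussian kernel (the slides leave $K$ generic):

```python
import numpy as np

def kernel_weights(t, T, H):
    """Gaussian kernel weights w_{t,j} = K((t - j) / H) for j = 0, ..., T-1."""
    j = np.arange(T)
    return np.exp(-0.5 * ((t - j) / H) ** 2)

def tv_penalised_loss(beta, y, X, G, t, H, lam):
    """Time-varying penalised least squares objective at time t."""
    w = kernel_weights(t, len(y), H)
    resid = y - G(X, beta)                 # G returns the T fitted values
    return (w * resid**2).sum() / len(y) + lam * np.abs(beta).sum()
```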

SLIDE 22

Time-varying neural nets

  • There is considerable work in the above setting where $\beta_t^0$ is allowed to be stochastic.
  • In a series of papers, Giraitis, Kapetanios et al. (2014 JoE, 2015 JAE, 2018 JTSA) show that kernel based estimation of $\beta_t^0$ is, in many contexts (regression, ML), consistent and asymptotically normal, even if $\beta_t^0$ is stochastic and satisfies a smoothness condition.
  • Recently, Dendramis, Giraitis and Kapetanios (2019) extend this to a large dimensional setting, providing sharp exponential probability inequalities for weighted, randomly scaled sums of mixing and possibly fat-tailed data, which allows time varying estimation of large covariance matrices.
SLIDE 23

Time-varying neural nets

Theorem

Let model (3) with condition (4) hold, and let (i) $\varepsilon_t$ be a martingale difference process that is independent of $x_t$ and (ii) $G$ be a function with bounded first derivatives. Then, for all $t$,
$$\frac{\left\| G(X, \hat{\beta}_t) - G(X, \beta^{(T),0}) \right\|_{w_t,2}^2}{T} = O_p\!\left( \left( \frac{\log k}{H} \right)^{1/2} \sup_t \left\| \beta_t^0 \right\|_1 \right), \tag{5}$$
where $\beta^{(T),0} = (\beta_1^{0\prime}, \ldots, \beta_T^{0\prime})'$ and $G(X, \beta^{(T),0}) = (G(x_1, \beta_1^0), \ldots, G(x_T, \beta_T^0))'$.
SLIDE 24

Time-varying SVR

  • Let $x_t = [x_{1t}, \ldots, x_{Nt}]'$ be the vector of covariates and $y_t$ the target variable:
$$y_t = \beta^{0\prime} x_t + \varepsilon_t. \tag{6}$$
  • Recall that SVR operates by solving (Vapnik et al. (1997), pp. 156-158):
$$\hat{\beta} = \min_{\beta} \| \beta \|^2, \quad \text{s.t.} \quad y_t - \beta' x_t \le \epsilon + \xi_t, \quad \beta' x_t - y_t \le \epsilon + \xi_t^*, \quad \xi_t, \xi_t^* \ge 0,$$
where $\epsilon$ denotes a preselected error margin (tuning) parameter and $\xi_t$, $\xi_t^*$ are slack variables.
SLIDE 25

Time-varying SVR

Dual formulation of the problem:
$$\max_{\alpha, \alpha^*} \; -\frac{1}{2} \sum_{i,j=1}^{T} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, x_i' x_j - \epsilon \sum_{i=1}^{T} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{T} (\alpha_i - \alpha_i^*)\, y_i$$
$$\text{s.t.} \quad \sum_{i=1}^{T} (\alpha_i - \alpha_i^*) = 0, \quad \alpha_i, \alpha_i^* \ge 0.$$
Then
$$\hat{\beta}_t = \sum_{i=1}^{T} (\alpha_{ti} - \alpha_{ti}^*)\, x_i.$$

The value of the parameter $\epsilon$ defines a margin of tolerance within which no penalty is given to the errors. Thus, the formulation can be viewed as a penalised optimisation procedure in which a positive value controls the penalty imposed on observations that lie outside the $\epsilon$-margin and helps to prevent overfitting (Steinwart and Christmann, 2008).
SLIDE 26

Time-varying SVR

Following Giraitis et al. (2014) and Kapetanios and Zikes (2018), we incorporate weights $w_{t,j} = K\!\left(\frac{t-j}{H}\right)$, for some kernel function $K$ centered on the time-point of interest and decaying for more distant observations, with bandwidth $H = o(T)$, $H \to \infty$. The optimization problem becomes:
$$\max_{\alpha_t, \alpha_t^*} \; -\frac{1}{2} \sum_{i,j=1}^{T} w_{t,j} w_{t,i} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, x_i' x_j - \epsilon \sum_{i=1}^{T} w_{t,i} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{T} w_{t,i} (\alpha_i - \alpha_i^*)\, y_i \tag{7}$$
$$\text{s.t.} \quad \sum_{i=1}^{T} (\alpha_i - \alpha_i^*) = 0, \quad \alpha_i, \alpha_i^* \ge 0, \quad \sum_{i=1}^{T} w_{t,i} = T.$$
Then
$$\hat{\beta}_t = \sum_{i=1}^{T} w_{t,i} (\alpha_{ti} - \alpha_{ti}^*)\, x_i.$$
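In practice, a rough approximation to this estimator is to pass the kernel weights as per-observation weights to an off-the-shelf SVR. A sketch (not the authors' code; scikit-learn's `sample_weight` rescales each observation's slack penalty rather than entering the dual exactly as in (7)):

```python
import numpy as np
from sklearn.svm import SVR

def tv_svr_fit(X, y, t, H, eps=0.1):
    """Fit a linear SVR localised around time t via Gaussian kernel weights."""
    T = len(y)
    w = np.exp(-0.5 * ((t - np.arange(T)) / H) ** 2)
    w *= T / w.sum()                  # normalise so the weights sum to T
    model = SVR(kernel="linear", epsilon=eps)
    model.fit(X, y, sample_weight=w)
    return model
```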

SLIDE 27

Empirics - Forecasting US Industry portfolios

Dataset

  • Targets: three standard industry portfolios of U.S. equities.²
  • Cnsm: Consumer Durables, NonDurables, Wholesale, Retail.
  • Manuf: Manufacturing, Energy, Utilities.
  • HiTech: Business Equipment, Telephone, Television Transmission.
  • Predictors: the "zoo factors" of Feng et al. (2019).³
  • Full sample length: 1976m1 - 2017m10; the evaluation period starts in 2002.

²We focus on portfolios rather than individual assets because they have more stable betas, higher signal-to-noise ratios, and are less prone to missing data issues. Data from Kenneth French's website.
³150 risk factors at the monthly frequency, July 1976 to December 2017.
SLIDE 28

Forecasting US industry portfolios

Table: Out-of-sample relative RMSEs using time-varying and standard ML models. Benchmark: AR(1).

Steps ahead     (1)     (3)     (6)     (9)     (12)
Cnsm
  TVSVM         0.750   0.735   0.734   0.733   0.829
  SVM           0.868   0.844   0.830   0.839   0.831
  TVNN          0.940   1.000   1.000   0.890   0.983
  NN            1.106   1.030   1.090   0.925   1.050
  TV-BOOST      1.01    1.299   1.066   0.991   0.966
Manuf
  TVSVM         0.684   0.864   0.800   0.605   0.798
  SVM           0.892   0.871   0.809   0.811   0.793
  TVNN          1.035   1.068   0.947   0.984   0.887
  NN            0.968   1.299   0.991   0.944   0.936
  TV-BOOST      1.026   0.976   0.960   0.976   0.978
SLIDE 29

Forecasting US industry portfolios

Table: Out-of-sample relative RMSEs using time-varying and standard ML models. Benchmark: AR(1).

Steps ahead     (1)     (3)     (6)     (9)     (12)
HiTech
  TVSVM         0.608   0.604   0.805   0.809   0.813
  SVM           0.829   0.811   0.803   0.822   0.821
  TVNN          0.982   0.987   0.997   0.964   0.908
  NN            0.926   0.927   0.926   0.944   0.983
  TV-BOOST      1.032   0.981   0.991   0.992   0.989
SLIDE 30

Empirics 2 - UK GDP growth

  • Target: monthly UK GDP growth; predictors: a large panel of survey data.
  • We use the time-varying ML models to forecast monthly UK GDP at h = 1, 3, 6, 9, 12 steps ahead using a large panel of survey indicators.
  • Full sample length: 2000m1 - 2018m8.
  • We reserve 92 months for forecast evaluation (post crisis).
  • We derive direct forecasts from the time-varying neural nets and time-varying support vector regressions and compare them with a standard AR(1). For comparison, forecasts are also derived using standard neural nets and support vector regressors under the same specifications.
SLIDE 31

Empirical Application - GDP growth

In-sample period: 2000m1-2009m12

Model    h = 1      h = 3      h = 6      h = 9      h = 12
TVNN     0.750***   0.698*     0.742*     0.840      0.825*
NN       0.843      0.721      0.778      0.843      0.900***
TVSVR    0.945      0.925      0.887      0.915      0.987
SVR      0.716      0.646***   0.664***   0.696*     0.764

Table: Average RMSEs at h = 1, 3, 6, 9, 12 for the time-varying and standard ML models relative to the AR(1). *, **, *** denote rejection by the Diebold and Mariano (1995) test with Harvey's (1997) adjustment for predictive accuracy, at the 10%, 5% and 1% levels respectively.
SLIDE 32

Interpretability of Deep Neural Networks

SLIDE 33

Key Summary

We provide an alternative toolbox for interpreting deep neural networks in the context of macroeconomic modelling. In this presentation, we

  • propose the use of partial derivatives to enhance model interpretability.
  • introduce a non-linear and time-varying impulse response analysis.
  • provide a first and preliminary empirical example (US economy).

One of the key criticisms of (deep) neural networks is the limited scope for model interpretability. With this project, we aim to shed some light on this black box.

First preliminary results look quite interesting, so we pursue this idea further. For example, we find that

  • for certain dependent variables, input variable influence only changes during specific points in time (e.g. times of increased volatility), while it remains constant otherwise. For other dependent variables, this variable influence seems to change more frequently.
  • non-linear impulse responses can be supported by economic theory.
SLIDE 34

Interpretability in AI or ML in general (1/2)

There is no clear definition of interpretability, but it can be summarised as: the degree to which a human can understand the cause of a decision, see Miller (2018).

Why do we care at all about interpretability if accuracy is competitive? A single metric such as out-of-sample MSE is an incomplete description, e.g. see Doshi-Velez and Kim (2017). There are, however, many other reasons why interpretability is relevant (e.g. see Molnar (2019)), including:

  • scientific research
  • safety measures
  • detecting biases
  • debugging
SLIDE 35

Interpretability in AI or ML in general (2/2)

There already exist various approaches to enhancing model interpretability in ML; examples include

  • Partial Dependence, e.g. see Friedman (2001).
  • Individual Conditional Expectation, e.g. see Goldstein et al. (2015).
  • Accumulated Local Effect, e.g. see Apley (2016).
  • Shapley Values, e.g. see Joseph (2019).

In this project, we turn our focus explicitly to neural networks in macroeconomics.

SLIDE 36

Motivation

The central motivation for this research is two-fold:

1. Universal approximation: multilayer feed-forward networks are universal approximators, e.g. see Hornik et al. (1989).
2. Time-varying effects: through their inherent non-linearity, neural networks can display time-varying effects, e.g. see Kapetanios (2007).

We therefore hope to make meaningful contributions to current debates by offering time-varying analysis tools.
SLIDE 37

Neural Network Setup (1/3)

In its most fundamental form, we describe an arbitrary economic process as
$$y_t = E(y_t \mid x_{t-1}) + \epsilon_t, \tag{8}$$
where
$$E(y_t \mid x_{t-1}) = g(x_{i,t-1}, \theta) \tag{9}$$
is a (non-)linear approximation of the true but unknown DGP. Note the lag in $x$, which is supported by publication delays but also by idiosyncratic persistence. Moreover, we allow $x$ to include lags of $y$. All model parameters, including the weights and biases as well as hyper-parameters, are denoted by $\theta$.

What is $g(\cdot)$?
SLIDE 38

Neural Network Setup (2/3)

We investigate the case where we approximate $g(\cdot)$ with a neural network. In general, for feedforward networks, $g(\cdot)$ takes the form
$$g(X, \Theta) = \sigma_L(\ldots \sigma_2(W_2^T \sigma_1(W_1^T X + b_1) + b_2)), \tag{10}$$
where the $\sigma_l$ are activation functions and $l = 1, 2, \ldots, L$ indexes the layers. While the weights $W$ and biases $b$ are summarised in $\Theta$, all model parameters, including weights, biases and other hyper-parameters, are denoted by $\theta$, e.g. see Goodfellow et al. (2016). The considered loss function is
$$\tilde{L}(Y, X, \theta) = L(Y, X, \Theta) + \lambda \Omega(\Theta), \quad \text{with} \quad L(Y, X, \Theta) = (g(X, \Theta) - Y)^T (g(X, \Theta) - Y). \tag{11}$$
Moreover, with $L_1$, respectively $L_2$, regularisation the loss function becomes
$$\tilde{L}(Y, X, \theta) = (g(X, \Theta) - Y)^T (g(X, \Theta) - Y) + \lambda \|\Theta\|_1 \tag{12}$$
$$\tilde{L}(Y, X, \theta) = (g(X, \Theta) - Y)^T (g(X, \Theta) - Y) + \tfrac{1}{2} \lambda \Theta^T \Theta \tag{13}$$
SLIDE 39

Neural Network Setup (3/3)

In this first preliminary draft, we consider:

  • one-layer networks
  • non-linear activation functions (e.g. tanh, ReLU)
  • stochastic gradient descent
  • a 90%-10% training-test split, with a 20% validation split
  • a fixed window (alternatives such as rolling or expanding windows are considered for future drafts)
  • random grid search for hyperparameter tuning
  • L1 and/or L2 regularisation for the weights

Many other specifications are imaginable!
SLIDE 40

Variable Influence (1/2)

We propose using partial derivatives at each point in time to evaluate variable influence over time:
$$I_{ij,t} = \frac{\partial g_j(X, \theta)}{\partial x_{i,t-p}} \tag{14}$$
Note that in this preliminary draft, each target variable $j$ ($j = 1, 2, \ldots, N$) has its own $g(\cdot)$. Alternatives are also imaginable, where the network output could be multidimensional. Due to the inherent non-linearity of the neural network, we expect the derivative to vary over time.

The motivation for using the partial derivative is that the derivative can be interpreted as the marginal influence each input variable has on the process at each point in time. It can also be used to test Granger causality.
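With a fitted network, (14) is a single automatic-differentiation call. A sketch (illustrative PyTorch code; `model` stands for any fitted network mapping a T x p input matrix to T outputs, and the authors do not specify their implementation):

```python
import torch

def variable_influence(model, X):
    """I[t, i] = d g(x_t) / d x_{i,t}: per-period partial derivatives of the
    network output with respect to each input variable."""
    X = X.clone().requires_grad_(True)    # T x p matrix of (lagged) inputs
    model(X).sum().backward()             # summing keeps gradients row-wise
    return X.grad                         # T x p matrix of influences
```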

SLIDE 41

Variable Influence (2/2)

Moreover, we propose the use of confidence bands as an extension of equation (14). In particular, we propose the moving-block bootstrap methodology: we draw $T - b + 1$ overlapping blocks from the original data, where $T$ is the total number of observations per variable and $b$ denotes the block length. From these blocks, $T/b$ blocks are drawn at random with replacement and concatenated in the order drawn to build the bootstrapped observations, e.g. see Kunsch (1989). For each bootstrapped dataset, we fit a neural network as described before. We then calculate the partial derivatives, but with respect to the original input data:
$$I_{ijB,t} = \frac{\partial g_{j,B}(\cdot)}{\partial x_{i,t-p}}, \tag{15}$$
where $B$ indicates the respective bootstrap replication.
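The resampling step itself is short. A minimal sketch (my own illustrative code; the block length b is a tuning choice):

```python
import numpy as np

def moving_block_bootstrap(data, b, rng):
    """Resample the rows of a T x p array by moving blocks (Kunsch, 1989)."""
    T = len(data)
    blocks = [data[s:s + b] for s in range(T - b + 1)]     # overlapping blocks
    draws = rng.integers(0, len(blocks), size=int(np.ceil(T / b)))
    return np.concatenate([blocks[d] for d in draws])[:T]  # trim to length T
```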

SLIDE 42

Non-linear impulse response functions

Similarly to the methodology applied in a linear setting, we propose the use of impulse response functions (IRFs) in the context of neural networks. Generally, we apply the framework
$$IRF(h, \nu, \Omega) = E[y_{t+h} \mid \nu_t, \Omega_{t-1}] - E[y_{t+h} \mid \Omega_{t-1}], \tag{16}$$
where $\nu_t$ denotes structural shocks at time $t$, e.g. see Koop et al. (1997). Given
$$y_t = g(x_{i,t-1}, \theta) + \epsilon_t, \tag{17}$$
with
$$E[\epsilon_t \epsilon_t'] = \Sigma \tag{18}$$
a symmetric and positive definite covariance matrix whose off-diagonals are non-zero, we apply a Cholesky decomposition of the covariance matrix such that $\Sigma = PP'$, where $P$ is a lower-triangular matrix. It follows that
$$\nu_t = P u_t \tag{19}$$
with $u_t$ being the reduced form residuals, e.g. see Sims (1980).
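Evaluating (16) for a fitted network is then a simulation exercise. A sketch (illustrative code; `model` is a placeholder for the fitted one-step map, and common random numbers are used for the shocked and baseline paths):

```python
import numpy as np

def nonlinear_irf(model, x0, Sigma, shock_var, shock_size, h, n_sims, rng):
    """IRF(h) = E[y_{t+h} | shock, history] - E[y_{t+h} | history]."""
    P = np.linalg.cholesky(Sigma)             # Sigma = P P'
    nu = np.zeros(len(Sigma))
    nu[shock_var] = shock_size
    base, shocked = [], []
    for _ in range(n_sims):
        eps = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=h)
        y_b, y_s = x0.copy(), x0 + P @ nu     # impose the structural shock at t
        for j in range(h):                    # iterate the fitted map forward
            y_b = model(y_b) + eps[j]
            y_s = model(y_s) + eps[j]
        base.append(y_b)
        shocked.append(y_s)
    return np.mean(shocked, axis=0) - np.mean(base, axis=0)
```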

SLIDE 43

US Economy – Overview

We consider the US economy as an empirical example, with GDP, inflation, unemployment, export prices, S&P 500 returns and Fed Funds rates as our dependent variables, and their lagged values as explanatory variables. The dataset ranges from Q4 1984 to Q2 2019.

In particular, we include export prices as another price variable to avoid encountering the price puzzle (e.g. see Sims (1992), Buch et al. (2014), Balke et al. (1994)). However, we find that an impulse response analysis (e.g. using a VAR) occasionally still displays the price puzzle, depending on where we split the data.

We transform the variables to make them stationary. In this draft, we feed the network the already transformed data; in next steps we will experiment with standardised and/or raw data.

Since the Cholesky decomposition is sensitive to variable ordering, we use the following ordering:

GDP → CPI → Unemployment → Export Prices → S&P 500 → Rates
SLIDE 44

US Economy – Predictions

Figure: Full Sample (in- and out-of-sample) predictions

SLIDE 45

US Economy – Variable Influence (CPI)

Figure: Partial Derivatives: CPI

SLIDE 46

US Economy – Variable Influence (GDP)

Figure: Partial Derivatives: GDP

SLIDE 47

US Economy – Variable Influence (Rates)

Figure: Partial Derivatives: Rates

SLIDE 48

US Economy – Variable Influence Interpretation

  • We find that variable influence, as displayed by the partial derivative, is sensitive to time. We can therefore confirm our initial hypothesis.
  • In particular, we find that the characteristic pattern of variable influence differs across dependent variables.
  • In the case of GDP, for example, the network is most sensitive to changes in lagged values of GDP most of the time. It is only during times of increased volatility that the network also becomes more sensitive to changes in other explanatory variables.
  • For rates, on the other hand, the neural network seems to be much more sensitive to changes in almost all variables. In absolute values, CPI and unemployment seem to have the largest influence.
SLIDE 49

US Economy – Impulse Responses (Shock in Rates, Response in CPI)

  • We find that negative shocks in rates lead to positive responses in CPI, as supported by economic theory.
  • The magnitude of the response is time-dependent.
  • During the GFC, a negative shock in rates cannot offset the effect of the crisis, and CPI falls despite reduced rates.
SLIDE 50

US Economy – Impulse Responses (Shock in GDP, Response in Rates)

  • We find that rates respond negatively to a negative shock in GDP, as expected by economic theory.
  • We find that a shock prior to the GFC leads to a lower level of rates than one after the crisis.
SLIDE 51

US Economy – Impulse Responses (Shock in CPI, Response in Rates)

  • We find that rates respond negatively to a negative shock in CPI.
  • This is supported by economic theory.
  • In particular, the effects of a shock in CPI seem to be more distinct prior to the GFC.
SLIDE 52

Conclusion

We find that both partial derivatives and non-linear impulse responses can help to shed some light on the black box and to relate its behaviour to economic theory. First preliminary results look promising, so we will pursue this project further. There is room for further improvement, in particular with regard to model tuning and selection.
SLIDE 53

Thank you

SLIDE 54

References I

Alexopoulos, M., Cohen, J., et al. (2009). Uncertain times, uncertain measures. University of Toronto Department of Economics Working Paper, 352.
Apley, D. W. (2016). Visualizing the effects of predictor variables in black box supervised learning models. arXiv preprint arXiv:1612.08468.
Baker, S. R., Bloom, N., and Davis, S. J. (2016). Measuring economic policy uncertainty. The Quarterly Journal of Economics, 131(4):1593–1636.
Balke, N. S., Emery, K. M., et al. (1994). Understanding the price puzzle. Federal Reserve Bank of Dallas Economic Review, Fourth Quarter, pages 15–26.
Buch, C. M., Eickmeier, S., and Prieto, E. (2014). Macroeconomic factors and microlevel bank behavior. Journal of Money, Credit and Banking, 46(4):715–751.
Correa, R., Garud, K., Londono, J. M., Mislang, N., et al. (2017). Constructing a dictionary for financial stability. Technical report, Board of Governors of the Federal Reserve System (US).
SLIDE 55

References II

Doshi-Velez, F. and Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
Feng, G., Giglio, S., and Xiu, D. (2019). Taming the factor zoo: A test of new factors. Technical report, National Bureau of Economic Research.
Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232.
Gilbert, C. H. E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International Conference on Weblogs and Social Media (ICWSM-14). Available at (20/04/16) http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf.
Giraitis, L., Kapetanios, G., and Yates, T. (2014). Inference on stochastic time-varying coefficient models. Journal of Econometrics, 179(1):46–65.
SLIDE 56

References III

Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65.
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep Learning, volume 1. MIT Press, Cambridge.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.
Hu, G., Bhargava, P., Fuhrmann, S., Ellinger, S., and Spasojevic, N. (2017). Analyzing users' sentiment towards popular consumer industries and brands on Twitter. In Data Mining Workshops (ICDMW), 2017 IEEE International Conference on, pages 381–388. IEEE.
Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM.
SLIDE 57

References IV

Husted, L. F., Rogers, J., and Sun, B. (2017). Monetary policy uncertainty. International Finance Discussion Papers 1215, Board of Governors of the Federal Reserve System (U.S.).
Joseph, A. (2019). Shapley regressions: A framework for statistical inference on machine learning models. arXiv preprint arXiv:1903.04209.
Kapetanios, G. (2007). Measuring conditional persistence in nonlinear time series. Oxford Bulletin of Economics and Statistics, 69(3):363–386.
Kapetanios, G. and Zikes, F. (2018). Time-varying lasso. Economics Letters, 169:1–6.
Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, pages 1217–1241.
Loughran, T. and McDonald, B. (2013). IPO first-day returns, offer price revisions, volatility, and form S-1 language. Journal of Financial Economics, 109(2):307–326.
Miller, T. (2018). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence.
SLIDE 58

References V

Molnar, C. (2019). Interpretable Machine Learning. Lulu.com.
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.
Nyman, R., Kapadia, S., Tuckett, D., Gregory, D., Ormerod, P., and Smith, R. (2018). News and narratives in financial systems: exploiting big data for systemic risk assessment. Bank of England Staff Working Papers, 704.
Sims, C. A. (1980). Macroeconomics and reality. Econometrica, pages 1–48.
Sims, C. A. (1992). Interpreting the macroeconomic time series facts: The effects of monetary policy. European Economic Review, 36(5):975–1000.
Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer Science & Business Media.
Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3):1139–1168.
SLIDE 59

References VI

Vapnik, V., Burges, C. J., Kaufman, L., Smola, A. J., and Drucker, H. (1997). Support vector regression machines. In Advances in Neural Information Processing Systems, pages 155–161.
SLIDE 60

Appendix

SLIDE 61

Example of spreadsheet of an unstructured dataset

SLIDE 62

Text methods: Algorithm-based metrics

The metrics combine positive/negative dictionary, Boolean, and computer science-based approaches:

  • Financial stability (Correa et al., 2017)
  • Economic Uncertainty (Alexopoulos et al., 2009)
  • VADER sentiment (Gilbert, 2014)
  • Finance oriented (Loughran and McDonald, 2013)
  • Monetary policy uncertainty (Husted et al., 2017)
  • 'Opinion' sentiment (Hu et al., 2017; Hu and Liu, 2004)
  • Afinn sentiment (Nielsen, 2011)
  • Economic Policy Uncertainty (Baker et al., 2016)
  • Punctuation economy (this paper)
  • Harvard IV (used in Tetlock (2007))
  • Anxiety-excitement (Nyman et al., 2018)
  • Single word counts of "uncertain" and "econom"
  • tf-idf applied to "uncertain" and "econom"
SLIDE 63

Average of Relative RMSEs over 500 Monte Carlo simulations

True parameters: $\sigma = 1$, $q = 0.9$

              Model 1                   Model 2
T \ max kt    100    500    1000       100    500    1000
50            0.928  0.750  0.640      0.928  0.757  0.635
100           0.920  0.741  0.642      0.914  0.742  0.640
400           0.921  0.742  0.635      0.920  0.755  0.640
1000          0.914  0.742  0.6357     0.921  0.748  0.647

Table: Average RMSEs of the high dimensional state space model relative to the comparator models. Model 1 does not include the unstructured dataset (Zt); Model 2 includes the average of Zt.
SLIDE 64

Popular activation functions

  • Logistic function: $h(x) = \frac{1}{1 + \exp(-x)}$
  • Hyperbolic tangent (tanh) function: $h(x) = \tanh(x)$
  • Rectified Linear Unit (ReLU) function: $h(x) = \max(0, x)$
SLIDE 65

Conditional Variances

$$\mathrm{Var}(F \mid X_1, \ldots, X_N) = \Sigma_{FF} - \Sigma_{FX} \Sigma_{XX}^{-1} \Sigma_{XF}.$$
Given that
$$\mathrm{Var}(X_i) = 1 + \sigma_i^2, \quad \mathrm{Var}(\bar{X}) = 1 + \frac{1}{N^2} \sum_{i=1}^{N} \sigma_i^2, \quad \mathrm{Cov}(X_i, F) = 1, \quad \mathrm{Cov}(\bar{X}, F) = 1,$$
we have
$$\mathrm{Var}(F \mid \bar{X}) = 1 - \frac{1}{1 + \frac{1}{N^2} \sum_{i=1}^{N} \sigma_i^2},$$
and, applying the Sherman-Morrison formula,
$$\mathrm{Var}(F \mid X_1, \ldots, X_N) = 1 - \left( \sum_{i=1}^{N} \frac{1}{\sigma_i^2} - \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \frac{1}{\sigma_i^2 \sigma_j^2}}{1 + \sum_{i=1}^{N} \frac{1}{\sigma_i^2}} \right).$$
SLIDE 66

Conditional Variances

It holds that $\mathrm{Var}(F \mid \bar{X}) \ge \mathrm{Var}(F \mid X_1, \ldots, X_N)$, i.e.
$$\sum_{i=1}^{N} \frac{1}{\sigma_i^2} - \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \frac{1}{\sigma_i^2 \sigma_j^2}}{1 + \sum_{i=1}^{N} \frac{1}{\sigma_i^2}} \ge \frac{1}{1 + \frac{1}{N^2} \sum_{i=1}^{N} \sigma_i^2}.$$
If $\sigma_i^2 = \sigma^2$ for all $i \in N$, this becomes
$$\frac{N}{\sigma^2} - \frac{N^2 / \sigma^4}{1 + N / \sigma^2} - \frac{1}{1 + \sigma^2 / N} \ge 0. \tag{20}$$
We set $\alpha = \frac{N}{\sigma^2}$. Then
$$\alpha - \frac{\alpha^2}{1 + \alpha} - \frac{1}{1 + \frac{1}{\alpha}} \ge 0$$
or
$$\alpha + \alpha^2 + 1 + \alpha - \alpha^2 - \alpha - 1 - \alpha \ge 0.$$
But
$$\alpha + \alpha^2 + 1 + \alpha - \alpha^2 - \alpha - 1 - \alpha = 0,$$
so the inequality holds with equality: in the homoskedastic case the two conditional variances coincide and $\bar{X}$ loses no information.