SLIDE 1

Augmenting simple models with machine learning

Jim Savage
Data Science Lead, Lendable
@khakieconomist

SLIDE 2

Cheers to

  • Sarah Tan
  • David Miller
  • Chris Edmond
  • Eugene Dubossarsky
SLIDE 3

Outline

  • Estimating causal relationships
  • Proximity score matching
  • The problem of model non-stationarity in time-series models

SLIDE 4

SLIDE 5

Problem #1: drawing causal inference from observational data

Question for the audience: What is a college degree worth? How would you go about estimating it?

SLIDE 6

Experimental vs. observational data

Experimental data = easy causal inference. Observational data = hard “causal” inference.

  • We want to know E(dy | exogenous intervention in X), i.e. how much we expect y to change
  • We have never observed an exogenous intervention in X

SLIDE 7

Not a predictive problem!

  • Predictive models give us E(y|X): fancy correlation
  • Looks the same, but is wildly different
  • In the absence of causal reasoning, more data & fancier models often just make us more certain of the wrong answer.
SLIDE 8

Neyman-Rubin causal model

The fundamental problem of causal inference is that booting up a parallel universe whenever we want to draw causal inference is too much work.

SLIDE 9

How do we estimate treatment effects?

  • Regression with controls (try to take care of the effect of observed confounders)
  • Panel data
  • Natural experiments
  • Matching
SLIDE 10

Regression helps us deal with observed confounders

SLIDE 11

But be careful what you control for

SLIDE 12

Multiple observations of the same unit over time can help control for unobserved confounders that don’t vary over time

SLIDE 13

IV & natural experiments can help… and are difficult to find, impossible to verify

SLIDE 14

Pros and cons

  • All the above methods are better than comparing averages!
  • Often no good natural experiments exist (makes IV hard!)
  • Often we’re worried that unobserved confounders vary over time (fixed-effects assumption violated)
  • Decisions still have to be made
SLIDE 15

Matching methods

  • Idea: build up a control group that is as similar as possible to the treatment group
  • Run your analysis (comparison of groups, regression, etc.) on this sub-group. Discard those who were never likely to take up treatment.

SLIDE 16

Matching methods

  • Idea: build up a control group that is as similar as possible to the treatment group
  • Once you have this “synthetic control”, use some causal model.
  • Pray it has balanced your groups on the factors that matter.

SLIDE 17

Exact matching

  • Pair each treated observation with an untreated observation that is the same on observed covariates.
  • Run out of dimensions very quickly…
SLIDE 18

Matching using a metric

  • Define some metric for matching (Euclidean, Mahalanobis, etc.)
  • Group observations that are “close” in the X space
  • Run analysis on this subset.
  • But which Xs matter?
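A minimal sketch of metric matching using the Mahalanobis distance mentioned above; the data, names, and greedy nearest-neighbour rule are illustrative (the slides don't prescribe an implementation), with SciPy/scikit-learn standing in for whatever tooling the talk used.

```python
# Illustrative data: 100 units, 3 covariates, ~30% treated.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # covariate matrix
treated = rng.random(100) < 0.3        # treatment indicator

# Mahalanobis distance needs the inverse covariance of the Xs
VI = np.linalg.inv(np.cov(X, rowvar=False))

# Distance from every treated unit to every untreated unit
d = cdist(X[treated], X[~treated], metric="mahalanobis", VI=VI)

# Greedy match: nearest untreated unit for each treated unit (with replacement)
match_idx = d.argmin(axis=1)
control_matched = X[~treated][match_idx]
```

Note that all the Xs enter the metric symmetrically, which is exactly the "but which Xs matter?" problem the next slides address.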
SLIDE 19

Propensity score matching

  • Estimate a model of the “propensity to get treatment”: p(treated | X)
  • For each treated observation, choose an untreated observation whose modelled propensity is closest (or use some other matching technique).
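The two steps above can be sketched with a logistic propensity model and greedy nearest-neighbour matching on the score. This is an illustrative toy in Python/scikit-learn, not the talk's own code; the simulated data make treatment depend on the first covariate.

```python
# Illustrative data where treatment probability depends on X[:, 0]
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
treated = rng.random(500) < 1.0 / (1.0 + np.exp(-X[:, 0]))

# Step 1: estimate p(treated | X) with a logistic model
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the untreated unit with the
# closest modelled propensity (greedy, with replacement)
t_idx = np.where(treated)[0]
c_idx = np.where(~treated)[0]
matches = c_idx[np.abs(ps[t_idx][:, None] - ps[c_idx][None, :]).argmin(axis=1)]
```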

SLIDE 20

SLIDE 21

Propensity score matching

  • Big problem: the “Smith-Todd” critique
  • Change your propensity model, change your treatment effect. Can be meaningless.
  • Despite this, very widely used (~15k citations)
SLIDE 22

Proximity matching

  • Like metric matching, we match on the Xs
  • Like propensity score matching, we take into account how the Xs affect treatment probability.
  • Use the Random Forest proximity matrix
SLIDE 23

CART

SLIDE 24

The Random Forest

  • Essentially a collection of CART models
  • Each estimated on a random subset of the data
  • In each node, a sample of possible Xs is drawn to be considered for a split
  • Each tree is fairly different.
SLIDE 25

Random forest proximity

  • When two people end up in the same terminal node of a tree, they are said to be proximate
  • The proximity score (i, j) is the proportion of trees in which individuals i and j share a terminal node.
  • We calculate it on held-out observations
  • It is a measure of similarity between two individuals in terms of their Xs
  • But only similarity in terms of the Xs that matter to y
  • A metric-free, scale-invariant, supervised similarity score
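Assuming scikit-learn as a stand-in, the proximity matrix can be computed from the leaf assignments that a fitted forest returns via `apply()`. For brevity this sketch scores all observations rather than the held-out ones the slide recommends, and the data are simulated.

```python
# Simulated data; min_samples_leaf > 1 keeps leaves shared between units
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=5, random_state=0
).fit(X, y)

# apply() gives the terminal-node id of every observation in every tree
leaves = forest.apply(X)               # shape (n_samples, n_trees)

# proximity[i, j] = share of trees in which i and j land in the same leaf
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```

The result is symmetric, lies in [0, 1], and has ones on the diagonal, so it behaves like the supervised similarity score described above.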

SLIDE 26

SLIDE 27

SLIDE 28

Introduction to analogy weighting

  • Motivation 1: we want parameters most relevant to today.
  • Motivation 2: we want to know when the model is least likely to do a good job.

SLIDE 29

SLIDE 30

Unprincipled approaches

SLIDE 31

Analogy weighting: the idea

  • Train a random forest on the dependent variable of interest, with potentially many Xs
  • Take the proximity matrix from the random forest
  • Use the relevant row from this matrix to weight the observations in your parametric model
  • This is akin to training your model on the relevant history
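A hedged sketch of these steps, reusing the leaf-assignment trick: the proximity of each historical observation to the most recent one becomes a sample weight for a simple parametric model. Names, data, and the choice of linear regression are illustrative.

```python
# Simulated history; the last row plays the role of "today"
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -0.5, 0.0, 0.2]) + rng.normal(scale=0.3, size=300)

forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=5, random_state=0
).fit(X, y)
leaves = forest.apply(X)               # terminal-node ids, (n_samples, n_trees)

# Proximity of every observation to "today": share of trees where it
# lands in the same leaf as the last observation
prox_today = (leaves == leaves[-1]).mean(axis=1)

# Fit the simple parametric model on the history, weighted by relevance
model = LinearRegression().fit(X[:-1], y[:-1], sample_weight=prox_today[:-1])
```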

SLIDE 32

SLIDE 33

Implementing

  • For very simple models, canned functions normally take a weights argument.
  • For complex models, weights are not normally included.
  • Use Stan
  • Make a direct call to increment_log_prob rather than using sampling notation
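In other words, the trick is to multiply each observation's log-likelihood contribution by its weight before adding it to the target. A toy Python analogue of that weighted-increment idea (not the talk's Stan code), maximising a weighted normal log-likelihood:

```python
# Toy model: weighted MLE of a normal mean/scale, where w plays the
# role of the analogy weights (e.g. a row of the proximity matrix)
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
y = rng.normal(loc=2.0, scale=1.0, size=500)
w = rng.random(500)

def neg_weighted_loglik(params):
    mu, log_sigma = params
    # weighted analogue of: target += normal_lpdf(y | mu, sigma)
    return -np.sum(w * norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

fit = minimize(neg_weighted_loglik, x0=np.array([0.0, 0.0]))
mu_hat = fit.x[0]                      # maximiser is the weighted mean of y
```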

SLIDE 34

When should I ignore my model?

SLIDE 35

And when history is not relevant?

SLIDE 36

Covariance in scale-correlation form

Σ = diag(σ) Ω diag(σ)

  • Here, σ is a vector of standard deviations and Ω is a correlation matrix
  • We can give σ a non-negative prior (say, half-Cauchy) and Ω an LKJ prior
  • LKJ is a one-parameter distribution over correlation matrices.
  • Low values of the parameter (approaching 1) give a uniform prior over correlations.
  • High values (approaching infinity) concentrate on the identity matrix.
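The decomposition above in code, with illustrative values for σ and Ω:

```python
import numpy as np

sigma = np.array([1.0, 2.0, 0.5])       # standard deviations
Omega = np.array([[ 1.0, 0.3, -0.2],    # correlation matrix
                  [ 0.3, 1.0,  0.1],
                  [-0.2, 0.1,  1.0]])

# Sigma = diag(sigma) * Omega * diag(sigma)
Sigma = np.diag(sigma) @ Omega @ np.diag(sigma)
```

The diagonal of Σ recovers the variances σ², which is why separating scale (σ) from correlation (Ω) lets us place priors on each piece independently.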
SLIDE 37

Application: volatility modelling during the financial crisis

  • Most volatility models work like so:

returns vector(t) ~ multivariate distribution(expected return(t), covariance(t))

  • The expected-returns model is just a forecasting model
  • Covariance needs to be explicitly modelled
  • Multivariate GARCH is common.
  • CCC-GARCH allows time-varying shock magnitudes
  • DCC allows time-varying correlations that update with correlated shocks

SLIDE 38

LKJ as a “danger prior” in volatility models

  • Idea: when we have relevant histories, we learn the correlation structure from the data.
  • When we have no relevant history, the likelihood does not impact the posterior and we revert to the prior.
  • Using an LKJ prior with a low parameter value gives us highly correlated returns in unprecedented states.

SLIDE 39

SLIDE 40

SLIDE 41

Questions?