

SLIDE 1

Optimally Combining Outcomes to Improve Prediction

David Benkeser (benkeser@berkeley.edu)

UC Berkeley Biostatistics
November 15, 2016

SLIDE 2

Acknowledgments

Collaborators: Mark van der Laan, Alan Hubbard, Ben Arnold, Jack Colford, Andrew Mertens, Oleg Sofrygin

Funding: Bill and Melinda Gates Foundation OPP1147962

SLIDE 3

Motivation

The Gates Foundation’s HBGDki (Healthy Birth, Growth, and Development knowledge integration) is a program aimed at improving early childhood development. [1] It is a “precision public health” initiative: getting the right intervention to the right child at the right time.

SLIDE 4

Cebu Study

The Cebu Longitudinal Health and Nutrition Survey enrolled pregnant women between 1983 and 1984. [2] Children were followed every 6 months for two years after birth, and again at ages 8, 11, 16, 18, and 21. Research focuses on the long-term effects of prenatal and early childhood exposures on later outcomes.

SLIDE 5

Cebu Study

Questions:

  1. Can a screening tool be constructed using information available in early childhood to predict neurocognitive deficits later in life?
  2. What variables improve prediction of neurocognitive outcomes?
  3. Do somatic growth measures improve predictions of neurocognitive outcomes?

SLIDE 6

Observed data

The observed data are n = 2,166 observations O = (X, Y).

X: D covariates. Early childhood measurements: health care access, sanitation, parental information, gestational age.

Y: J outcomes. Achievement test scores at age 11: Math, English, Cebuano.

SLIDE 7

Combining test scores

Reasons to combine test scores into a single score:

  1. No scientific reason to prefer one score
  2. Predicting deficit in any domain is important
  3. Avoid multiple comparisons
  4. Improve prediction?
SLIDE 8

Existing methods

Scores could be standardized and summed:

$Z_i = \sum_j \frac{Y_{ij} - \bar Y_j}{\hat\sigma_j}$

(a minimal R sketch of this score appears below the list).

Downsides:

  1. Somewhat ad hoc
  2. Outcomes may not be strongly related to X
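A minimal R sketch of the standardized-sum score, using made-up scores (outcome names hypothetical):

    # Hypothetical test scores for 100 children (simulated)
    set.seed(1)
    Y <- data.frame(math    = rnorm(100, 50, 10),
                    english = rnorm(100, 60, 12),
                    cebuano = rnorm(100, 55,  8))
    # scale() centers each column at its mean and divides by its SD, so
    # rowSums() returns Z_i = sum_j (Y_ij - Ybar_j) / sigmahat_j
    Z <- rowSums(scale(Y))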
SLIDE 9

Existing methods

Principal components or factor analysis of Y:

  - Perform some transformation of Y
  - Look at eigenvalues, scree plots, etc. to choose factors
  - Decide on a linear combination of factors

Downsides:

  1. Very ad hoc
  2. Difficult to interpret/explain
  3. Outcomes may not be strongly related to X
SLIDE 10

Existing methods

Supervised methods, e.g., canonical correlation, redundancy analysis, partial least squares, etc.

Downsides:

  1. Difficult to interpret
  2. Outcomes not naturally combined into a single score
  3. Inference not straightforward
SLIDE 11

Combining scores

What are the characteristics of a good composite score?

  1. Simple to interpret
  2. Reflects the scientific goal (prediction)
  3. Procedure can be fully pre-specified

Consider a simple weighted combination of outcomes, $Y_\omega = \sum_j \omega_j Y_j$, with $\omega_j > 0$ for all $j$ and $\sum_j \omega_j = 1$. Can we choose the weights to optimize our predictive power?
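In code, the composite outcome is just a convex combination of the outcome columns; a minimal sketch with made-up data and weights:

    set.seed(1)
    Y <- matrix(rnorm(300), nrow = 100, ncol = 3)  # three hypothetical outcomes
    omega <- c(0.5, 0.3, 0.2)                      # omega_j > 0, summing to 1
    Y_omega <- drop(Y %*% omega)                   # Y_omega,i = sum_j omega_j * Y_ij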

SLIDE 12

Predicting composite outcome

Consider predicting the composite outcome $Y_\omega$ with a prediction function $\psi_\omega(X)$. A measure of the performance of $\psi_\omega$ is

$R^2_{0,\omega}(\psi_\omega) = 1 - \frac{E_0[\{Y_\omega - \psi_\omega(X)\}^2]}{E_0[\{Y_\omega - \mu_{0,\omega}\}^2]}\,,$

where $\mu_{0,\omega} = E_0(Y_\omega)$.

SLIDE 13

Predicting composite outcome

It is easy to show that $R^2_{0,\omega}$ is maximized by

$\psi_{0,\omega}(X) = E_0(Y_\omega \mid X) = E_0\Big(\sum_j \omega_j Y_j \,\Big|\, X\Big) = \sum_j \omega_j E_0(Y_j \mid X) = \sum_j \omega_j \psi_{0,j}(X)\,.$

The best predictor of the composite outcome is the weighted combination of the best predictor of each outcome.

SLIDE 14

Choosing weights

For each choice of weights,

$R^2_{0,\omega}(\psi_{0,\omega}) = 1 - \frac{E_0[\{Y_\omega - \psi_{0,\omega}(X)\}^2]}{E_0[\{Y_\omega - \mu_{0,\omega}\}^2]}$

is the best we can do predicting the composite outcome. Now we choose the weights that maximize R-squared,

$\omega_0 = \operatorname{argmax}_\omega\, R^2_{0,\omega}(\psi_{0,\omega})\,.$

The statistical goal is to estimate $\omega_0$ and $\psi_{0,\omega_0}$.

SLIDE 15

Simple example

Let $X = (X_1, \dots, X_6)$ and $Y = (Y_1, \dots, Y_6)$, with

  - $X_d \sim \mathrm{Uniform}(0, 4)$ for $d = 1, \dots, 6$
  - $Y_j \sim \mathrm{Normal}(0, 25)$ for $j = 1, 2, 3$
  - $Y_j \sim \mathrm{Normal}(X_1 + 2X_2 + 4X_3 + 2X_j,\ 25)$ for $j = 4, 5, 6$

So $Y_1, Y_2, Y_3$ are noise; $X_1, X_2, X_3$ predict $Y_4, Y_5, Y_6$; and $X_j$ predicts only $Y_j$ for $j = 4, 5, 6$.

    Outcome        R^2
    Y1, Y2, Y3     0.00
    Y4, Y5, Y6     0.57
    Standardized   0.37
    Optimal        0.87
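A minimal R sketch of this data-generating process (sample size arbitrary):

    set.seed(1)
    n <- 1000
    X <- matrix(runif(n * 6, 0, 4), ncol = 6)  # X_d ~ Uniform(0, 4)
    Y <- matrix(NA, n, 6)
    Y[, 1:3] <- rnorm(n * 3, 0, 5)             # noise: Normal(0, 25), i.e., sd 5
    for (j in 4:6) {
      Y[, j] <- rnorm(n, X[, 1] + 2 * X[, 2] + 4 * X[, 3] + 2 * X[, j], 5)
    }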

SLIDE 16

Predicting each outcome

The best predictor of the composite outcome is the weighted combination of the best predictor of each outcome. How should we go about estimating a prediction function for each outcome?

  - Linear regression, with interactions and nonlinear terms, or splines (with different degrees)?
  - Penalized linear regression, with different penalties?
  - Random forests, with different tuning parameters?
  - Gradient boosting? Support vector machines? Deep neural networks?
  - Ad infinitum...

SLIDE 17

Predicting each outcome

We have no way of knowing a priori which algorithm is best. This depends on the truth! The best prediction function might be different across outcomes.

How can we objectively evaluate M algorithms for $Y_j$?

  - Option 1: Train the algorithms and see which has the best $R^2$. Overfit!
  - Option 2: Train the algorithms and do a new experiment to evaluate them. Expensive!
  - Option 3: Cross-validation!

SLIDE 18

Cross validation

Consider randomly splitting the data into K different pieces.

[Diagram: S1 | S2 | S3 | S4 | S5]

SLIDE 19

Cross validation

Validation sample $V_1 = \{i \in S_1\}$ and training sample $T_1 = \{i \notin S_1\}$.

[Diagram: V1 | T1 | T1 | T1 | T1]

SLIDE 20

Cross validation

For $m = 1, \dots, M$, fit the algorithms using the training sample. For example, $\Psi_m$ could correspond to a linear regression:

  1. Estimate the parameters of the regression model using $\{O_i : i \in T_1\}$.
  2. $\Psi_m(T_1)$ is now a prediction function.

To predict on a new observation $x$, $\Psi_m(T_1)(x) = \hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_D x_D$.
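For instance, a minimal R sketch with made-up data, taking $\Psi_m$ to be a main-terms linear regression:

    set.seed(1)
    train <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))
    psi_m <- lm(y ~ ., data = train)  # estimate the beta-hats on T_1
    # Psi_m(T_1)(x) = beta0hat + beta1hat * x1 + beta2hat * x2
    predict(psi_m, newdata = data.frame(x1 = 0.5, x2 = -1))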

SLIDE 21

Cross validation

The algorithm $\Psi_m$ could be more complicated:

  1. Estimate the parameters of the full regression model using $\{O_i : i \in T_1\}$.
  2. Use backward selection, eliminating variables with p-value > 0.1.
  3. Stop when no more variables are eliminated.
  4. $\Psi_m(T_1)$ is now a prediction function.

To predict on a new observation $x$, $\Psi_m(T_1)(x) = x_{\mathrm{final}} \hat\beta_{\mathrm{final}}$.
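A minimal R sketch of this selection routine (function name hypothetical), assuming a data frame with outcome y and numeric predictors:

    backward_select <- function(train, alpha = 0.1) {
      vars <- setdiff(names(train), "y")
      repeat {
        fit <- lm(reformulate(vars, response = "y"), data = train)
        p <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # p-values, intercept dropped
        if (length(vars) == 1 || max(p) <= alpha) return(fit)
        vars <- setdiff(vars, names(which.max(p)))      # eliminate the worst variable
      }
    }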

SLIDE 22

Cross validation

The algorithm $\Psi_m$ could be a machine learning algorithm:

  1. Train a support vector machine using $\{O_i : i \in T_1\}$.
  2. $\Psi_m(T_1)$ is now a “black box” prediction function.

To predict on a new observation $x$, feed $x$ into the black box and get a prediction back.

SLIDE 23

Cross validation

For $m = 1, \dots, M$, compute the error on the validation sample:

$E_{m,1} = \frac{1}{|V_1|} \sum_{i \in V_1} \{Y_i - \Psi_m(T_1)(X_i)\}^2$

[Diagram: V1 | T1 | T1 | T1 | T1]

SLIDE 24

Cross validation

For $m = 1, \dots, M$, compute the error on the validation sample:

$E_{m,1} = \frac{1}{|V_1|} \sum_{i \in V_1} \{Y_i - \Psi_m(T_1)(X_i)\}^2$

To compute:

  1. Obtain predictions on the validation sample using the algorithms fit on the training sample.
  2. Average the squared residual over the observations.

It is as though we did another experiment (of size $|S_1|$) to evaluate the algorithms!
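Concretely, a minimal sketch with made-up data:

    set.seed(1)
    d <- data.frame(x = rnorm(100)); d$y <- 2 * d$x + rnorm(100)
    v1 <- 1:20                            # indices in the validation sample V_1
    fit <- lm(y ~ x, data = d[-v1, ])     # Psi_m(T_1), fit on the training sample
    E_m1 <- mean((d$y[v1] - predict(fit, newdata = d[v1, ]))^2)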

SLIDE 25

Cross validation

Validation sample $V_2 = \{i \in S_2\}$ and training sample $T_2 = \{i \notin S_2\}$.

[Diagram: T2 | V2 | T2 | T2 | T2]

SLIDE 26

Cross validation

For $m = 1, \dots, M$, fit the algorithms using the training sample.

[Diagram: T2 | V2 | T2 | T2 | T2]

SLIDE 27

Cross validation

For $m = 1, \dots, M$, compute the error on the validation sample:

$E_{m,2} = \frac{1}{|V_2|} \sum_{i \in V_2} \{Y_i - \Psi_m(T_2)(X_i)\}^2$

[Diagram: T2 | V2 | T2 | T2 | T2]

SLIDE 28

Cross validation

Continue until each split has served as the validation sample once.

[Diagram: T3 | T3 | V3 | T3 | T3]

SLIDE 29

Cross validation

Continue until each split has served as the validation sample once.

[Diagram: T4 | T4 | T4 | V4 | T4]

SLIDE 30

Cross validation

Continue until each split has served as the validation sample once.

[Diagram: T5 | T5 | T5 | T5 | V5]

SLIDE 31

Cross validation selector

The overall performance of algorithm $m$ is

$\bar E_m = \frac{1}{K} \sum_k E_{m,k}\,,$

the average mean squared error across splits.

At this point, we could choose $m^*$, the algorithm with the lowest error. This is called the cross-validation selector. The prediction function is $\Psi_{m^*}(F)$, where $F = \{1, \dots, n\}$ is the full data.
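Putting the pieces together, a minimal R sketch of the cross-validation selector (data and algorithm library hypothetical):

    set.seed(1)
    n <- 500
    d <- data.frame(x1 = runif(n), x2 = runif(n))
    d$y <- d$x1 + sin(4 * d$x2) + rnorm(n, 0, 0.5)
    K <- 5
    fold <- sample(rep(seq_len(K), length.out = n))  # random split S_1, ..., S_K
    algos <- list(                                   # the library Psi_1, ..., Psi_M
      linear   = function(tr) lm(y ~ x1 + x2, data = tr),
      flexible = function(tr) lm(y ~ poly(x1, 3) + poly(x2, 3), data = tr)
    )
    E <- matrix(NA, length(algos), K, dimnames = list(names(algos), NULL))
    for (k in seq_len(K)) {
      tr <- d[fold != k, ]; va <- d[fold == k, ]
      for (m in seq_along(algos)) {
        fit <- algos[[m]](tr)                                   # Psi_m(T_k)
        E[m, k] <- mean((va$y - predict(fit, newdata = va))^2)  # E_{m,k}
      }
    }
    Ebar <- rowMeans(E)     # average error across splits
    names(which.min(Ebar))  # the cross-validation selector m*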

SLIDE 32

Super learner

Alternatively, consider an ensemble prediction function

$\Psi = \sum_m \alpha_m \Psi_m\,, \quad \alpha_m > 0 \text{ for all } m\,, \quad \sum_m \alpha_m = 1\,,$

and choose $\alpha$ to minimize the cross-validated error. This is often seen to have superior performance to choosing the single best algorithm. This estimator is referred to as the Super Learner. [3]

SLIDE 33

Super learner

For example, linear regression might capture one feature of the data, while a support vector machine captures another. The prediction function $\Psi(x) = 0.5\,\Psi_{\mathrm{linmod}}(x) + 0.5\,\Psi_{\mathrm{svm}}(x)$ might be better than $\Psi_{\mathrm{linmod}}$ or $\Psi_{\mathrm{svm}}$ alone. Computing the best weighting of algorithms is computationally simple after cross-validation.
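As a hedged sketch of how this looks with the SuperLearner package [6] (library choice and simulated data are purely illustrative; SL.glmnet requires the glmnet package):

    library(SuperLearner)
    set.seed(1)
    n <- 500
    X <- data.frame(x1 = runif(n), x2 = runif(n))
    Y <- X$x1 + sin(4 * X$x2) + rnorm(n, 0, 0.5)
    sl <- SuperLearner(Y = Y, X = X, family = gaussian(),
                       SL.library = c("SL.mean", "SL.glm", "SL.glmnet"),
                       cvControl = list(V = 10))
    sl$coef                              # estimated ensemble weights alpha_m
    head(predict(sl, newdata = X)$pred)  # ensemble predictions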

SLIDE 34

Oracle inequality

The name Super Learner derives from an important theoretical result called the oracle inequality: for large enough sample size, the Super Learner predicts as well as the (unknown) best algorithm considered. The number of algorithms considered may be large and is allowed to grow with $n$, e.g., $M_n = n^2$. [4]

SLIDE 35

Combined prediction function

We now have a super learner prediction function $\Psi_j(F)$ for each outcome $Y_j$. For any choice of weights, we have a prediction function for the combined outcome:

$\psi_{n,\omega} = \sum_j \omega_j \Psi_j(F)\,.$

We still need to choose the weights that maximize the predictive performance for the combined outcome.

SLIDE 36

Choosing optimal weights

How do we go about estimating $\omega_0$, the weights that yield the highest $R^2$?

  - Option 1: Get SL predictions and maximize $R^2$ over the weights. Overfit!
  - Option 2: Do a new experiment and maximize $R^2$ on the new data. Expensive!
  - Option 3: Cross-validation!

SLIDE 37

Cross validation

Randomly split the data into K different pieces.

[Diagram: S1 | S2 | S3 | S4 | S5]

SLIDE 38

Cross validation

Training sample $T_1 = \{i \notin S_1\}$ and validation sample $V_1 = \{i \in S_1\}$.

[Diagram: V1 | T1 | T1 | T1 | T1]

SLIDE 39

Cross validation

Fit the Super Learner on the training data, $\Psi(T_1)$.

[Diagram: V1 | T1 | T1 | T1 | T1]

SLIDE 40

Cross validation

Fit the Super Learner on the training data, $\Psi(T_1)$.

[Diagram: outer split V1 | T1 | T1 | T1 | T1; T1 is further split into five inner folds, with inner fold 1 as validation]

SLIDE 41

Cross validation

Fit the Super Learner on the training data, $\Psi(T_1)$.

[Diagram: outer split V1 | T1 | T1 | T1 | T1; within T1, inner fold 2 as validation]

SLIDE 42

Cross validation

Fit the Super Learner on the training data, $\Psi(T_1)$.

[Diagram: outer split V1 | T1 | T1 | T1 | T1; within T1, inner fold 3 as validation]

SLIDE 43

Cross validation

Fit the Super Learner on the training data, $\Psi(T_1)$.

[Diagram: outer split V1 | T1 | T1 | T1 | T1; within T1, inner fold 4 as validation]

SLIDE 44

Cross validation

Fit the Super Learner on the training data, $\Psi(T_1)$.

[Diagram: outer split V1 | T1 | T1 | T1 | T1; within T1, inner fold 5 as validation]

SLIDE 45

Cross validation

Fit the Super Learner on the training data, $\Psi(T_2)$.

[Diagram: T2 | V2 | T2 | T2 | T2]

SLIDE 46

Cross validation

Fit the Super Learner on the training data, $\Psi(T_3)$.

[Diagram: T3 | T3 | V3 | T3 | T3]

SLIDE 47

Estimating the weights

Objective performance of the combined super learner for predicting the combined outcome:

$\bar E_\omega(\Psi_\omega) = \frac{1}{K} \sum_k \frac{1}{|V_k|} \sum_{i \in V_k} \{Y_{\omega,i} - \Psi_\omega(T_k)(X_i)\}^2$

To compute:

  1. Fit J super learners in each of the K splits.
  2. Compute the k-th combined prediction function $\Psi_\omega(T_k) = \sum_j \omega_j \Psi_j(T_k)$.
  3. Obtain combined predictions $\Psi_\omega(T_k)(X_i)$.
  4. Compute the combined outcome $Y_{\omega,i}$.
  5. Compute the average squared error in each split.
  6. Average over splits.
SLIDE 48

Estimating the weights

For a given $\omega$, the cross-validated $R^2$ is

$R^2_{n,\omega}(\Psi_\omega) = 1 - \frac{\bar E_\omega(\Psi_\omega)}{\frac{1}{n} \sum_i \{Y_{\omega,i} - \bar Y_\omega\}^2}\,.$

Our estimate of the optimal weights is

$\omega_n = \operatorname{argmax}_\omega\, R^2_{n,\omega}(\Psi_\omega)\,,$

and our estimate of the prediction function is

$\psi_{n,\omega_n} = \sum_j \omega_{j,n} \Psi_j(F)\,,$

where $\Psi_j(F)$ denotes the j-th super learner fit using all of the data.
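A minimal sketch of the weight search, with stand-in matrices for the outcomes and their cross-validated super learner predictions; a softmax reparametrization keeps the weights positive and summing to one:

    set.seed(1)
    n <- 200; J <- 3
    Zhat <- matrix(rnorm(n * J), n, J)         # stand-in CV predictions, one column per outcome
    Y    <- Zhat + matrix(rnorm(n * J), n, J)  # stand-in observed outcomes
    cv_r2 <- function(omega) {
      y_w   <- drop(Y %*% omega)               # combined outcome Y_omega
      psi_w <- drop(Zhat %*% omega)            # combined CV prediction
      1 - mean((y_w - psi_w)^2) / mean((y_w - mean(y_w))^2)
    }
    softmax <- function(b) exp(b) / sum(exp(b))
    opt <- optim(rep(0, J), function(b) -cv_r2(softmax(b)), method = "BFGS")
    omega_n <- softmax(opt$par)                # estimated optimal weights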

SLIDE 49

Estimating predictive performance

Researchers are likely interested in evaluating the performance of $\psi_{n,\omega_n}$ for predicting $Y_{\omega_n}$: how good is the combined super learner at predicting the combined outcome? To evaluate performance:

  - Option 1: Report $R^2_{n,\omega_n}$ and call it a day. Overfit!
  - Option 2: Estimate $\psi_{n,\omega_n}$, do a new experiment, and evaluate the predictions. Expensive!
  - Option 3: More cross-validation!!!

SLIDE 50

Cross validation

Pictures omitted for the sanity of audience.

SLIDE 51

Estimating performance

The entire procedure is cross-validated to obtain an honest estimate of the performance of the estimator:

  1. Compute $\omega_n$ and $\Psi_{\omega_n}$ in the training sample.
  2. Estimate $R^2$ in the validation sample.
  3. Average over splits.

It is relatively straightforward to construct (closed-form) confidence intervals and hypothesis tests. [5] Simulations show nominal confidence interval coverage with $n \approx 500$.

SLIDE 52

Variable importance measures

We can define the importance of a variable for prediction as the difference in $R^2$ with and without the variable. It is relatively straightforward to construct (closed-form) confidence intervals and hypothesis tests for this estimate. This provides simple inference for the question, “Does measuring $X_d$ improve my ability to predict the composite outcome?”
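A minimal sketch of this importance measure, substituting a simple linear learner for the full super learner (made-up data):

    set.seed(1)
    n <- 500
    d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
    d$y <- d$x1 + 0.5 * d$x2 + rnorm(n)
    cv_r2 <- function(vars, K = 10) {
      fold <- sample(rep(seq_len(K), length.out = n))
      sse <- sst <- 0
      for (k in seq_len(K)) {
        fit <- lm(reformulate(vars, "y"), data = d[fold != k, ])
        yv  <- d$y[fold == k]
        sse <- sse + sum((yv - predict(fit, newdata = d[fold == k, ]))^2)
        sst <- sst + sum((yv - mean(d$y[fold != k]))^2)
      }
      1 - sse / sst
    }
    # Importance of x1: drop in cross-validated R^2 when x1 is removed
    cv_r2(c("x1", "x2", "x3")) - cv_r2(c("x2", "x3"))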

SLIDE 53

Cebu Study

Questions:

  1. Can a screening tool be constructed using information available in early childhood to predict neurocognitive deficits later in life?
  2. What variables improve prediction of neurocognitive outcomes?
  3. Do somatic growth measures improve predictions of neurocognitive outcomes?

SLIDE 54

Cebu results

The Super Learner library consisted of elastic net regressions with different tuning parameters to predict each outcome. We used ten folds for each layer of cross-validation. The estimated optimal outcome score was

$Y_{\omega_n} = 0.71\,Y_{\mathrm{English}} + 0.17\,Y_{\mathrm{Math}} + 0.12\,Y_{\mathrm{Cebuano}}$

SLIDE 55

Cebu results

Predictive power was modest, but strongest for the combined score.

[Figure: cross-validated $R^2_n$ with 95% confidence intervals for the Cebuano, Math, Standardized, English, and Optimal scores; axis from 0.00 to 0.30, with the Optimal score highest.]

SLIDE 56

Cebu results

Variable importance for each variable.

[Figure: variable importance, $\Delta R^2_n$ with 95% confidence intervals, for each variable in the combined-outcome analysis: SES; WAZ and HAZ at birth and at 6, 12, 18, and 24 months; child dependency ratio; mother's age at first birth; income; gestational age; sanitation; clean water; mother's parity; mother smoked; child:adult ratio; crowding index; mother's height; mother's age; urban score; health care access and utilization; mother's marital status; father's age; father's and mother's education; sex. Axis from −0.02 to 0.06.]
SLIDE 57

Cebu results

Variable importance for groups of variables.

[Figure: variable importance, $\Delta R^2_n$ with 95% confidence intervals, for groups of variables in the combined-outcome analysis: SES, gestational age, mother smoked, sanitation, household, healthcare, growth, sex, parental. Axis from −0.02 to 0.06.]
SLIDE 58

Conclusions

How composite outcomes are formed should be informed by the scientific goal; when possible, opt for simplicity. Cross-validation is a powerful tool for mimicking repeated experiments. The super learner is an objective framework for evaluating and combining different prediction algorithms. Future work includes handling combined outcomes as exposures and estimating the effects of variables on combined outcomes.

SLIDE 59

Software

R packages:

  - SuperLearner [6]. Demonstration: http://benkeser.github.io/sllecture/
  - h2oEnsemble. Distributed machine learning algorithms.
  - r2weight. In development; the current version can be downloaded from GitHub: https://github.com/benkeser/r2weight

SLIDE 60

References

[1] N L’ntshotsholé Jumbe, Jeffrey C Murray, and Steven Kern. Data sharing and inductive learning – Toward healthy birth, growth, and development. New England Journal of Medicine, 2016.

[2] AB Feranil, SA Gultiano, and LS Adair. The Cebu Longitudinal Health and Nutrition Survey: Two Decades Later. Asia-Pacific Population Journal, 23(3), 2008.

[3] Mark J van der Laan, Eric C Polley, and Alan E Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):1–23, 2007.

[4] Mark J van der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. UC Berkeley Division of Biostatistics Working Paper Series, 2003.

[5] Alan E Hubbard, Sara Kherad-Pajouh, and Mark J van der Laan. Statistical inference for data adaptive target parameters. The International Journal of Biostatistics, 12(1):3–19, 2016.

[6] Eric Polley and Mark van der Laan. SuperLearner: Super Learner Prediction, 2013. R package version 2.0-10.