SLIDE 1 Optimally Combining Outcomes to Improve Prediction
David Benkeser benkeser@berkeley.edu
November 15, 2016 UC Berkeley Biostatistics
SLIDE 2
Acknowledgments
Collaborators: Mark van der Laan, Alan Hubbard, Ben Arnold, Jack Colford, Andrew Mertens, Oleg Sofrygin. Funding: Bill and Melinda Gates Foundation, OPP1147962
SLIDE 3
Motivation
The Gates Foundation's HBGDki is a program aimed at improving early childhood development. [1] A "precision public health" initiative: getting the right intervention to the right child at the right time.
SLIDE 4
Cebu Study
The Cebu Longitudinal Health and Nutrition Survey enrolled pregnant women between 1983 and 1984. [2] Children were followed every 6 months for two years after birth, and again at ages 8, 11, 16, 18, and 21. Research focuses on the long-term effects of prenatal and early-childhood exposures on later outcomes.
SLIDE 5 Cebu Study
Questions:
- 1. Can a screening tool be constructed using information available in early childhood to predict neurocognitive deficits later in life?
- 2. What variables improve prediction of neurocognitive outcomes?
- 3. Do somatic growth measures improve predictions of neurocognitive outcomes?
SLIDE 6
Observed data
The observed data are n = 2,166 observations O = (X, Y).
X = D covariates. Early childhood measurements: health care access, sanitation, parental information, gestational age.
Y = J outcomes. Achievement test scores at 11 years old: Math, English, Cebuano.
SLIDE 7 Combining test scores
Reasons to combine test scores into single score:
- 1. No scientific reason to prefer one score
- 2. Predicting deficit in any domain is important
- 3. Avoid multiple comparisons
- 4. Improve prediction?
SLIDE 8 Existing methods
Scores could be standardized and summed:
Zi = ∑j (Yij − Ȳj) / σ̂j
Downsides:
- 1. Somewhat ad-hoc
- 2. Outcomes may not be strongly related to X
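The standardize-and-sum score above can be sketched in a few lines. This is a minimal NumPy illustration (the function name and toy data are my own, not from the talk); each outcome column is centered by its mean, scaled by its sample standard deviation, and the columns are summed.

```python
import numpy as np

def standardized_sum(Y):
    """Combine J outcomes into one score: standardize each column
    (subtract its mean, divide by its sample SD), then sum across outcomes."""
    Y = np.asarray(Y, dtype=float)
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    return Z.sum(axis=1)

# Two outcomes on very different scales; standardization puts them
# on equal footing before summing.
Y = np.array([[100.0, 1.0],
              [110.0, 2.0],
              [120.0, 3.0]])
Z = standardized_sum(Y)  # each column standardizes to [-1, 0, 1]
```

Note that every outcome gets equal weight here regardless of how predictable it is, which is exactly the "somewhat ad hoc" downside noted above.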
SLIDE 9 Existing methods
Principal components / factor analysis of Y:
- Perform some transformation of Y
- Look at eigenvalues, scree plots, etc. to choose factors
- Decide on a linear combination of factors
Downsides:
- 1. Very ad-hoc
- 2. Difficult to interpret/explain
- 3. Outcomes may not be strongly related to X
SLIDE 10 Existing methods
Supervised methods, e.g., canonical correlation, redundancy analysis, partial least squares, etc... Downsides:
- 1. Difficult to interpret
- 2. Outcomes not naturally combined into single score
- 3. Inference not straightforward
SLIDE 11 Combining scores
What are characteristics of a good composite score?
- 1. Simple to interpret
- 2. Reflect the scientific goal (prediction)
- 3. Procedure can be fully pre-specified
Consider a simple weighted combination of outcomes,
Yω = ∑j ωj Yj , with ωj > 0 for all j and ∑j ωj = 1
Can we choose the weights to optimize our predictive power?
SLIDE 12
Predicting composite outcome
Consider predicting the composite outcome Yω with a prediction function ψω(X). A measure of the performance of ψω is
R²0,ω(ψω) = 1 − E0[{Yω − ψω(X)}²] / E0[{Yω − µ0,ω}²] , where µ0,ω = E0(Yω).
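The sample analogue of this R² is one minus the ratio of the prediction's mean squared error to the mean squared error of simply predicting the mean. A small sketch (function name and data are illustrative, not from the talk):

```python
import numpy as np

def r_squared(y, pred):
    """Sample analogue of R^2: 1 - MSE(prediction) / MSE(marginal mean)."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    mse_model = np.mean((y - pred) ** 2)
    mse_mean = np.mean((y - y.mean()) ** 2)
    return 1.0 - mse_model / mse_mean

y = np.array([1.0, 2.0, 3.0, 4.0])
# Perfect predictions give R^2 = 1; predicting the mean gives R^2 = 0.
```

Values near 1 mean the prediction function explains most of the variance in Yω; values near 0 mean it does no better than the marginal mean.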
SLIDE 13 Predicting composite outcome
Easy to show that R²0,ω is maximized by using
ψ0,ω(X) = E0(Yω | X) = E0(∑j ωj Yj | X) = ∑j ωj E0(Yj | X) = ∑j ωj ψ0,j(X)
The best predictor of the composite outcome is the weighted combination of the best predictor of each outcome.
SLIDE 14
Choosing weights
For each choice of weights,
R²0,ω(ψ0,ω) = 1 − E0[{Yω − ψ0,ω(X)}²] / E0[{Yω − µ0,ω}²]
is the best we could do predicting the composite outcome. Now choose the weights that maximize this R-squared:
ω0 = argmaxω R²0,ω(ψ0,ω)
The statistical goal is to estimate ω0 and ψ0,ω0.
SLIDE 15
Simple example
Let X = (X1, . . . , X6) and Y = (Y1, . . . , Y6), with
Xd ∼ Uniform(0, 4) for d = 1, . . . , 6
Yj ∼ Normal(0, 25) for j = 1, 2, 3
Yj ∼ Normal(X1 + 2X2 + 4X3 + 2Xj, 25) for j = 4, 5, 6
Y1, Y2, and Y3 are noise; X1, X2, X3 predict Y4, Y5, Y6; Xj predicts only Yj for j = 4, 5, 6.

Outcome        R²
Y1, Y2, Y3     0.00
Y4, Y5, Y6     0.57
Standardized   0.37
Optimal        0.87
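The simple example above can be reproduced with a short simulation. This sketch (interpreting Normal(·, 25) as variance 25, i.e. SD 5) draws the data as described and fits an ordinary least squares regression per outcome; exact R² values will vary with the random seed, but a noise outcome should sit near 0 and a signal outcome near the 0.57 on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0, 4, size=(n, 6))

Y = np.empty((n, 6))
for j in range(3):                        # Y1-Y3: pure noise
    Y[:, j] = rng.normal(0, 5, n)
for j in range(3, 6):                     # Y4-Y6: signal plus noise
    mean = X[:, 0] + 2 * X[:, 1] + 4 * X[:, 2] + 2 * X[:, j]
    Y[:, j] = rng.normal(mean, 5)

def ols_r2(X, y):
    """In-sample R^2 of an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_noise = ols_r2(X, Y[:, 0])   # near 0: nothing to predict
r2_signal = ols_r2(X, Y[:, 3])  # near 0.57: matches the slide's per-outcome value
```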
SLIDE 16
Predicting each outcome
The best predictor of the composite outcome is the weighted combination of the best predictor of each outcome. How should we go about estimating a prediction function for each outcome?
- Linear regression, with interactions, nonlinear terms, or splines (with different degrees?)
- Penalized linear regression, with different penalties?
- Random forests, with different tuning parameters?
- Gradient boosting? Support vector machines? Deep neural networks?
- Ad infinitum...
SLIDE 17 Predicting each outcome
We have no way of knowing a priori which algorithm is best; this depends on the truth! The best prediction function might differ across outcomes.
How can we objectively evaluate M algorithms for Yj?
Option 1: Train algorithms, see which has the best R²
Option 2: Train algorithms, do a new experiment to evaluate (expensive!)
Option 3: Cross-validation!
SLIDE 18
Cross validation
Consider randomly splitting the data into K different pieces.
S1 | S2 | S3 | S4 | S5
SLIDE 19
Cross validation
Validation (V1 = {i ∈ S1}) and training (T1 = {i ∉ S1})
V1 | T1 | T1 | T1 | T1
SLIDE 20 Cross validation
For m = 1, . . . , M, fit algorithms using the training sample. For example, Ψm could correspond to a linear regression.
- 1. Estimate parameters of the regression model using {Oi : i ∈ T1}.
- 2. Ψm(T1) is now a prediction function.
To predict on a new observation x, Ψm(T1)(x) = β̂0 + β̂1 x1 + . . . + β̂D xD
SLIDE 21 Cross validation
The algorithm Ψm could be more complicated:
- 1. Estimate parameters of the full regression model using {Oi : i ∈ T1}
- 2. Use backward selection, eliminating variables with p-value > 0.1
- 3. Stop when no more variables are eliminated
- 4. Ψm(T1) is now a prediction function
To predict on a new observation x, Ψm(T1)(x) = x_final β̂_final
SLIDE 22 Cross validation
The algorithm Ψm could be a machine learning algorithm,
- 1. Train a support vector machine using {Oi : i ∈ T1}
- 2. Ψm(T1) is now a “black box” prediction function
To predict on a new observation x, feed x into black box and get prediction back.
SLIDE 23
Cross validation
For m = 1, . . . , M, compute the error on the validation sample:
Em,1 = (1/|V1|) ∑i∈V1 {Yi − Ψm(T1)(Xi)}²
V1 | T1 | T1 | T1 | T1
SLIDE 24 Cross validation
For m = 1, . . . , M, compute the error on the validation sample:
Em,1 = (1/|V1|) ∑i∈V1 {Yi − Ψm(T1)(Xi)}²
To compute:
- 1. Obtain predictions on the validation sample using algorithms fit on the training sample.
- 2. Average the squared residual over the validation observations.
It is as though we did another experiment (of size |S1|) to evaluate the algorithms!
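The validation-error computation above, repeated over all K splits, can be sketched compactly. This is an illustrative NumPy implementation (function names and the two toy candidate algorithms are mine): each candidate is a function that fits on the training rows and predicts on the validation rows, and the result is the average validation MSE per algorithm.

```python
import numpy as np

def cv_mse(X, y, fit_predict_fns, K=5, seed=1):
    """K-fold cross-validated MSE for each candidate algorithm.

    fit_predict_fns: list of functions (X_train, y_train, X_valid) -> predictions.
    Returns an array of length M: the average validation error E_bar_m."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % K  # balanced fold labels
    errs = np.zeros((len(fit_predict_fns), K))
    for k in range(K):
        tr, va = folds != k, folds == k
        for m, fit_predict in enumerate(fit_predict_fns):
            pred = fit_predict(X[tr], y[tr], X[va])
            errs[m, k] = np.mean((y[va] - pred) ** 2)
    return errs.mean(axis=1)

def ols(Xtr, ytr, Xva):
    """Candidate 1: linear regression with intercept."""
    Xd = np.column_stack([np.ones(len(Xtr)), Xtr])
    beta, *_ = np.linalg.lstsq(Xd, ytr, rcond=None)
    return np.column_stack([np.ones(len(Xva)), Xva]) @ beta

def mean_only(Xtr, ytr, Xva):
    """Candidate 2: ignore X, predict the training mean."""
    return np.full(len(Xva), ytr.mean())

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(0, 0.5, 300)
errors = cv_mse(X, y, [mean_only, ols])  # OLS should win on this linear truth
```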
SLIDE 25
Cross validation
Validation (V2 = {i ∈ S2}) and training (T2 = {i ∉ S2})
T2 | V2 | T2 | T2 | T2
SLIDE 26
Cross validation
For m = 1, . . . , M, fit algorithms using the training sample.
T2 | V2 | T2 | T2 | T2
SLIDE 27
Cross validation
For m = 1, . . . , M, compute the error on the validation sample:
Em,2 = (1/|V2|) ∑i∈V2 {Yi − Ψm(T2)(Xi)}²
T2 | V2 | T2 | T2 | T2
SLIDE 28
Cross validation
Continue until each split has served as the validation sample once.
T3 | T3 | V3 | T3 | T3
SLIDE 29
Cross validation
Continue until each split has served as the validation sample once.
T4 | T4 | T4 | V4 | T4
SLIDE 30
Cross validation
Continue until each split has served as the validation sample once.
T5 | T5 | T5 | T5 | V5
SLIDE 31 Cross validation selector
The overall performance of algorithm m is
Ēm = (1/K) ∑k Em,k ,
the average mean squared error across splits. At this point, we could choose m*, the algorithm with the lowest error. This is called the cross-validation selector.
The prediction function is Ψm*(F), where F = {1, . . . , n} is the full data.
SLIDE 32
Super learner
Alternatively, consider an ensemble prediction function
Ψ = ∑m αm Ψm , with αm > 0 for all m and ∑m αm = 1 ,
and choose α to minimize the cross-validated error. This often has superior performance to choosing the single best algorithm. This estimator is referred to as the Super Learner. [3]
SLIDE 33
Super learner
For example, linear regression might capture one feature of the data, while a support vector machine captures another. The prediction function Ψ(x) = 0.5 Ψlinmod(x) + 0.5 Ψsvm(x) might be better than Ψlinmod or Ψsvm alone. Computing the best weighting of algorithms is computationally simple after cross-validation.
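The ensemble weighting step can be sketched for two algorithms with a simple grid search over the convex weight. This is an illustration, not the Super Learner's actual optimizer (the real procedure uses cross-validated predictions; here two noisy predictors of a known truth stand in for them, and all names are mine):

```python
import numpy as np

def best_convex_weight(y, pred_a, pred_b, grid=np.linspace(0, 1, 101)):
    """Grid search for alpha in [0, 1] minimizing the MSE of the ensemble
    alpha * pred_a + (1 - alpha) * pred_b."""
    mses = [np.mean((y - a * pred_a - (1 - a) * pred_b) ** 2) for a in grid]
    return grid[int(np.argmin(mses))]

rng = np.random.default_rng(3)
truth = rng.normal(size=500)
# Two predictors with independent errors: averaging them cancels noise.
pred_a = truth + rng.normal(0, 1, 500)
pred_b = truth + rng.normal(0, 1, 500)
alpha = best_convex_weight(truth, pred_a, pred_b)
ensemble = alpha * pred_a + (1 - alpha) * pred_b
```

With independent, equal-variance errors the optimal weight is near 0.5, and the ensemble's MSE is lower than either predictor alone; this is the "captures different features" intuition on the slide.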
SLIDE 34
Oracle inequality
The name Super Learner derives from an important theoretical result called the oracle inequality: for a large enough sample size, the Super Learner predicts as well as the (unknown) best algorithm considered. The number of algorithms considered may be large and is allowed to grow with n, e.g., Mn = n². [4]
SLIDE 35
Combined prediction function
We now have a super learner prediction function Ψj(F) for each outcome Yj. For any choice of weights, we have a prediction function for the combined outcome:
ψn,ω = ∑j ωj Ψj(F)
We still need to choose the weights that maximize predictive performance for the combined outcome.
SLIDE 36 Choosing optimal weights
How do we go about estimating ω0, the weights that yield the highest R²?
Option 1: Get SL predictions, maximize R² over the weights
Option 2: Do a new experiment, maximize R² on the new data (expensive!)
Option 3: Cross-validation!
SLIDE 37
Cross validation
Randomly split the data into K different pieces.
S1 | S2 | S3 | S4 | S5
SLIDE 38
Cross validation
Training (T1 = {i ∉ S1}) and validation (V1 = {i ∈ S1})
V1 | T1 | T1 | T1 | T1
SLIDE 39
Cross validation
Fit the Super Learner on the training data, Ψ(T1).
V1 | T1 | T1 | T1 | T1
SLIDE 40 Cross validation
Fit the Super Learner on the training data, Ψ(T1). The Super Learner runs its own inner cross-validation within T1:
V1 | T1 | T1 | T1 | T1
Ṽ1 | T̃1 | T̃1 | T̃1 | T̃1
SLIDE 41 Cross validation
Fit the Super Learner on the training data, Ψ(T1).
V1 | T1 | T1 | T1 | T1
T̃2 | Ṽ2 | T̃2 | T̃2 | T̃2
SLIDE 42 Cross validation
Fit the Super Learner on the training data, Ψ(T1).
V1 | T1 | T1 | T1 | T1
T̃3 | T̃3 | Ṽ3 | T̃3 | T̃3
SLIDE 43 Cross validation
Fit the Super Learner on the training data, Ψ(T1).
V1 | T1 | T1 | T1 | T1
T̃4 | T̃4 | T̃4 | Ṽ4 | T̃4
SLIDE 44 Cross validation
Fit the Super Learner on the training data, Ψ(T1).
V1 | T1 | T1 | T1 | T1
T̃5 | T̃5 | T̃5 | T̃5 | Ṽ5
SLIDE 45
Cross validation
Fit the Super Learner on the training data, Ψ(T2).
T2 | V2 | T2 | T2 | T2
SLIDE 46
Cross validation
Fit the Super Learner on the training data, Ψ(T3).
T3 | T3 | V3 | T3 | T3
SLIDE 47 Estimating the weights
Objective performance of the combined super learner for predicting the combined outcome:
Ēω(Ψω) = (1/K) ∑k (1/|Vk|) ∑i∈Vk {Yω,i − Ψω(Tk)(Xi)}²
To compute:
- 1. Fit J super learners in each of the K splits.
- 2. Compute the k-th combined prediction function Ψω(Tk) = ∑j ωj Ψj(Tk).
- 3. Obtain combined predictions Ψω(Tk)(Xi).
- 4. Compute the combined outcome Yω,i.
- 5. Compute the average squared error in each split.
- 6. Average over splits.
SLIDE 48
Estimating the weights
For a given ω, the cross-validated R² is
R²n,ω(Ψω) = 1 − Ēω(Ψω) / [(1/n) ∑i {Yω,i − Ȳω}²]
Our estimate of the optimal weights is
ωn = argmaxω R²n,ω(Ψω)
Our estimate of the prediction function is
ψn,ωn = ∑j ωj,n Ψj(F) ,
where Ψj(F) denotes the j-th super learner fit using all of the data.
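The weight-selection step can be sketched for J = 2 outcomes with a grid search over the simplex. This is an illustration under assumed toy data (one noise outcome, one predictable outcome, with per-outcome predictions standing in for cross-validated super learner fits; all names are mine):

```python
import numpy as np

def optimal_outcome_weights(Y, preds, grid=np.linspace(0, 1, 101)):
    """For J = 2 outcomes, search over omega = (w, 1 - w) for the weights
    maximizing the R^2 of the combined prediction sum_j w_j * preds_j
    for the combined outcome sum_j w_j * Y_j."""
    best_w, best_r2 = None, -np.inf
    for w in grid:
        yw = w * Y[:, 0] + (1 - w) * Y[:, 1]          # combined outcome
        pw = w * preds[:, 0] + (1 - w) * preds[:, 1]  # combined prediction
        r2 = 1 - np.mean((yw - pw) ** 2) / np.mean((yw - yw.mean()) ** 2)
        if r2 > best_r2:
            best_w, best_r2 = w, r2
    return best_w, best_r2

rng = np.random.default_rng(4)
n = 1000
signal = rng.normal(size=n)
Y = np.column_stack([rng.normal(0, 1, n),               # outcome 1: pure noise
                     signal + rng.normal(0, 0.5, n)])   # outcome 2: predictable
preds = np.column_stack([np.zeros(n), signal])          # best predictor per outcome
w, r2 = optimal_outcome_weights(Y, preds)
```

The selected weight should put nearly all mass on the predictable outcome, mirroring how the procedure downweights outcomes that are not related to X.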
SLIDE 49 Estimating predictive performance
Researchers are likely interested in evaluating the performance of ψn,ωn for predicting Yωn. How good is the combined super learner at predicting the combined outcome? To evaluate performance:
Option 1: Report R²n,ωn and call it a day.
Option 2: Estimate ψn,ωn, do a new experiment, evaluate predictions (expensive!)
Option 3: More cross-validation!!!
SLIDE 50
Cross validation
Pictures omitted for the sanity of the audience.
SLIDE 51 Estimating performance
The entire procedure is cross-validated to obtain an honest estimate of the performance of the estimator.
- 1. Compute ωn and Ψωn in the training sample.
- 2. Estimate R² in the validation sample.
- 3. Average over splits.
It is relatively straightforward to construct (closed-form) confidence intervals and hypothesis tests. [5] Simulations show nominal confidence interval coverage with n ≈ 500.
SLIDE 52
Variable importance measures
We can define the importance of a variable for prediction as the difference in R² with and without the variable. It is relatively straightforward to construct (closed-form) confidence intervals and hypothesis tests for this estimate. This provides simple inference for the question, "Does measuring Xd improve my ability to predict the composite outcome?"
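The difference-in-R² importance measure can be sketched as follows. This illustration substitutes OLS for the super learner and uses in-sample R² for brevity (the actual procedure uses cross-validated R²; names and data are mine):

```python
import numpy as np

def ols_r2(X, y):
    """In-sample R^2 of an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def delta_r2(X, y, d):
    """Importance of variable d: R^2 using all variables minus R^2
    with variable d removed."""
    X_minus_d = np.delete(X, d, axis=1)
    return ols_r2(X, y) - ols_r2(X_minus_d, y)

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + rng.normal(0, 1, 500)  # only variable 0 matters
imp0 = delta_r2(X, y, 0)  # large: removing X0 destroys predictive power
imp1 = delta_r2(X, y, 1)  # near zero: X1 carries no signal
```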
SLIDE 53 Cebu Study
Questions:
- 1. Can a screening tool be constructed using information available in early childhood to predict neurocognitive deficits later in life?
- 2. What variables improve prediction of neurocognitive outcomes?
- 3. Do somatic growth measures improve predictions of neurocognitive outcomes?
SLIDE 54
Cebu results
The Super Learner library consisted of elastic net regressions with different tuning parameters to predict each outcome. We used ten folds for each layer of cross-validation. The estimated optimal outcome score was
Yωn = 0.71 Yenglish + 0.17 Ymath + 0.12 Ycebuano
SLIDE 55 Cebu results
Predictive power was modest, but strongest for the combined score.
[Figure: cross-validated R²n with 95% CIs (axis 0.00 to 0.30) for the Cebuano, Math, Standardized, English, and Optimal scores.]
SLIDE 56 Cebu results
Variable importance for each variable.
[Figure: ΔR²n with 95% CIs (axis −0.02 to 0.06) for each variable: SES, WAZ at 24 months, HAZ at 12 months, HAZ at 24 months, WAZ at 6 months, child dependency ratio, mother's age at first birth, income, gestational age, sanitation, clean water, mother's parity, mother smoked, WAZ at birth, child:adult ratio, crowding index, HAZ at 6 months, mother's height, HAZ at birth, mother's age, WAZ at 12 months, WAZ at 18 months, urban score, health care access, mother's marital status, HAZ at 18 months, father's age, health care utilization, father's education, mother's education, sex.]
SLIDE 57 Cebu results
Variable importance for groups of variables.
[Figure: ΔR²n with 95% CIs (axis −0.02 to 0.06) for groups of variables: SES, gestational age, mother smoked, sanitation, household, healthcare, growth, sex, parental.]
SLIDE 58
Conclusions
How composite outcomes are formed should be informed by the scientific goal; when possible, opt for simplicity. Cross-validation is a powerful tool for mimicking repeated experiments. The Super Learner is an objective framework for evaluating and combining different prediction algorithms. Future work: combining outcomes as an exposure, and estimating effects of variables on combined outcomes.
SLIDE 59
Software
R packages:
- SuperLearner [6]. Demonstration: http://benkeser.github.io/sllecture/
- h2oEnsemble. Distributed machine learning algorithms.
- r2weight. In development; the current version can be downloaded from GitHub: https://github.com/benkeser/r2weight
SLIDE 60 References
[1] N L'ntshotsholé Jumbe, Jeffrey C Murray, and Steven Kern. Data sharing and inductive learning – Toward healthy birth, growth, and development. New England Journal of Medicine, 2016.
[2] AB Feranil, SA Gultiano, and LS Adair. The Cebu Longitudinal Health and Nutrition Survey: Two decades later. Asia-Pacific Population Journal, 23(3), 2008.
[3] Mark J van der Laan and Eric C Polley. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):1–23, 2007.
[4] Mark J van der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. UC Berkeley Division of Biostatistics Working Paper Series, 2003.
[5] Alan E Hubbard, Sara Kherad-Pajouh, and Mark J van der Laan. Statistical inference for data adaptive target parameters. The International Journal of Biostatistics, 12(1):3–19, 2016.
[6] Eric Polley and Mark van der Laan. SuperLearner: Super Learner Prediction, 2013. R package version 2.0-10.