SLIDE 1 Optimally Combining Outcomes to Improve Prediction
David Benkeser benkeser@berkeley.edu
November 15, 2016 UC Berkeley Biostatistics
SLIDE 2
Acknowledgments
Collaborators: Mark van der Laan, Alan Hubbard, Ben Arnold, Jack Colford, Andrew Mertens, Oleg Sofrygin. Funding: Bill and Melinda Gates Foundation, OPP1147962
SLIDE 3
Motivation
The Gates Foundation's HBGDki is a program aimed at improving early childhood development. [1] A "precision public health" initiative: getting the right intervention to the right child at the right time.
SLIDE 4
Cebu Study
The Cebu Longitudinal Health and Nutrition Survey enrolled pregnant women between 1983 and 1984. [2] Children were followed every 6 months for two years after birth, and again at ages 8, 11, 16, 18, and 21. Research focuses on the long-term effects of prenatal and early-childhood exposures on later outcomes.
SLIDE 5 Cebu Study
Questions:
- 1. Can a screening tool be constructed using information available in early childhood to predict neurocognitive deficits later in life?
- 2. What variables improve prediction of neurocognitive outcomes?
- 3. Do somatic growth measures improve predictions of neurocognitive outcomes?
SLIDE 6
Observed data
The observed data are n = 2,166 observations O = (X, Y).
X = D covariates. Early childhood measurements: health care access, sanitation, parental information, gestational age.
Y = J outcomes. Achievement test scores at 11 years old: Math, English, Cebuano.
SLIDE 7 Combining test scores
Reasons to combine test scores into single score:
- 1. No scientific reason to prefer one score
- 2. Predicting deficit in any domain is important
- 3. Avoid multiple comparisons
- 4. Improve prediction?
SLIDE 8 Existing methods
Scores could be standardized and summed:
Zi = ∑j (Yij − Ȳj) / σ̂j
Downsides:
- 1. Somewhat ad-hoc
- 2. Outcomes may not be strongly related to X
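The standardize-and-sum score above can be sketched in a few lines. This is a minimal NumPy illustration (the function name and toy data are my own, not from the talk); each outcome column is centered by its mean, scaled by its sample standard deviation, and the columns are summed.

```python
import numpy as np

def standardized_sum(Y):
    """Combine J outcomes into one score: standardize each column
    (subtract its mean, divide by its sample SD), then sum across outcomes."""
    Y = np.asarray(Y, dtype=float)
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    return Z.sum(axis=1)

# Two outcomes on very different scales; standardization puts them
# on equal footing before summing.
Y = np.array([[100.0, 1.0],
              [110.0, 2.0],
              [120.0, 3.0]])
Z = standardized_sum(Y)  # each column standardizes to [-1, 0, 1]
```

Note that every outcome gets equal weight here regardless of how predictable it is, which is exactly the "somewhat ad hoc" downside noted above.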
SLIDE 9 Existing methods
Principal components / factor analysis of Y:
- Perform some transformation of Y
- Look at eigenvalues, scree plots, etc. to choose factors
- Decide on a linear combination of factors
Downsides:
- 1. Very ad-hoc
- 2. Difficult to interpret/explain
- 3. Outcomes may not be strongly related to X
SLIDE 10 Existing methods
Supervised methods, e.g., canonical correlation, redundancy analysis, partial least squares, etc... Downsides:
- 1. Difficult to interpret
- 2. Outcomes not naturally combined into single score
- 3. Inference not straightforward
SLIDE 11 Combining scores
What are characteristics of a good composite score?
- 1. Simple to interpret
- 2. Reflect the scientific goal (prediction)
- 3. Procedure can be fully pre-specified
Consider a simple weighted combination of outcomes,
Yω = ∑j ωj Yj , with ωj > 0 for all j and ∑j ωj = 1
Can we choose the weights to optimize our predictive power?
SLIDE 12
Predicting composite outcome
Consider predicting the composite outcome Yω with a prediction function ψω(X). A measure of the performance of ψω is
R²0,ω(ψω) = 1 − E0[{Yω − ψω(X)}²] / E0[{Yω − µ0,ω}²] , where µ0,ω = E0(Yω).
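The sample analogue of this R² is one minus the ratio of the prediction's mean squared error to the mean squared error of simply predicting the mean. A small sketch (function name and data are illustrative, not from the talk):

```python
import numpy as np

def r_squared(y, pred):
    """Sample analogue of R^2: 1 - MSE(prediction) / MSE(marginal mean)."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    mse_model = np.mean((y - pred) ** 2)
    mse_mean = np.mean((y - y.mean()) ** 2)
    return 1.0 - mse_model / mse_mean

y = np.array([1.0, 2.0, 3.0, 4.0])
# Perfect predictions give R^2 = 1; predicting the mean gives R^2 = 0.
```

Values near 1 mean the prediction function explains most of the variance in Yω; values near 0 mean it does no better than the marginal mean.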
SLIDE 13 Predicting composite outcome
Easy to show that R²0,ω is maximized by using
ψ0,ω(X) = E0(Yω | X) = E0(∑j ωj Yj | X) = ∑j ωj E0(Yj | X) = ∑j ωj ψ0,j(X)
The best predictor of the composite outcome is the weighted combination of the best predictor of each outcome.
SLIDE 14
Choosing weights
For each choice of weights,
R²0,ω(ψ0,ω) = 1 − E0[{Yω − ψ0,ω(X)}²] / E0[{Yω − µ0,ω}²]
is the best we could do predicting the composite outcome. Now choose the weights that maximize this R-squared:
ω0 = argmaxω R²0,ω(ψ0,ω)
The statistical goal is to estimate ω0 and ψ0,ω0.
SLIDE 15
Simple example
Let X = (X1, . . . , X6) and Y = (Y1, . . . , Y6), with
Xd ∼ Uniform(0, 4) for d = 1, . . . , 6
Yj ∼ Normal(0, 25) for j = 1, 2, 3
Yj ∼ Normal(X1 + 2X2 + 4X3 + 2Xj, 25) for j = 4, 5, 6
Y1, Y2, and Y3 are noise; X1, X2, X3 predict Y4, Y5, Y6; Xj predicts only Yj for j = 4, 5, 6.

Outcome        R²
Y1, Y2, Y3     0.00
Y4, Y5, Y6     0.57
Standardized   0.37
Optimal        0.87
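The simple example above can be reproduced with a short simulation. This sketch (interpreting Normal(·, 25) as variance 25, i.e. SD 5) draws the data as described and fits an ordinary least squares regression per outcome; exact R² values will vary with the random seed, but a noise outcome should sit near 0 and a signal outcome near the 0.57 on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0, 4, size=(n, 6))

Y = np.empty((n, 6))
for j in range(3):                        # Y1-Y3: pure noise
    Y[:, j] = rng.normal(0, 5, n)
for j in range(3, 6):                     # Y4-Y6: signal plus noise
    mean = X[:, 0] + 2 * X[:, 1] + 4 * X[:, 2] + 2 * X[:, j]
    Y[:, j] = rng.normal(mean, 5)

def ols_r2(X, y):
    """In-sample R^2 of an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_noise = ols_r2(X, Y[:, 0])   # near 0: nothing to predict
r2_signal = ols_r2(X, Y[:, 3])  # near 0.57: matches the slide's per-outcome value
```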
SLIDE 16
Predicting each outcome
The best predictor of the composite outcome is the weighted combination of the best predictor of each outcome. How should we go about estimating a prediction function for each outcome?
- Linear regression, with interactions, nonlinear terms, or splines (with different degrees?)
- Penalized linear regression, with different penalties?
- Random forests, with different tuning parameters?
- Gradient boosting? Support vector machines? Deep neural networks?
- Ad infinitum...
SLIDE 17 Predicting each outcome
We have no way of knowing a priori which algorithm is best; this depends on the truth! The best prediction function might differ across outcomes.
How can we objectively evaluate M algorithms for Yj?
Option 1: Train algorithms, see which has the best R²
Option 2: Train algorithms, do a new experiment to evaluate (expensive!)
Option 3: Cross-validation!
SLIDE 18
Cross validation
Consider randomly splitting the data into K different pieces.
S1 | S2 | S3 | S4 | S5
SLIDE 19
Cross validation
Validation (V1 = {i ∈ S1}) and training (T1 = {i ∉ S1})
V1 | T1 | T1 | T1 | T1
SLIDE 20 Cross validation
For m = 1, . . . , M, fit algorithms using the training sample. For example, Ψm could correspond to a linear regression.
- 1. Estimate parameters of the regression model using {Oi : i ∈ T1}.
- 2. Ψm(T1) is now a prediction function.
To predict on a new observation x, Ψm(T1)(x) = β̂0 + β̂1 x1 + . . . + β̂D xD
SLIDE 21 Cross validation
The algorithm Ψm could be more complicated:
- 1. Estimate parameters of the full regression model using {Oi : i ∈ T1}
- 2. Use backward selection, eliminating variables with p-value > 0.1
- 3. Stop when no more variables are eliminated
- 4. Ψm(T1) is now a prediction function
To predict on a new observation x, Ψm(T1)(x) = x_final β̂_final
SLIDE 22 Cross validation
The algorithm Ψm could be a machine learning algorithm,
- 1. Train a support vector machine using {Oi : i ∈ T1}
- 2. Ψm(T1) is now a “black box” prediction function
To predict on a new observation x, feed x into black box and get prediction back.
SLIDE 23
Cross validation
For m = 1, . . . , M, compute the error on the validation sample:
Em,1 = (1/|V1|) ∑i∈V1 {Yi − Ψm(T1)(Xi)}²
V1 | T1 | T1 | T1 | T1
SLIDE 24 Cross validation
For m = 1, . . . , M, compute the error on the validation sample:
Em,1 = (1/|V1|) ∑i∈V1 {Yi − Ψm(T1)(Xi)}²
To compute:
- 1. Obtain predictions on the validation sample using algorithms fit on the training sample.
- 2. Average the squared residual over the validation observations.
It is as though we did another experiment (of size |S1|) to evaluate the algorithms!
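The validation-error computation above, repeated over all K splits, can be sketched compactly. This is an illustrative NumPy implementation (function names and the two toy candidate algorithms are mine): each candidate is a function that fits on the training rows and predicts on the validation rows, and the result is the average validation MSE per algorithm.

```python
import numpy as np

def cv_mse(X, y, fit_predict_fns, K=5, seed=1):
    """K-fold cross-validated MSE for each candidate algorithm.

    fit_predict_fns: list of functions (X_train, y_train, X_valid) -> predictions.
    Returns an array of length M: the average validation error E_bar_m."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % K  # balanced fold labels
    errs = np.zeros((len(fit_predict_fns), K))
    for k in range(K):
        tr, va = folds != k, folds == k
        for m, fit_predict in enumerate(fit_predict_fns):
            pred = fit_predict(X[tr], y[tr], X[va])
            errs[m, k] = np.mean((y[va] - pred) ** 2)
    return errs.mean(axis=1)

def ols(Xtr, ytr, Xva):
    """Candidate 1: linear regression with intercept."""
    Xd = np.column_stack([np.ones(len(Xtr)), Xtr])
    beta, *_ = np.linalg.lstsq(Xd, ytr, rcond=None)
    return np.column_stack([np.ones(len(Xva)), Xva]) @ beta

def mean_only(Xtr, ytr, Xva):
    """Candidate 2: ignore X, predict the training mean."""
    return np.full(len(Xva), ytr.mean())

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(0, 0.5, 300)
errors = cv_mse(X, y, [mean_only, ols])  # OLS should win on this linear truth
```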
SLIDE 25
Cross validation
Validation (V2 = {i ∈ S2}) and training (T2 = {i ∉ S2})
T2 | V2 | T2 | T2 | T2
SLIDE 26
Cross validation
For m = 1, . . . , M, fit algorithms using the training sample.
T2 | V2 | T2 | T2 | T2
SLIDE 27
Cross validation
For m = 1, . . . , M, compute the error on the validation sample:
Em,2 = (1/|V2|) ∑i∈V2 {Yi − Ψm(T2)(Xi)}²
T2 | V2 | T2 | T2 | T2
SLIDE 28
Cross validation
Continue until each split has served as the validation sample once.
T3 | T3 | V3 | T3 | T3
SLIDE 29
Cross validation
Continue until each split has served as the validation sample once.
T4 | T4 | T4 | V4 | T4
SLIDE 30
Cross validation
Continue until each split has served as the validation sample once.
T5 | T5 | T5 | T5 | V5
SLIDE 31 Cross validation selector
The overall performance of algorithm m is
Ēm = (1/K) ∑k Em,k ,
the average mean squared error across splits. At this point, we could choose m*, the algorithm with the lowest error. This is called the cross-validation selector.
The prediction function is Ψm*(F), where F = {1, . . . , n} is the full data.
SLIDE 32
Super learner
Alternatively, consider an ensemble prediction function
Ψ = ∑m αm Ψm , with αm > 0 for all m and ∑m αm = 1 ,
and choose α to minimize the cross-validated error. This often has superior performance to choosing the single best algorithm. This estimator is referred to as the Super Learner. [3]
SLIDE 33
Super learner
For example, linear regression might capture one feature of the data, while a support vector machine captures another. The prediction function Ψ(x) = 0.5 Ψlinmod(x) + 0.5 Ψsvm(x) might be better than Ψlinmod or Ψsvm alone. Computing the best weighting of algorithms is computationally simple after cross-validation.
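The ensemble weighting step can be sketched for two algorithms with a simple grid search over the convex weight. This is an illustration, not the Super Learner's actual optimizer (the real procedure uses cross-validated predictions; here two noisy predictors of a known truth stand in for them, and all names are mine):

```python
import numpy as np

def best_convex_weight(y, pred_a, pred_b, grid=np.linspace(0, 1, 101)):
    """Grid search for alpha in [0, 1] minimizing the MSE of the ensemble
    alpha * pred_a + (1 - alpha) * pred_b."""
    mses = [np.mean((y - a * pred_a - (1 - a) * pred_b) ** 2) for a in grid]
    return grid[int(np.argmin(mses))]

rng = np.random.default_rng(3)
truth = rng.normal(size=500)
# Two predictors with independent errors: averaging them cancels noise.
pred_a = truth + rng.normal(0, 1, 500)
pred_b = truth + rng.normal(0, 1, 500)
alpha = best_convex_weight(truth, pred_a, pred_b)
ensemble = alpha * pred_a + (1 - alpha) * pred_b
```

With independent, equal-variance errors the optimal weight is near 0.5, and the ensemble's MSE is lower than either predictor alone; this is the "captures different features" intuition on the slide.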
SLIDE 34
Oracle inequality
The name Super Learner derives from an important theoretical result called the oracle inequality: for a large enough sample size, the Super Learner predicts as well as the (unknown) best algorithm considered. The number of algorithms considered may be large and is allowed to grow with n, e.g., Mn = n². [4]
SLIDE 35
Combined prediction function
We now have a super learner prediction function Ψj(F) for each outcome Yj. For any choice of weights, we have a prediction function for the combined outcome:
ψn,ω = ∑j ωj Ψj(F)
We still need to choose the weights that maximize predictive performance for the combined outcome.
SLIDE 36 Choosing optimal weights
How do we go about estimating ω0, the weights that yield the highest R²?
Option 1: Get SL predictions, maximize R² over the weights
Option 2: Do a new experiment, maximize R² on the new data (expensive!)
Option 3: Cross-validation!
SLIDE 37
Cross validation
Randomly split the data into K different pieces.
S1 | S2 | S3 | S4 | S5
SLIDE 38
Cross validation
Training (T1 = {i ∉ S1}) and validation (V1 = {i ∈ S1})
V1 | T1 | T1 | T1 | T1
SLIDE 39
Cross validation
Fit the Super Learner on the training data, Ψ(T1).
V1 | T1 | T1 | T1 | T1
SLIDE 40 Cross validation
Fit the Super Learner on the training data, Ψ(T1). The Super Learner runs its own inner cross-validation within T1:
V1 | T1 | T1 | T1 | T1
Ṽ1 | T̃1 | T̃1 | T̃1 | T̃1
SLIDE 41 Cross validation
Fit the Super Learner on the training data, Ψ(T1).
V1 | T1 | T1 | T1 | T1
T̃2 | Ṽ2 | T̃2 | T̃2 | T̃2
SLIDE 42 Cross validation
Fit the Super Learner on the training data, Ψ(T1).
V1 | T1 | T1 | T1 | T1
T̃3 | T̃3 | Ṽ3 | T̃3 | T̃3
SLIDE 43 Cross validation
Fit the Super Learner on the training data, Ψ(T1).
V1 | T1 | T1 | T1 | T1
T̃4 | T̃4 | T̃4 | Ṽ4 | T̃4
SLIDE 44 Cross validation
Fit the Super Learner on the training data, Ψ(T1).
V1 | T1 | T1 | T1 | T1
T̃5 | T̃5 | T̃5 | T̃5 | Ṽ5
SLIDE 45
Cross validation
Fit the Super Learner on the training data, Ψ(T2).
T2 | V2 | T2 | T2 | T2
SLIDE 46
Cross validation
Fit the Super Learner on the training data, Ψ(T3).
T3 | T3 | V3 | T3 | T3
SLIDE 47 Estimating the weights
Objective performance of the combined super learner for predicting the combined outcome:
Ēω(Ψω) = (1/K) ∑k (1/|Vk|) ∑i∈Vk {Yω,i − Ψω(Tk)(Xi)}²
To compute:
- 1. Fit J super learners in each of the K splits.
- 2. Compute the k-th combined prediction function Ψω(Tk) = ∑j ωj Ψj(Tk).
- 3. Obtain combined predictions Ψω(Tk)(Xi).
- 4. Compute the combined outcome Yω,i.
- 5. Compute the average squared error in each split.
- 6. Average over splits.
SLIDE 48
Estimating the weights
For a given ω, the cross-validated R² is
R²n,ω(Ψω) = 1 − Ēω(Ψω) / [(1/n) ∑i {Yω,i − Ȳω}²]
Our estimate of the optimal weights is
ωn = argmaxω R²n,ω(Ψω)
Our estimate of the prediction function is
ψn,ωn = ∑j ωj,n Ψj(F) ,
where Ψj(F) denotes the j-th super learner fit using all of the data.
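The weight-selection step can be sketched for J = 2 outcomes with a grid search over the simplex. This is an illustration under assumed toy data (one noise outcome, one predictable outcome, with per-outcome predictions standing in for cross-validated super learner fits; all names are mine):

```python
import numpy as np

def optimal_outcome_weights(Y, preds, grid=np.linspace(0, 1, 101)):
    """For J = 2 outcomes, search over omega = (w, 1 - w) for the weights
    maximizing the R^2 of the combined prediction sum_j w_j * preds_j
    for the combined outcome sum_j w_j * Y_j."""
    best_w, best_r2 = None, -np.inf
    for w in grid:
        yw = w * Y[:, 0] + (1 - w) * Y[:, 1]          # combined outcome
        pw = w * preds[:, 0] + (1 - w) * preds[:, 1]  # combined prediction
        r2 = 1 - np.mean((yw - pw) ** 2) / np.mean((yw - yw.mean()) ** 2)
        if r2 > best_r2:
            best_w, best_r2 = w, r2
    return best_w, best_r2

rng = np.random.default_rng(4)
n = 1000
signal = rng.normal(size=n)
Y = np.column_stack([rng.normal(0, 1, n),               # outcome 1: pure noise
                     signal + rng.normal(0, 0.5, n)])   # outcome 2: predictable
preds = np.column_stack([np.zeros(n), signal])          # best predictor per outcome
w, r2 = optimal_outcome_weights(Y, preds)
```

The selected weight should put nearly all mass on the predictable outcome, mirroring how the procedure downweights outcomes that are not related to X.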
SLIDE 49 Estimating predictive performance
Researchers are likely interested in evaluating the performance of ψn,ωn for predicting Yωn. How good is the combined super learner at predicting the combined outcome? To evaluate performance:
Option 1: Report R²n,ωn and call it a day.
Option 2: Estimate ψn,ωn, do a new experiment, evaluate predictions (expensive!)
Option 3: More cross-validation!!!
SLIDE 50
Cross validation
Pictures omitted for the sanity of the audience.
SLIDE 51 Estimating performance
The entire procedure is cross-validated to obtain an honest estimate of the performance of the estimator.
- 1. Compute ωn and Ψωn in the training sample.
- 2. Estimate R² in the validation sample.
- 3. Average over splits.
It is relatively straightforward to construct (closed-form) confidence intervals and hypothesis tests. [5] Simulations show nominal confidence interval coverage with n ≈ 500.
SLIDE 52
Variable importance measures
We can define the importance of a variable for prediction as the difference in R² with and without the variable. It is relatively straightforward to construct (closed-form) confidence intervals and hypothesis tests for this estimate. This provides simple inference for the question, "Does measuring Xd improve my ability to predict the composite outcome?"
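The difference-in-R² importance measure can be sketched as follows. This illustration substitutes OLS for the super learner and uses in-sample R² for brevity (the actual procedure uses cross-validated R²; names and data are mine):

```python
import numpy as np

def ols_r2(X, y):
    """In-sample R^2 of an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def delta_r2(X, y, d):
    """Importance of variable d: R^2 using all variables minus R^2
    with variable d removed."""
    X_minus_d = np.delete(X, d, axis=1)
    return ols_r2(X, y) - ols_r2(X_minus_d, y)

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + rng.normal(0, 1, 500)  # only variable 0 matters
imp0 = delta_r2(X, y, 0)  # large: removing X0 destroys predictive power
imp1 = delta_r2(X, y, 1)  # near zero: X1 carries no signal
```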
SLIDE 53 Cebu Study
Questions:
- 1. Can a screening tool be constructed using information available in early childhood to predict neurocognitive deficits later in life?
- 2. What variables improve prediction of neurocognitive outcomes?
- 3. Do somatic growth measures improve predictions of neurocognitive outcomes?
SLIDE 54
Cebu results
The Super Learner library consisted of elastic net regressions with different tuning parameters to predict each outcome. We used ten folds for each layer of cross-validation. The estimated optimal outcome score was
Yωn = 0.71 Yenglish + 0.17 Ymath + 0.12 Ycebuano
SLIDE 55 Cebu results
Predictive power was modest, but strongest for the combined score.
[Figure: cross-validated R²n with 95% CIs (axis 0.00 to 0.30) for the Cebuano, Math, Standardized, English, and Optimal scores.]
SLIDE 56 Cebu results
Variable importance for each variable.
[Figure: ΔR²n with 95% CIs (axis −0.02 to 0.06) for each variable: SES, WAZ at 24 months, HAZ at 12 months, HAZ at 24 months, WAZ at 6 months, child dependency ratio, mother's age at first birth, income, gestational age, sanitation, clean water, mother's parity, mother smoked, WAZ at birth, child:adult ratio, crowding index, HAZ at 6 months, mother's height, HAZ at birth, mother's age, WAZ at 12 months, WAZ at 18 months, urban score, health care access, mother's marital status, HAZ at 18 months, father's age, health care utilization, father's education, mother's education, sex.]
SLIDE 57 Cebu results
Variable importance for groups of variables.
[Figure: ΔR²n with 95% CIs (axis −0.02 to 0.06) for groups of variables: SES, gestational age, mother smoked, sanitation, household, healthcare, growth, sex, parental.]
SLIDE 58
Conclusions
How composite outcomes are formed should be informed by the scientific goal; when possible, opt for simplicity. Cross-validation is a powerful tool for mimicking repeated experiments. The Super Learner is an objective framework for evaluating and combining different prediction algorithms. Future work: combining outcomes as an exposure, and estimating effects of variables on combined outcomes.
SLIDE 59
Software
R packages:
- SuperLearner [6]. Demonstration: http://benkeser.github.io/sllecture/
- h2oEnsemble. Distributed machine learning algorithms.
- r2weight. In development; the current version can be downloaded from GitHub: https://github.com/benkeser/r2weight
SLIDE 60 References
[1] N L'ntshotsholé Jumbe, Jeffrey C Murray, and Steven Kern. Data sharing and inductive learning – Toward healthy birth, growth, and development. New England Journal of Medicine, 2016.
[2] AB Feranil, SA Gultiano, and LS Adair. The Cebu Longitudinal Health and Nutrition Survey: Two decades later. Asia-Pacific Population Journal, 23(3), 2008.
[3] Mark J van der Laan and Eric C Polley. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):1–23, 2007.
[4] Mark J van der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. UC Berkeley Division of Biostatistics Working Paper Series, 2003.
[5] Alan E Hubbard, Sara Kherad-Pajouh, and Mark J van der Laan. Statistical inference for data adaptive target parameters. The International Journal of Biostatistics, 12(1):3–19, 2016.
[6] Eric Polley and Mark van der Laan. SuperLearner: Super Learner Prediction, 2013. R package version 2.0-10.