SLIDE 1
Super learner with application to Predicting HIV-1 Drug Resistance

Beilin Jia

University of North Carolina at Chapel Hill

April 24, 2018

Beilin Jia (UNC-Chapel Hill) BIOS 740 Final Presentation April 24, 2018 1 / 20

SLIDE 2

Overview

1. Super learner
2. Simulation
3. HIV-1 Example
4. Discussion

SLIDE 3

Super learner

The super learner is a prediction algorithm that applies a set of candidate learners to the observed data and chooses the optimal learner for a given prediction problem based on cross-validated risk.
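As a concrete illustration of this selection step, the following is a minimal numpy-only sketch of the discrete super learner: estimate each candidate's risk by K-fold cross-validation, then pick the learner with the lowest estimated risk. The two toy candidates (OLS and an intercept-only model) and the simulated data are my own illustration, not the candidates used in the paper.

```python
import numpy as np

def cv_risk(learner_fit, learner_predict, X, y, n_folds=10, seed=0):
    """Average squared-error risk of a learner, estimated by K-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    risks = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = learner_fit(X[train], y[train])
        pred = learner_predict(model, X[val])
        risks.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(risks))

# Two toy candidate learners: OLS (with intercept) and an intercept-only model.
fit_ols = lambda X, y: np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]
pred_ols = lambda b, X: np.c_[np.ones(len(X)), X] @ b
fit_mean = lambda X, y: y.mean()
pred_mean = lambda m, X: np.full(len(X), m)

# Simulated linear data, so OLS should win the cross-validated comparison.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

risks = {
    "ols": cv_risk(fit_ols, pred_ols, X, y),
    "mean": cv_risk(fit_mean, pred_mean, X, y),
}
best = min(risks, key=risks.get)  # discrete super learner: lowest CV risk
```

The same loop extends to any number of candidate learners; only the fit/predict pairs change.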

SLIDE 4

Super learner

Sinisi et al. (2007): based on unified loss-based estimation theory.

Three main steps:
1. Define the parameter of interest in terms of a loss function.
2. Construct a set of candidate estimators based on the loss function.
3. Select the optimal estimator based on cross-validation.

In the paper, the candidate learning algorithms include:
- Least Angle Regression (LARS)
- Logic Regression
- Deletion/Substitution/Addition (D/S/A) algorithm
- Classification and Regression Trees (CART)
- Ridge Regression
- Linear Regression

The cross-validation selector selects the learner with the best performance on the validation sets.

SLIDE 5

Super learner

The super learner asymptotically performs as well as the best of its candidate estimators, as long as the number of candidate learners is polynomial in the sample size; if one of the candidate estimators achieves a parametric rate of convergence, the super learner converges at an almost parametric rate. (Sinisi and van der Laan, 2004; Van der Laan and Dudoit, 2003; Van der Laan et al., 2006; Van der Vaart et al., 2006)

SLIDE 6

Simulation setup

n = 500 observations:

y_i = 2 w1 w10 + 4 w2 w7 + 3 w4 w5 − 5 w6 w10 + 3 w8 w9 + ε_i,  ε_i ~ N(0, 1),  i = 1, ..., 500,
w_j ~ Bernoulli(0.4),  j = 1, ..., 10.

Y: outcome. W: 10-dimensional covariates. 10-fold cross-validation. Internal cross-validation to select the optimal fraction in LARS and the fine-tuning parameters for Logic Regression and D/S/A.
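The data-generating distribution above can be simulated in a few lines; this is a direct transcription of the slide's model (Bernoulli(0.4) covariates, standard normal noise), with the seed chosen arbitrarily for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(2018)
n = 500
# w_j ~ Bernoulli(0.4) for j = 1..10; column j-1 of W holds w_j.
W = rng.binomial(1, 0.4, size=(n, 10))
w = {j: W[:, j - 1] for j in range(1, 11)}
# Outcome: sum of pairwise interactions plus N(0, 1) noise, as on the slide.
y = (2 * w[1] * w[10] + 4 * w[2] * w[7] + 3 * w[4] * w[5]
     - 5 * w[6] * w[10] + 3 * w[8] * w[9] + rng.normal(size=n))
```

Note that every term in the true regression function is a two-way interaction of binary covariates, which is why interaction-friendly learners such as Logic Regression and D/S/A do well in the comparison that follows.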

SLIDE 7

Simulation results

Table 1: Super Learner: Cross-Validated Risks of Candidate Learners (n=500)

Method                  Median   Mean   Std Error
Linear Regression (1)    4.477   4.414   0.76
Linear Regression (2)    1.182   1.165   0.16
LARS (1)                 4.594   4.719   0.92
LARS (2)                 1.179   1.183   0.13
Logic Regression         1.026   1.043   0.21
D/S/A                    1.026   1.055   0.19
CART                     1.773   1.828   0.60
Ridge Regression         1.176   1.157   0.16

SLIDE 8

Simulation results

Next step: apply Logic Regression to the entire dataset. The final logic tree:

−3.09 · ((not w9) or (not w8)) + 4.58 · ((not w10) or (not w6))
+ 4.17 · (((not w6) and w6) or (w7 and w2)) − 3.09 · ((not w5) or (not w4)) + 0.839 · w1

Due to close competition between Logic Regression and D/S/A, Sinisi et al. (2007) evaluated the two estimators on an independent test set of sample size 5,000:
- Logic Regression: mean squared prediction error (MSPE) = 1.37, R² = 0.84.
- D/S/A: MSPE = 1.05, R² = 0.88.

SLIDE 9

Simulation results

Sinisi et al. (2007) applied the super learner to datasets of increasing sample size, n = 100, n = 1,000, n = 10,000, using the same set of candidate learners. The estimated cross-validated risks vary less as the sample size increases. Both D/S/A and Logic Regression converge at a parametric rate to the true model.

SLIDE 10

Simulation results

Try a different data-generating distribution:

y_i = 2 w1 w10 + 4 w2 w7 + 3 w4 w5 − 5 w6 w10 + 3 w8 w9 + w1 w2 w4 − 2 w7 (1 − w6) w9 − 4 (1 − w10) w1 (1 − w4) + ε_i,  ε_i ~ N(0, 1),
w_j ~ Bernoulli(0.4),  j = 1, ..., 10.

Here no candidate learner converges at a parametric rate to the true model. Add a new candidate learner: a convex combination of the other candidate learners, e.g.

ŷ_convex,α = α · ŷ_DSA + (1 − α) · ŷ_Logic.

D/S/A has the lowest cross-validated risk among the original candidates, but the convex combination (α = 0.8316) of the two models outperforms both.
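The weight α of such a convex combination can itself be chosen by minimizing cross-validated (or held-out) risk over a grid. The sketch below illustrates that idea with simulated stand-in predictions; the arrays `pred_dsa` and `pred_logic` are hypothetical placeholders for the held-out predictions of two fitted learners, not output from the paper's actual D/S/A or Logic Regression fits.

```python
import numpy as np

def convex_risk(alpha, y, pred_a, pred_b):
    """Squared-error risk of the blend alpha * pred_a + (1 - alpha) * pred_b."""
    blend = alpha * pred_a + (1 - alpha) * pred_b
    return float(np.mean((y - blend) ** 2))

# Hypothetical held-out predictions from two learners with independent errors.
rng = np.random.default_rng(0)
y = rng.normal(size=300)
pred_dsa = y + rng.normal(scale=0.5, size=300)    # stand-in for D/S/A
pred_logic = y + rng.normal(scale=0.7, size=300)  # stand-in for Logic Regression

# Grid search for the alpha with the lowest estimated risk.
alphas = np.linspace(0.0, 1.0, 101)
risks = [convex_risk(a, y, pred_dsa, pred_logic) for a in alphas]
alpha_hat = float(alphas[int(np.argmin(risks))])
```

Because the grid includes α = 0 and α = 1, the selected blend can never do worse (on the selection data) than either learner alone; when the two learners make partly independent errors, it typically does strictly better.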

SLIDE 11

HIV-1 example

Data from the Stanford HIV Reverse Transcriptase and Protease Sequence Database (Rhee et al., 2006). Predict viral susceptibility to protease inhibitors (PIs) from mutations in the protease region of the viral strand. Use non-polymorphic treatment-selected mutations (TSMs) as predictors: 58 TSMs, occurring at 34 positions in protease. Outcome: standardized log fold change in drug susceptibility, where

Fold change = (IC50 of an isolate) / (IC50 of a standard wildtype control isolate).

IC50 (inhibitory concentration) is the concentration of a drug needed to inhibit viral replication by 50%.
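The outcome definition above amounts to a one-line computation per isolate; here is a minimal sketch. The base-10 logarithm and the function name are my assumptions, and any further standardization (e.g. centering and scaling across isolates) would be applied on top of this.

```python
import math

def log_fold_change(ic50_isolate, ic50_wildtype):
    """log10 of the fold change in drug susceptibility relative to the
    wildtype control isolate (base-10 log assumed for illustration)."""
    return math.log10(ic50_isolate / ic50_wildtype)

# An isolate needing 10x the wildtype IC50 is 10-fold resistant:
# log_fold_change(10.0, 1.0) == 1.0
```

A resistant isolate (higher IC50 than wildtype) thus gets a positive value, a hypersusceptible one a negative value, and the wildtype itself zero.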

SLIDE 12

HIV-1 example, results

Apply the super learner to predict susceptibility to a single PI, nelfinavir (NFV). 10-fold cross-validation; candidate learners: LARS, Logic Regression, D/S/A, CART, Ridge Regression, and linear regression. Rhee et al. (2006) found that including all two-way interactions among the mutations as inputs did not improve prediction accuracy. Optimal learner: linear regression with all 58 main terms, average cross-validated risk = 0.187.

SLIDE 13

HIV-1 example, results

The D/S/A estimator has average cross-validated risk = 0.188. Apply linear regression and D/S/A to the entire dataset. Cross-validation selects a final D/S/A estimator with 40 main terms; the fit is only marginally improved by including the other 18 mutations in the prediction model.

SLIDE 14

HIV-1 example, results

Figure 1: D/S/A Estimator applied to learning sample, size ∈ {1, ..., 50}

SLIDE 15

HIV-1 example, results

Figure 2: Linear Regression Model Fit

SLIDE 16

HIV-1 example, results

Figure 3: D/S/A Estimator: best model of each size from 1 to 20 (e.g., best model of size 1: L90M; best model of size 2: L90M and 30N; etc.)

The p-values for the coefficients from the linear regression fit and the list of best D/S/A models of each size both indicate the importance of each candidate mutation for resistance to NFV. The two models yield quite comparable insight into the set of mutations key to predicting susceptibility to NFV.

SLIDE 17

Discussion

Instead of relying on a single learning algorithm for a prediction problem, a better approach is to apply as many candidate learners as feasible and choose the optimal one. There is a benefit to using convex combinations as additional candidate learners. In practice, there is no guarantee that the super learner will always select the optimal learner.

Variability in the estimates of cross-validated risk for each candidate learner clearly depends on the size of the dataset. The candidate learner selected can shift with increasing sample size. Worthwhile to evaluate not only the final optimal estimator but also competitive estimators.

SLIDE 18

Reference I

Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L. Brutlag, and Robert W. Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355–17360, 2006.

Sandra E. Sinisi and Mark J. van der Laan. Deletion/substitution/addition algorithm in learning with applications in genomics. Statistical Applications in Genetics and Molecular Biology, 3(1):1–38, 2004.

Sandra E. Sinisi, Eric C. Polley, Maya L. Petersen, Soo-Yon Rhee, and Mark J. van der Laan. Super learning: an application to the prediction of HIV-1 drug resistance. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.

Mark J. van der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. 2003.

SLIDE 19

Reference II

Mark J. van der Laan, Sandrine Dudoit, and Aad W. van der Vaart. The cross-validated adaptive epsilon-net estimator. Statistics & Decisions, 24(3):373–395, 2006.

Aad W. van der Vaart, Sandrine Dudoit, and Mark J. van der Laan. Oracle inequalities for multi-fold cross validation. Statistics & Decisions, 24(3):351–371, 2006.

SLIDE 20

Thanks for listening!
