

SLIDE 1

Introduction to Super Learning

Ted Westling, PhD
Postdoctoral Researcher
Center for Causal Inference
Perelman School of Medicine
University of Pennsylvania
September 25, 2018

SLIDES 2–4

Learning Goals

  • Conceptual understanding of Super Learning (SL)
  • Comfort with the SuperLearner R package
  • Awareness of the mathematical backbone of SL

SLIDE 5

Outline

  • I. Motivation and description of SL (30 minutes)
  • II. Lab 1: Vanilla SL for a continuous outcome (30 minutes)
  • III. Mathematical presentation of SL (20 minutes)
  • IV. Lab 2: Vanilla SL for a binary outcome (30 minutes)

15 minute break

SLIDE 6

Outline

15 minute break

  • V. Bells and whistles: Screens, weights, and CV-SL (30 minutes)
  • VI. Lab 3: Binary outcome redux (40 minutes)
  • VII. Lab 4: Case-control analysis of Fluzone vaccine (30 minutes)

SLIDE 7

I. Motivation and description of Super Learning

SLIDES 8–10

Notation

  • Y is a univariate outcome
  • X is a p-variate set of predictors
  • We observe n independent copies (Y1, X1), . . . , (Yn, Xn) from the joint distribution of (Y, X).

SLIDES 11–17

The problem

  • We want to estimate a function, e.g.:
    – Conditional mean (regression) function
    – Conditional quantile function
    – Conditional density function
    – Conditional hazard function
  • Super Learning can be applied in all of the above settings
  • We will focus on estimating the regression function µ(x) := E[Y | X = x].

SLIDES 18–24

Why?

  • 1. Exploratory analysis
  • 2. Imputation of missing values
  • 3. Prediction for new observations
  • 4. Assessing prediction quality/comparing competing estimators
  • 5. Use as a nuisance parameter estimator
  • 6. Confirmatory analysis/hypothesis testing (not our goal here)

SLIDES 25–30

We want to estimate µ(x) = E[Y | X = x]. How should we do it?

  • GAM
  • Random Forest
  • Neural network
  • GLM

SLIDE 31

How do we choose which algorithm to use?

SLIDE 32

Super Learning is: an ensemble method for combining predictions from many candidate machine learning algorithms

SLIDES 33–36

Measuring algorithm performance

  • Suppose µ̂1, . . . , µ̂K are candidate estimators of µ.
  • k will always index estimators, and i will always index observations (e.g. study participants).
  • The mean squared error of µ̂k, MSE(µ̂k) = E[(Y − µ̂k(X))²], measures the performance of µ̂k as an estimator of µ.
  • If we knew MSE(µ̂k), we could choose the µ̂k with the smallest MSE(µ̂k).

SLIDES 37–41

Estimating MSE

MSE(µ̂k) = E[(Y − µ̂k(X))²]

  • It is tempting to take MSE(µ̂k) = (1/n) Σ_{i=1}^n [Yi − µ̂k(Xi)]².
  • This estimator will favor µ̂k which are overfit, because µ̂k are trained on the same data used to evaluate the MSE.
  • Analogy: a student has the exam questions before taking the exam!
  • Instead, we estimate MSE using cross-validation.

SLIDES 42–43

Cross-validation

  • 1. Split the data into V “folds” of size roughly n/V.
  • 2. For each fold v = 1, . . . , V:
    – the data in folds other than v is called the training set;
    – the data in fold v is called the test/validation set.

SLIDE 44

[Figure] Schematic of 10-fold cross-validation. Gray: training sets. Yellow: validation sets.

SLIDES 45–49

Cross-validation

  • 1. Split the data into V “folds” of size roughly n/V.
  • 2. For each fold v = 1, . . . , V:
    – the data in folds other than v is called the training set;
    – the data in fold v is called the test/validation set;
    – we obtain µ̂k,v using the training set;
    – we obtain µ̂k,v(Xi) for Xi in the validation set Vv.
  • 3. Our cross-validated MSE is
    MSE_CV(µ̂k) = (1/V) Σ_{v=1}^V (1/|Vv|) Σ_{i∈Vv} [Yi − µ̂k,v(Xi)]².
    We average the MSEs of the V validation sets.
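The cross-validated MSE above is a short computation. Here is a minimal sketch in Python (the labs themselves use the SuperLearner R package; this only illustrates the arithmetic), with two hypothetical candidate learners — a grand-mean predictor and a least-squares line:

```python
import numpy as np

def cv_mse(fit, n_folds, X, Y):
    """V-fold cross-validated MSE of one candidate learner.

    `fit(X_train, Y_train)` must return a prediction function x -> yhat.
    """
    n = len(Y)
    folds = np.array_split(np.arange(n), n_folds)   # folds of size roughly n/V
    fold_mses = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), val_idx)
        mu_hat = fit(X[train_idx], Y[train_idx])    # trained on the other V-1 folds
        resid = Y[val_idx] - mu_hat(X[val_idx])     # evaluated on the held-out fold
        fold_mses.append(np.mean(resid ** 2))
    return float(np.mean(fold_mses))                # average MSE over the V validation sets

# Two illustrative candidates (not from the slides):
fit_mean = lambda X, Y: (lambda x: np.full(len(x), Y.mean()))
fit_line = lambda X, Y: (lambda x, b=np.polyfit(X, Y, 1): np.polyval(b, x))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = 2 * X + rng.normal(0, 0.1, 200)   # truth is linear, so the line should win
mse_mean = cv_mse(fit_mean, 10, X, Y)
mse_line = cv_mse(fit_line, 10, X, Y)
```

Because each µ̂k,v is fit without the validation fold, the overfitting bias of the in-sample MSE is avoided.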

SLIDE 50

[Figure] Schematic of 10-fold cross-validation with cross-validated predictions. Gray: training sets. Yellow: validation sets.

SLIDES 51–59

How do we choose V?

  • Large V:
    – more training data, so better for small n
    – more computation time
    – well-suited to high-dimensional covariates
    – well-suited to complicated or non-smooth µ
  • Small V:
    – more test data
    – less computation time.

(People typically use V = 5 or V = 10.)

SLIDES 60–62

“Discrete” Super Learner

  • At this point, we have cross-validated MSE estimates MSE_CV(µ̂1), . . . , MSE_CV(µ̂K) for each of our candidate algorithms.
  • We could simply take as our estimator the µ̂k minimizing these cross-validated MSEs.
  • We call this the “discrete Super Learner”.
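The discrete Super Learner is then a one-liner: select the candidate whose cross-validated MSE is smallest. A sketch with hypothetical CV-MSE values:

```python
# Hypothetical cross-validated MSEs for K = 3 candidate algorithms:
cv_mse = {"glm": 1.42, "gam": 1.31, "random_forest": 1.35}

# Discrete Super Learner: the single candidate minimizing CV-MSE.
discrete_sl = min(cv_mse, key=cv_mse.get)
```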

SLIDES 63–65

Super Learner

  • Let λ = (λ1, . . . , λK) be an element of S_K, the K-dimensional simplex: each λk ∈ [0, 1] and Σ_k λk = 1.
  • Super Learner considers as its set of candidate algorithms all convex combinations µ̂_λ := Σ_{k=1}^K λk µ̂k.
  • The Super Learner is µ̂_λ̂, where
    λ̂ := arg min_{λ∈S_K} MSE_CV(Σ_{k=1}^K λk µ̂k).
    (We use constrained optimization to compute the argmin.)

SLIDES 66–68

Super Learner

λ̂ := arg min_{λ∈S_K} MSE_CV(Σ_{k=1}^K λk µ̂k)

MSE_CV(Σ_{k=1}^K λk µ̂k) = (1/V) Σ_{v=1}^V (1/|Vv|) Σ_{i∈Vv} [Yi − Σ_{k=1}^K λk µ̂k,v(Xi)]².
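Given the matrix of cross-validated predictions, finding λ̂ is a small constrained least-squares problem (the SuperLearner package's default combination method, method.NNLS, uses non-negative least squares). A sketch that instead searches a fine grid over the simplex for K = 2 hypothetical learners — enough to see that a convex combination can beat both candidates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
mu_true = rng.normal(size=n)                 # true regression values mu(X_i)
Y = mu_true + rng.normal(size=n)             # outcomes

# Hypothetical CV predictions mu_hat_{k,v}(X_i) of two candidates,
# each noisy in an independent way:
pred1 = mu_true + rng.normal(scale=0.8, size=n)
pred2 = mu_true + rng.normal(scale=0.8, size=n)
preds = np.column_stack([pred1, pred2])      # n x K matrix

def cv_mse(lam):
    """MSE_CV of the convex combination sum_k lam_k * mu_hat_k."""
    return float(np.mean((Y - preds @ lam) ** 2))

# For K = 2 the simplex is {(t, 1 - t) : t in [0, 1]}; search a fine grid.
grid = np.linspace(0, 1, 1001)
lams = np.column_stack([grid, 1 - grid])
lam_hat = lams[np.argmin([cv_mse(l) for l in lams])]
```

Averaging two independently noisy predictors shrinks the noise, so the optimum typically lies strictly inside the simplex rather than at either single-learner endpoint.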

SLIDES 69–73

Super Learner: steps

Putting it all together:

  • 1. Define a library of candidate algorithms µ̂1, . . . , µ̂K.
  • 2. Obtain the CV-predictions µ̂k,v(Xi) for all k, v and i ∈ Vv.
  • 3. Use constrained optimization to compute the SL weights λ̂ := arg min_{λ∈S_K} MSE_CV(Σ_{k=1}^K λk µ̂k).
  • 4. Take µ̂_SL = Σ_{k=1}^K λ̂k µ̂k.

SLIDE 74

II. Lab 1: Vanilla SL for a continuous outcome

SLIDE 75

III. Into the weeds: a mathematical presentation of SL

SLIDES 76–77

Review

Recall the construction of SL for a continuous outcome:

  • 1. Define a library of candidate algorithms µ̂1, . . . , µ̂K.
  • 2. Obtain the CV-predictions µ̂k,v(Xi) for all k, v and i ∈ Vv.
  • 3. Use constrained optimization to compute the SL weights λ̂ := arg min_{λ∈S_K} MSE_CV(Σ_{k=1}^K λk µ̂k).
  • 4. Take µ̂_SL = Σ_{k=1}^K λ̂k µ̂k.

SLIDE 78

In this section, we generalize this procedure to estimation of any summary of the observed data distribution, given an appropriate loss for the summary of interest.

SLIDES 79–85

Loss and risk: setup

  • Denote by O the observed data unit – e.g. O = (Y, X).
  • Denote by O the sample space of O.
  • Let M denote our statistical model.
  • Denote by P0 ∈ M the true distribution of O.
  • Thus, we observe i.i.d. copies O1, . . . , On ∼ P0.
  • Suppose we want to estimate a parameter θ : M → Θ.
  • Denote by θ0 := θ(P0) the true parameter value.

SLIDES 86–89

Loss and risk

  • Let L be a map from O × Θ to R.
  • We call L a loss function for θ if it holds that θ0 = arg min_{θ∈Θ} E_{P0}[L(O, θ)].
  • R0(θ) = E_{P0}[L(O, θ)] is called the oracle risk.
  • These definitions of loss and risk come from the statistical learning literature (see, e.g. Vapnik, 1992, 1999, 2013) and are not to be confused with loss and risk from the decision theory literature (e.g. Ferguson, 2014).

SLIDES 90–94

Loss and risk: MSE example

MSE is the oracle risk corresponding to the squared-error loss function:

  • O = (Y, X).
  • θ(P) = µ(P) = {x → E_P[Y | X = x]}
  • L(O, µ) = [Y − µ(X)]² is the squared-error loss.
  • R0(µ) = MSE(µ) = E_{P0}[Y − µ(X)]².

SLIDES 95–99

Estimating the oracle risk

θ0 = arg min_{θ∈Θ} R0(θ),  where R0(θ) = E_{P0}[L(O, θ)]

  • Suppose that θ̂1, . . . , θ̂K are candidate estimators.
  • As before, we need to estimate R0(θ) to evaluate each θ̂k.
  • The naive estimator is R(θ̂k) = (1/n) Σ_{i=1}^n L(Oi, θ̂k).
  • We instead estimate R0(θ) using the cross-validated risk
    R_CV(θ̂k) = (1/V) Σ_{v=1}^V (1/|Vv|) Σ_{i∈Vv} L(Oi, θ̂k,v).
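Computationally, the only change from the regression case is that the squared residual is replaced by an arbitrary loss L. A generic sketch in Python, assuming (these names are illustrative) each candidate is a `fit` function over a list of data units and `loss(o, theta_hat)` evaluates L(O, θ):

```python
import numpy as np

def cv_risk(fit, loss, O, n_folds=5):
    """Generic V-fold cross-validated risk R_CV(theta_hat).

    `O` is a list of observed data units O_i; `fit(list_of_O)` returns a
    fitted parameter estimate; `loss(o, theta_hat)` returns L(o, theta).
    """
    idx = np.arange(len(O))
    fold_risks = []
    for val in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, val)
        theta_hat = fit([O[i] for i in train])                 # fit on training folds
        fold_risks.append(np.mean([loss(O[i], theta_hat) for i in val]))
    return float(np.mean(fold_risks))                          # average over the V folds

# Example: theta(P) = E_P[O] with squared-error loss recovers CV-MSE of the mean.
rng = np.random.default_rng(2)
O = list(rng.normal(loc=3.0, size=100))
risk = cv_risk(fit=lambda o: float(np.mean(o)),
               loss=lambda o, t: (o - t) ** 2,
               O=O)
```

Swapping in a different `loss` (negative log-likelihood, a weighted loss, etc.) requires no change to the cross-validation machinery.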

SLIDES 100–104

Super Learner: general steps

Using this framework, we can generalize the SL recipe:

  • 1. Define a library of candidate algorithms θ̂1, . . . , θ̂K.
  • 2. Obtain the CV-risks R_CV(θ̂k), k = 1, . . . , K.
  • 3. Use constrained optimization to compute the SL weights λ̂ := arg min_{λ∈S_K} R_CV(Σ_{k=1}^K λk θ̂k).
  • 4. Take θ̂_SL = Σ_{k=1}^K λ̂k θ̂k.

SLIDE 105

Theoretical guarantees

van der Vaart et al. (2006) showed that, under some conditions, the oracle risk of the SL estimator is as good as the oracle risk of the oracle minimizer up to a multiple of (log n)/n, as long as the number of candidate algorithms is polynomial in n.

SLIDES 106–111

Loss functions for a binary outcome

We return to O = (Y, X), θ = µ.

  • For continuous Y, we used squared-error loss.
  • For binary Y, squared-error loss is still valid.
  • However, there are (at least) two other alternative loss functions for a binary outcome:
    – Negative log-likelihood loss: L(O, µ) = −Y log µ(X) − [1 − Y] log[1 − µ(X)].
    – AUC loss.
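The negative log-likelihood loss is straightforward to evaluate. A sketch (the clamp keeping the logarithms finite is a standard numerical guard, not part of the slide's definition):

```python
import numpy as np

def nll_loss(y, mu_x, eps=1e-12):
    """Negative log-likelihood loss: L(O, mu) = -Y log mu(X) - (1 - Y) log(1 - mu(X))."""
    p = np.clip(mu_x, eps, 1 - eps)   # numerical guard only; not part of the definition
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.5, 0.5])   # predicted probabilities mu(X_i)
losses = nll_loss(y, p)
```

A confident, correct prediction (0.9 for a case) incurs loss −log 0.9 ≈ 0.105, while a noncommittal 0.5 incurs log 2 ≈ 0.693.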

SLIDE 112

IV. Lab 2: Vanilla SL for a binary outcome

SLIDE 113

15 minute break

SLIDE 114

V. Bells and whistles: Screens, weights, and CV-SL

SLIDE 115

Overview

In this section, we will introduce three of the add-ons to SL that are frequently useful in practice: variable screens, observation weights, and cross-validated SL.

SLIDES 116–121

Variable screens

  • We think of a candidate algorithm as a two-step procedure:
    1. Select a subset of the covariates.
    2. Use the selected subset to fit a model.
  • We call step 1 a screening procedure.
  • While we could program steps 1 and 2 by hand into each candidate algorithm, the SuperLearner package has built-in functionality to ease this process.
  • Screening algorithms allow us to guide the SL using our domain knowledge.

SLIDES 122–125

Example use-cases of screening

  • If we have a high-dimensional set of covariates, we can try different ways of reducing the dimensionality.
  • If we have a large number of “raw” measurements, we might try providing a smaller number of summary measures – e.g. mean, median, min, max.
  • If we have measurements collected at multiple time points, we might try providing just baseline, or just the last time point, or some summaries of the trajectory.
  • We can force certain variables to always be used.

SLIDES 126–128

Observation weights

  • In some applications, we need to include observation weights in the procedure – e.g. case-control sampling, or as a simple way to account for loss-to-followup.
  • Observation weights can be included directly in a call to SuperLearner, but method.AUC does not make correct use of weights!
  • Note that some SuperLearner wrappers might not make use of observation weights.

SLIDES 129–132

Case-control weights

  • Let Y represent disease status at the end of a study.
  • Suppose specimens from all ncase cases (Yi = 1) are assayed.
  • A random subset of Ncontrol controls (Yi = 0) (out of ncontrol total controls) are assayed.
  • We will use this case-control cohort to predict disease status using the results of the assay and other covariates.

SLIDES 133–136

Case-control weights

  • We can use SL with observation weights.
  • Cases have weight wi = 1.
  • Controls have weight wi = ncontrol/Ncontrol.
  • Control weights could also be estimated using a logistic regression of the indicator of inclusion in the control cohort on baseline covariates.
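The weighting scheme above can be written down directly. A sketch with hypothetical study sizes (50 cases, 1000 controls of which 100 were assayed); the weighted analogue of the MSE simply multiplies each squared residual by w_i:

```python
import numpy as np

def case_control_weights(y, n_control_total, n_control_sampled):
    """w_i = 1 for cases (y_i = 1); w_i = ncontrol / Ncontrol for sampled controls."""
    return np.where(y == 1, 1.0, n_control_total / n_control_sampled)

def weighted_mse(y, pred, w):
    """Weighted squared-error risk: each residual contributes proportionally to w_i."""
    return float(np.sum(w * (y - pred) ** 2) / np.sum(w))

# Hypothetical case-control cohort: the 50 cases plus the 100 assayed controls.
y = np.concatenate([np.ones(50), np.zeros(100)])
w = case_control_weights(y, n_control_total=1000, n_control_sampled=100)

# The weighted prevalence recovers the full-cohort prevalence 50/1050,
# undoing the oversampling of cases in the assayed subset.
p_hat = float(np.sum(w * y) / np.sum(w))
wm = weighted_mse(y, np.full(150, p_hat), w)
```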

SLIDES 137–139

Right-censored outcomes

  • Suppose Y = I(T ≤ t0) indicates that disease occurs before time t0.
  • T is subject to right-censoring by C: we observe Ỹ = min{T, C} and ∆ = I(T ≤ C).
  • We want to estimate µ(x) = P(T ≤ t0 | X = x) = E[Y | X = x].


slide-143
SLIDE 143

Right-censored outcomes

µ0 = argmin_µ EP0[ ∆/G0(Y | X) · L((Y, X), µ) ]

  • Here, G0(t | x) = P0(C > t | X = x).
  • L is either the squared-error or the negative log-likelihood loss.
  • If we knew G0, we could use SL with weight ∆/G0(Y | X).
  • Instead, we estimate G0 and plug in this estimator to obtain an estimated weight.
  • If C ⊥⊥ T, we can use a Kaplan-Meier estimator for G0; otherwise we might use a Cox model.

38 / 48
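Under the C ⊥⊥ T assumption, the Kaplan-Meier based weights can be sketched as follows (Python rather than R; this ignores the left-limit technicality for events tied with censoring times, and all names are illustrative):

```python
from collections import Counter

def km_censoring_survival(times, deltas):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating censoring
    (delta == 0) as the 'event'; valid when C is independent of T."""
    cens_at = Counter(t for t, d in zip(times, deltas) if d == 0)
    obs_at = Counter(times)
    steps, g, at_risk = [], 1.0, len(times)
    for t in sorted(obs_at):
        g *= 1.0 - cens_at.get(t, 0) / at_risk
        at_risk -= obs_at[t]
        steps.append((t, g))

    def G(t):
        # Step function: last value at or before t, else 1.
        g_val = 1.0
        for s, gs in steps:
            if s <= t:
                g_val = gs
            else:
                break
        return g_val

    return G

def ipcw_weights(times, deltas):
    """Inverse-probability-of-censoring weights delta / G(Y):
    censored subjects get weight 0, events get 1 / G(observed time)."""
    G = km_censoring_survival(times, deltas)
    return [d / G(t) if d == 1 else 0.0 for t, d in zip(times, deltas)]

# Toy data: four subjects, the second censored at time 2.
w = ipcw_weights([1, 2, 3, 4], [1, 0, 1, 1])  # ≈ [1.0, 0.0, 1.5, 1.5]
```

The later events get upweighted to compensate for the subjects the censoring removed from the risk set.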

slide-147
SLIDE 147

CV-Super Learner

  • The standard SL framework gives us CV risks for each candidate algorithm.
  • However, the SL and discrete SL are obtained using all the data, so their estimated risks will be optimistic.
  • We can rectify this using a second layer of cross-validation.

39 / 48


slide-152
SLIDE 152

CV-Super Learner

  • 1. Split the data into V1 folds.
  • 2. For v = 1, . . . , V1:
       a. Run regular SL on the training set for fold v using V2-fold CV.
       b. Obtain discrete SL and SL predictions for the validation set for fold v.
  • 3. Combine the validation sets to obtain CV risks for the discrete SL and SL.

40 / 48
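This double cross-validation can be sketched generically (Python rather than R; `run_super_learner` is a placeholder standing in for the entire inner V2-fold SL fit):

```python
def cv_super_learner(data, run_super_learner, V1=5):
    """Outer cross-validation layer for the Super Learner itself.

    run_super_learner(train, valid) must fit the SL on `train`
    (internally using V2-fold CV) and return one prediction per
    element of `valid`. Returns validation indices and predictions
    pooled across the V1 folds, from which a CV risk can be computed.
    """
    n = len(data)
    folds = [list(range(v, n, V1)) for v in range(V1)]  # simple systematic split
    pooled_idx, pooled_pred = [], []
    for v in range(V1):
        valid_idx = set(folds[v])
        train = [data[i] for i in range(n) if i not in valid_idx]
        valid = [data[i] for i in folds[v]]
        pooled_idx.extend(folds[v])
        pooled_pred.extend(run_super_learner(train, valid))
    return pooled_idx, pooled_pred

# Toy stand-in for the SL: predict the training-set mean outcome.
data = [(x, float(x % 2)) for x in range(10)]  # (covariate, outcome) pairs
mean_sl = lambda train, valid: [sum(y for _, y in train) / len(train)] * len(valid)
idx, preds = cv_super_learner(data, mean_sl, V1=5)  # 10 held-out predictions
```

In the SuperLearner R package, the CV.SuperLearner function carries out this outer layer.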

slide-153
SLIDE 153
  • VI. Lab 3: Binary outcome redux

40 / 48

slide-154
SLIDE 154
  • VII. Lab 4: Case-control analysis of Fluzone vaccine

40 / 48


slide-158
SLIDE 158

FLUVACS trial

  • Healthy adults aged 18–49 years, Michigan, 2007–2008.
  • Randomly assigned to:
       – Fluzone: inactivated influenza vaccine (IIV)
       – FluMist: live-attenuated influenza vaccine (LAIV)
       – placebo
  • We are only interested in Fluzone vs. placebo.
  • Followed for one flu season.
  • Endpoint = laboratory-confirmed influenza.

41 / 48

slide-159
SLIDE 159

FLUVACS trial

42 / 48


slide-164
SLIDE 164

FLUVACS trial

  • All 52 cases and 52 random controls were assayed for a variety of markers (HAI, NAI, MN, AM titers; protein/virus/peptide magnitude and breadth).
  • Measured variables:
       – Demographics: age, vaccinated in last year (EVERVAX)
       – Day 0 markers
       – Day 30 markers
       – Difference markers = Day 30 markers − Day 0 markers

43 / 48


slide-173
SLIDE 173

Variable sets

  • 1. Demo.
  • 2. Demo. + Day 0 markers
  • 3. Demo. + Day 30 markers
  • 4. Demo. + Difference markers
  • 5. Demo. + Day 0 markers + EVERVAX × Day 0 markers
  • 6. Demo. + Day 30 markers + EVERVAX × Day 30 markers
  • 7. Demo. + Diff. markers + EVERVAX × Diff. markers
  • 8. Demo. + Day 0 + Day 30 + EVERVAX × (Day 0 + Day 30)
  • 9. Demo. + Day 0 + Diff. + EVERVAX × (Day 0 + Diff.)

44 / 48


slide-176
SLIDE 176

Analysis goals

  • We want to compare the quality of these nine sets of variables for predicting flu status in the placebo and Fluzone arms separately.
  • We also want to compare the predictive quality of IgA, IgG, and both IgA + IgG measurements.
  • We will use cross-validated Super Learning to do this.

45 / 48
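Since the comparison metric here is cross-validated AUC, a stdlib-only sketch of the empirical AUC that would be applied to the pooled CV predictions (Python rather than R, where packages such as cvAUC are the usual route):

```python
def auc(labels, scores):
    """Empirical AUC: the fraction of (case, control) pairs in which the
    case receives the higher score, counting ties as 1/2 (equivalent to
    the Mann-Whitney U statistic)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Two cases and two controls; one discordant pair gives AUC = 3/4.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # → 0.75
```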

slide-177
SLIDE 177
[Figure: cross-validated AUC by learner (SL, Discrete SL, SL.glm, SL.bayesglm, SL.glmnet, SL.earth, SL.gam, SL.xgboost, SL.ranger, SL.mean) for each variable set (Baseline, Day 0, Day 30, Day 30 − Day 0, EV × Day 0, EV × Day 30, EV × (Day 30 − Day 0), EV × (Day 0, Day 30), EV × (Day 0, Diff)); color indicates screen (All, screen.marginal.05, screen.marginal.10) and shape indicates marker set (Both, IgA, IgG, Neither). AUC axis ranges from 0.4 to 1.0.]

46 / 48

slide-178
SLIDE 178
[Figure: cross-validated AUC for the SL, the Discrete SL, and the best single learner in each variable-set panel (e.g., SL.xgboost, SL.bayesglm, SL.glm, SL.gam); color indicates screen (All, screen.marginal.05, screen.marginal.10) and shape indicates marker set (Both, IgA, IgG, Neither). AUC axis ranges from 0.2 to 1.0.]

47 / 48

slide-179
SLIDE 179
[Figure: cross-validated AUC by learner (SL, Discrete SL, SL.glm, SL.bayesglm, SL.glmnet, SL.earth, SL.gam, SL.xgboost, SL.ranger, SL.mean) for each variable set; color indicates screen (All, screen.marginal.05, screen.marginal.10) and shape indicates marker set (Both, IgA, IgG, Neither). AUC axis ranges from 0.4 to 1.0.]

48 / 48

slide-180
SLIDE 180
[Figure: cross-validated AUC for the SL, the Discrete SL, and the best single learner per variable set (mostly SL.bayesglm, plus SL.ranger); shape indicates marker set (Both, IgA, IgG, Neither). AUC axis ranges from 0.2 to 0.8.]

49 / 48
