Prediction, Estimation, and Attribution
Bradley Efron
brad@stat.stanford.edu Department of Statistics Stanford University
Regression: Gauss (1809), Galton (1877)
Prediction: random forests, boosting, support vector machines, neural nets, ...
Bradley Efron, Stanford University Prediction, Estimation, and Attribution 2 36
Normal Linear Regression
Linear predictor: $x_i^\top \beta$
Matrix form: $y_n = X_{n \times p}\, \beta_p + \epsilon_n$
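For orientation, the least-squares estimate for this model can be written in closed form. This is the standard textbook result rather than something recovered from the slide, and it assumes that $X$ has full column rank and that the errors are independent $N(0, \sigma^2)$; the symbol $\sigma^2$ is introduced here for illustration only.

```latex
\hat\beta = (X^\top X)^{-1} X^\top y,
\qquad
\hat\beta \sim N_p\!\left(\beta,\; \sigma^2 (X^\top X)^{-1}\right)
```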
[Figure: Newton's 2nd law: acceleration = force/mass. Axes: mass, force, acceleration.]
[Figure: If Newton had done the experiment. Axes: mass, force, acceleration.]
The Cholesterol Data
$y_i = x_i^\top \beta + \epsilon_i$, where $x_i = (1, c_i, c_i^2, c_i^3)$
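A minimal sketch of fitting a cubic model of this form by ordinary least squares, using only the Python standard library. The data are synthetic stand-ins (the cholesterol data are not reproduced in these slides), and the helper names `cubic_features`, `solve_linear_system`, `fit_ols`, and `fit_summary` are hypothetical. The summary returns a residual standard error and an adjusted R-squared, the kind of quantities usually quoted for such a fit.

```python
import random

def cubic_features(c):
    """One row of the design matrix: x = (1, c, c^2, c^3)."""
    return [1.0, c, c * c, c ** 3]

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least squares via the normal equations (X^T X) beta = X^T y."""
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)

def fit_summary(rows, y, beta):
    """Return (sigma_hat, adjusted R^2) for a fit that includes an intercept."""
    n, p = len(rows), len(beta)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)
    sigma_hat = (rss / (n - p)) ** 0.5
    adj_r2 = 1.0 - (rss / (n - p)) / (tss / (n - 1))
    return sigma_hat, adj_r2

# Synthetic stand-in data (an assumption; NOT the cholesterol data).
rng = random.Random(0)
true_beta = [10.0, 8.0, 2.0, -1.0]   # hypothetical coefficients
c_values = [rng.uniform(-2.0, 2.0) for _ in range(200)]
rows = [cubic_features(c) for c in c_values]
y = [sum(b * v for b, v in zip(true_beta, r)) + rng.gauss(0.0, 5.0) for r in rows]

beta_hat = fit_ols(rows, y)
sigma_hat, adj_r2 = fit_summary(rows, y, beta_hat)
```

The normal equations are solved directly here to keep the sketch dependency-free; a numerical library would normally use a QR or SVD-based solver instead.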
[Figure: OLS cubic regression: cholesterol decrease vs. normalized compliance; bars show 95% confidence intervals for the curve. sigma-hat = 21.9; only the intercept and linear coefficients are significant; adjusted R-squared = .481. Axes: normalized compliance (x), cholesterol decrease (y).]
800 observations; $X$ is $800 \times 11$; binomial model
predictive error 15%
Random Forests, Boosting, Deep Learning, . . .
[Figure: Classification tree: 800 neonates, 200 died; leaf counts are shown as lived/died. Splits: cpap < 0.6654; gest >= −1.672; gest >= −1.941; ap >= −1.343; resp < 1.21. Leaf counts: 544/73, 3/11, 39/29, 13/32, 5/22, 1/40. One leaf is labeled "worst bin".]
Breiman (2001)
for Prostate Cancer Prediction
[Figure: Prostate cancer prediction using random forests. Black is cross-validated training error, red is test error rate. Axes: number of trees (x), error (y). Train error 5.9%, test error 2.0%.]
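To make the idea of an error curve plotted against the number of trees concrete, here is a minimal sketch of bagging with an out-of-bag (OOB) error estimate. It is a deliberately simplified stand-in and not the randomForest algorithm: it uses one-feature decision stumps instead of full trees, does no random feature selection, and runs on synthetic data. Whether the "cross-validated training error" in the figure was computed out-of-bag is an assumption. The names `fit_stump`, `predict_stump`, and `bagged_oob_error` are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Best single-threshold classifier on one feature: (threshold, label_if_above)."""
    best = None
    for t in sorted(set(xs)):
        for above in (0, 1):
            errors = sum((above if x >= t else 1 - above) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, above)
    return best[1], best[2]

def predict_stump(stump, x):
    """Predict 0 or 1 for a single feature value."""
    t, above = stump
    return above if x >= t else 1 - above

def bagged_oob_error(xs, ys, n_trees, seed=0):
    """OOB error after each added stump: every case is classified by majority
    vote of the stumps whose bootstrap sample did not contain that case."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[0, 0] for _ in range(n)]   # per-case OOB votes for class 0 / class 1
    curve = []
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of indices
        stump = fit_stump([xs[i] for i in bag], [ys[i] for i in bag])
        in_bag = set(bag)
        for i in range(n):
            if i not in in_bag:
                votes[i][predict_stump(stump, xs[i])] += 1
        scored = [i for i in range(n) if sum(votes[i]) > 0]
        wrong = sum((votes[i][1] > votes[i][0]) != bool(ys[i]) for i in scored)
        curve.append(wrong / len(scored) if scored else float("nan"))
    return curve

# Synthetic two-class data (an assumption; NOT the prostate cancer data).
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)] + [rng.gauss(2.0, 1.0) for _ in range(50)]
ys = [0] * 50 + [1] * 50

oob_curve = bagged_oob_error(xs, ys, n_trees=50)
```

Plotting `oob_curve` against the stump index gives a curve of the same general kind as the black line in the figure, though the numbers themselves are not comparable.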
[Figure: Now with boosting algorithm 'gbm'. Axes: # trees (x), error rate (y). Error rates: train 0%, test 4%.]
[Figure: Accuracy vs. epoch for training and validation data; # parameters = 780,738. Axes: epoch (x), accuracy (y).]
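[Equations on this slide did not survive extraction; the remaining fragments mention independent ("ind") observations and a sample size n.]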
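[Equations on this slide did not survive extraction; the remaining fragments mention independent ("ind") observations, a definition by cases, and a power of 1/2.]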
[Figure: Importance measures for genes in randomForest prostate analysis; the top two genes are #1022 and #5569. Axes: index (x), importance (y).]
[Figure: Random forests: train on the 50 earliest subjects, test on the 50 latest; test error was 2%, now 24%. Axes: number of trees (x), error (y). Train error 0%, test error 24% (before: 2%).]
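The contrast between a random train/test split and an earliest/latest split can be sketched as follows. This is a minimal illustration under assumptions: the subject records are hypothetical (an identifier plus an enrollment order), and the function names `random_split` and `chronological_split` are invented for the example. Only the splitting step is shown, not the model fitting.

```python
import random

def random_split(subjects, n_test, seed=0):
    """Shuffle, then hold out n_test subjects, so train and test are exchangeable."""
    if not 0 < n_test < len(subjects):
        raise ValueError("n_test must be between 1 and len(subjects) - 1")
    shuffled = subjects[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

def chronological_split(subjects, n_test, time_key):
    """Train on the earliest subjects and test on the latest n_test subjects."""
    if not 0 < n_test < len(subjects):
        raise ValueError("n_test must be between 1 and len(subjects) - 1")
    ordered = sorted(subjects, key=time_key)
    return ordered[:-n_test], ordered[-n_test:]

# Hypothetical records: (subject_id, enrollment_order).
subjects = [(f"s{i:03d}", i) for i in range(100)]

train_r, test_r = random_split(subjects, n_test=50)
train_c, test_c = chronological_split(subjects, n_test=50, time_key=lambda s: s[1])
```

With the random split, any drift over time is shared between training and test sets; with the chronological split, the test subjects all come after the training subjects, which is the situation the figure describes.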
[Figure: Same thing for boosting (gbm): test error now 29%, was 4%. Axes: # trees (x), error rate (y). Error rates: train 0%, test 29%.]
[Figure: Cholesterol data: randomForest estimate (X = poly(c, 8)), 500 trees, compared with the cubic regression curve. Axes: compliance (x), cholesterol decrease (y). Adjusted R-squared: cubic .482, random forest .404.]
[Figure: Now using boosting algorithm gbm. Green dashed curve: 8th-degree polynomial fit, adjusted R-squared = .474. Axes: adjusted compliance (x), cholesterol reduction (y). Cubic adjusted R-squared .482; gbm cross-validated R-squared .461.]
1 Surface plus noise | Direct prediction
2 Scientific truth (eternal or at least long-lasting) | Empirical prediction efficiency (could be ephemeral, e.g., commerce)
3 $X_{n \times p}$: p < n (p moderate) | p > n (both possibly huge, "n = all")
4 X chosen parsimoniously (main effects ≫ interactions) | Anti-parsimony (algorithms expand X)
5 Parametric modeling (condition on x's; smoothness) | Mostly nonparametric ((x, y) pairs iid)
6 Homogeneous data (RCT) | Very large heterogeneous data sets
7 Theory of optimal estimation (MLE) | Training and test sets (CTF, asymptotics)
in the Wide-Data Era
for the Prostate Cancer Study
$n \times p$:
[Figure: fdr(z) and E{effect size | z}, prostate data. Triangles: red marks the 29 genes with fdr < .2; green marks the first 29 glmnet genes. At z = 4: fdr = .22 and E{del | z} = 2.3. Axes: z value (x), E{del | z} (y). Curves: 4*fdr and E{del | z}.]
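The two quantities in the figure, the local false discovery rate fdr(z) and the posterior expected effect size E{del | z}, can be illustrated with a small sketch. It assumes a two-group normal model: a gene is null (effect 0) with probability `PI0`, otherwise its effect is drawn from N(`MU1`, `TAU`^2), and z given the effect is N(effect, 1). All parameter values are hypothetical and are not estimates from the prostate data. The sketch also evaluates Tweedie's formula, E{del | z} = z + d/dz log f(z), numerically, which should agree with the closed form under this model.

```python
import math

def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of N(mean, sd^2) at z."""
    u = (z - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical two-group parameters (assumptions, not fitted values).
PI0 = 0.95                       # prior probability that a gene is null
MU1 = 2.5                        # mean effect size among non-null genes
TAU = 1.0                        # sd of effect size among non-null genes
S1 = math.sqrt(1.0 + TAU ** 2)   # marginal sd of z among non-null genes

def mixture_density(z):
    """Marginal density f(z) = PI0 * f0(z) + (1 - PI0) * f1(z)."""
    return PI0 * normal_pdf(z) + (1.0 - PI0) * normal_pdf(z, MU1, S1)

def local_fdr(z):
    """fdr(z) = PI0 * f0(z) / f(z): posterior probability of being null."""
    return PI0 * normal_pdf(z) / mixture_density(z)

def posterior_mean_effect(z):
    """E{effect | z}: zero for nulls, normal-normal shrinkage for non-nulls."""
    shrink = TAU ** 2 / (1.0 + TAU ** 2)
    nonnull_mean = MU1 + shrink * (z - MU1)
    return (1.0 - local_fdr(z)) * nonnull_mean

def tweedie_effect(z, h=1e-5):
    """Tweedie's formula, with the log-density derivative by central difference."""
    dlog = (math.log(mixture_density(z + h))
            - math.log(mixture_density(z - h))) / (2.0 * h)
    return z + dlog

for z in (0.0, 2.0, 4.0):
    print(f"z={z:.1f}  fdr={local_fdr(z):.3f}  "
          f"E[effect|z]={posterior_mean_effect(z):.3f}")
```

In practice the mixture density would be estimated from the observed z-values rather than specified in advance; the fixed parameters here only serve to make the two formulas checkable against each other.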
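[Equation only partly recoverable: a sum from 1 to $p$ of the absolute value of a hatted quantity, $\sum_{1}^{p} \left| \hat{\cdot} \right|$.]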