Random Forest
Applied Multivariate Statistics – Spring 2012
Overview
- Intuition of Random Forest
- The Random Forest Algorithm
- De-correlation gives better accuracy
- Out-of-bag error (OOB error)
- Variable importance
Intuition of Random Forest

[Figure: a single decision tree classifying samples as "healthy" or "diseased"]
[Figure: three decision trees (Tree 1, Tree 2, Tree 3), each grown on different data, splitting on features such as young, short/tall, female/male, and working/retired, with leaves predicting "healthy" or "diseased"]

New sample:
Tree predictions: diseased, healthy, diseased
Majority rule: diseased
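The majority rule above can be sketched in a few lines. This is an illustrative Python snippet (the deck itself uses R); `majority_vote` is a hypothetical helper name.

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate the class labels predicted by individual trees by majority rule."""
    return Counter(predictions).most_common(1)[0][0]

# Tree predictions for the new sample, as in the example above
print(majority_vote(["diseased", "healthy", "diseased"]))  # diseased
```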
The Random Forest Algorithm

1. Draw B bootstrap resamples of the data set. (Bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set.)
2. Grow a tree on each resample; at each split, only a random subset of m predictors is considered.
3. Use voting to aggregate the results of the B trees.
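The bootstrap resampling step can be sketched as follows. A minimal Python illustration (not the deck's R implementation); `bootstrap_resample` is a hypothetical helper.

```python
import random

def bootstrap_resample(data, rng):
    """Draw N samples with replacement from a data set of N samples."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)
data = ["s1", "s2", "s3", "s4"]
resample = bootstrap_resample(data, rng)
# Samples never drawn into the resample are "out of bag" for this tree
oob = [s for s in data if s not in resample]
print(resample, oob)
```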
De-correlation gives better accuracy

The variance of the average of B identically distributed trees, each with variance σ² and pairwise correlation ρ, is

\rho \sigma^2 + \frac{1-\rho}{B}\sigma^2

This variance:
- decreases if the number of trees B increases (irrespective of ρ)
- decreases if ρ decreases, i.e., if m decreases
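The two effects can be checked numerically. A small Python sketch of the variance formula for the average of B correlated trees; `rf_variance` is an illustrative name.

```python
def rf_variance(rho, sigma2, B):
    """Variance of the average of B identically distributed trees,
    each with variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

# Increasing B shrinks the second term, down to a floor of rho * sigma2
print(rf_variance(0.5, 1.0, 10), rf_variance(0.5, 1.0, 1000))
# Decreasing rho (e.g., via a smaller m) lowers that floor itself
print(rf_variance(0.1, 1.0, 1000))
```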
Estimating generalization error: Out-of-bag (OOB) error

Each tree is grown on a bootstrap resample of the data, so the samples left out of a resample ("out of bag") can be used to estimate that tree's prediction error. This estimates the generalization error without any additional computational burden.
[Figure: a tree grown on the resampled data, splitting on short/tall into "healthy" and "diseased" leaves]

Data:
- young, tall – healthy
- young, short – diseased
- young, short – healthy
- young, tall – healthy

Resampled data:
- young, tall – healthy
- young, tall – healthy

Out-of-bag samples:
- young, short – diseased
- young, short – healthy
- young, tall – healthy

Out-of-bag (OOB) error rate: 1/4 = 0.25
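The OOB error computation can be sketched as follows. An illustrative Python helper (hypothetical name `oob_error`); the labels below are made up to match the one-wrong-in-four outcome of the example, not taken from the slide's figure.

```python
def oob_error(true_labels, oob_predictions):
    """OOB error: fraction of samples whose OOB prediction is wrong.
    oob_predictions[i] is the label predicted for sample i by the trees
    for which sample i was out of bag (None if it never was out of bag)."""
    pairs = [(t, p) for t, p in zip(true_labels, oob_predictions) if p is not None]
    wrong = sum(t != p for t, p in pairs)
    return wrong / len(pairs)

# Hypothetical four-sample case where one OOB prediction is wrong
truth = ["healthy", "diseased", "healthy", "healthy"]
preds = ["healthy", "healthy", "healthy", "healthy"]
print(oob_error(truth, preds))  # 0.25
```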
Variable Importance for variable i using Permutations
[Diagram: the data is resampled m times; tree j is grown on resampled data set j and evaluated on OOB data set j, giving OOB error e_j; permuting the values of variable i in OOB data set j gives OOB error p_j]

For each tree j = 1, …, m, let d_j = e_j − p_j. Then

\bar{d} = \frac{1}{m} \sum_{j=1}^{m} d_j

s_d^2 = \frac{1}{m-1} \sum_{j=1}^{m} \left( d_j - \bar{d} \right)^2

The mean difference \bar{d} measures the importance of variable i, and s_d^2 is its sample variance across trees.
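The two quantities above can be computed directly from the per-tree errors. An illustrative Python sketch (hypothetical helper name `importance`), using the slide's sign convention d_j = e_j − p_j; the error values below are made-up numbers.

```python
def importance(e, p):
    """Given per-tree OOB errors e_j and permuted-OOB errors p_j,
    return (mean of d_j, sample variance of d_j) with d_j = e_j - p_j."""
    d = [ej - pj for ej, pj in zip(e, p)]
    m = len(d)
    d_bar = sum(d) / m
    s2 = sum((dj - d_bar) ** 2 for dj in d) / (m - 1)
    return d_bar, s2

# Hypothetical OOB errors before (e) and after (p) permuting variable i;
# permuting an important variable increases the error, so p_j > e_j here
e = [0.20, 0.25, 0.22]
p = [0.35, 0.40, 0.30]
print(importance(e, p))
```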
Trees:
+ Trees yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
− Tree predictions tend to have a high variance

Random Forest:
+ RF has smaller prediction variance and therefore usually a better general performance
+ Easy to tune parameters
− Hard to get insights into decision rules
Comparing runtime (just for illustration)

[Figure: runtime comparison of RF vs. a single tree; RF: first predictor cut into 15 levels]
LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Can read off decision rule
− Can only model linear class boundaries
− Needs cross-validation to estimate the prediction error

Random Forest:
+ Can model nonlinear class boundaries
+ OOB error "for free" (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
[Figure: illustration of the variance of trees]
In R, Random Forest is implemented in the package "randomForest".