

SLIDE 1

Random Forest

Applied Multivariate Statistics – Spring 2012

SLIDE 2

Overview

  • Intuition of Random Forest
  • The Random Forest Algorithm
  • De-correlation gives better accuracy
  • Out-of-bag error (OOB-error)
  • Variable importance


SLIDE 3

Intuition of Random Forest

[Figure: three decision trees (Tree 1, Tree 2, Tree 3), splitting on age (young/old), height (short/tall), sex (female/male), and employment (working/retired); leaves are labeled healthy or diseased.]

New sample: old, retired, male, short

Tree predictions: diseased, healthy, diseased

Majority rule: diseased

SLIDE 4

The Random Forest Algorithm


SLIDE 5

Differences to standard tree

  • Train each tree on a bootstrap resample of the data

(Bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)

  • For each split, consider only m randomly selected variables
  • Don’t prune
  • Fit B trees in this way and aggregate their results by averaging or majority voting (see the sketch below)
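These choices map directly onto arguments of the randomForest function named at the end of this deck. A minimal sketch, assuming the built-in iris data purely for illustration:

```r
library(randomForest)

set.seed(1)
# B = 500 unpruned trees, each grown on a bootstrap resample;
# m = 2 variables are considered at each split
fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
print(fit)  # majority-vote aggregate, with OOB error estimate
```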


SLIDE 6

Why Random Forest works 1/2

  • Mean squared error = variance + bias²
  • If trees are sufficiently deep, they have very small bias
  • How could we improve the variance over that of a single tree?
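Written out for an estimator $\hat f(x)$ of a target $f(x)$, this is the standard identity (added here for completeness):

$$
\mathbb{E}\big[(\hat f(x) - f(x))^2\big]
= \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}\hat f(x))^2\big]}_{\text{variance}}
+ \underbrace{\big(\mathbb{E}\hat f(x) - f(x)\big)^2}_{\text{bias}^2}
$$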


SLIDE 7

Why Random Forest works 2/2

For B trees that each have variance σ² and pairwise correlation ρ, the variance of the forest average is

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

  • The second term decreases if the number of trees B increases (irrespective of ρ)
  • The first term decreases if ρ decreases, i.e., if m decreases

De-correlation gives better accuracy
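A quick numerical check of this formula; the values of ρ, σ², and B are arbitrary, chosen only to show the two effects:

```r
# Variance of the average of B trees with per-tree variance sigma2
# and pairwise correlation rho
var_forest <- function(rho, sigma2, B) rho * sigma2 + (1 - rho) / B * sigma2

var_forest(rho = 0.5, sigma2 = 1, B = 1)    # 1.000: a single tree
var_forest(rho = 0.5, sigma2 = 1, B = 500)  # 0.501: floor at rho * sigma2
var_forest(rho = 0.1, sigma2 = 1, B = 500)  # 0.102: smaller rho, lower floor
```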

SLIDE 8

Estimating generalization error: out-of-bag (OOB) error

  • Similar to leave-one-out cross-validation, but with almost no additional computational burden
  • The OOB error is a random quantity, since it is based on random resamples of the data


Data:

  • old, tall – healthy
  • old, short – diseased
  • young, tall – healthy
  • young, short – diseased
  • young, short – healthy
  • young, tall – healthy
  • old, short – diseased

Resampled data (drawn with replacement):

  • old, tall – healthy
  • old, short – diseased
  • young, tall – healthy
  • young, tall – healthy

[Figure: the tree grown on the resampled data, splitting on age (young/old) and height (short/tall), with leaves labeled healthy or diseased.]

Out-of-bag samples (never drawn into the resample):

  • young, short – diseased
  • young, short – healthy
  • young, tall – healthy
  • old, short – diseased

Out-of-bag (OOB) error rate: the tree misclassifies one of the four OOB samples, so 1/4 = 0.25.
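In R, the OOB estimate comes for free with a fitted forest. A minimal sketch, again assuming iris only for illustration:

```r
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)

# OOB error after all trees; column "OOB" is the overall rate,
# the remaining columns are per-class error rates
fit$err.rate[fit$ntree, ]
```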

SLIDE 9

Variable Importance for variable i using Permutations


[Diagram: the data is resampled m times; tree t is grown on resampled dataset t and evaluated on its out-of-bag data, giving OOB error e_t. Permuting the values of variable i in that OOB data and re-evaluating gives OOB error p_t.]

For trees t = 1, …, m:

$$
d_t = e_t - p_t, \qquad
\bar d = \frac{1}{m}\sum_{t=1}^{m} d_t, \qquad
s_d^2 = \frac{1}{m-1}\sum_{t=1}^{m} \left(d_t - \bar d\right)^2, \qquad
v_i = \frac{\bar d}{s_d}
$$
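The randomForest package implements a permutation importance of this kind: importance(..., type = 1) reports, per variable, the mean decrease in accuracy under permutation (scaled by its standard error), and varImpPlot plots it. A minimal sketch, with iris assumed for illustration:

```r
library(randomForest)

set.seed(1)
# importance = TRUE makes the fit compute permutation importance
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

importance(fit, type = 1)  # permutation-based importance per variable
varImpPlot(fit)            # plots the importance measures
```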

SLIDE 10

Trees vs. Random Forest

Trees:

  + Yield insight into decision rules
  + Rather fast
  + Easy to tune parameters
  – Predictions tend to have a high variance

Random Forest:

  + Has smaller prediction variance and therefore usually better overall performance
  + Easy to tune parameters
  – Rather slow
  – “Black box”: rather hard to get insight into the decision rules

SLIDE 11

Comparing runtime (just for illustration)


  • Up to “thousands” of variables
  • Problematic if there are categorical predictors with many levels (max: 32 levels)

[Figure: runtime of RF vs. a single tree; for the RF curve, the first predictor was cut into 15 levels.]

SLIDE 12

RF vs. LDA

LDA:

  + Very fast
  + Discriminants for visualizing group separation
  + Can read off the decision rule
  – Can model only linear class boundaries
  – Mediocre performance
  – No variable selection
  – Works only on categorical responses
  – Needs CV for estimating the prediction error

Random Forest:

  + Can model nonlinear class boundaries
  + OOB error “for free” (no CV needed)
  + Works on continuous and categorical responses (regression / classification)
  + Gives variable importance
  + Very good performance
  – “Black box”
  – Slow

SLIDE 13

Concepts to know

  • Idea of Random Forest and how it reduces the prediction variance of trees

  • OOB error
  • Variable Importance based on Permutation


SLIDE 14

R functions to know

  • Functions “randomForest” and “varImpPlot” from the package “randomForest”
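Putting the two named functions together with predict; the odd/even split of iris below is just an assumption for illustration:

```r
library(randomForest)

set.seed(1)
train <- iris[seq(1, nrow(iris), by = 2), ]  # odd rows for training
test  <- iris[seq(2, nrow(iris), by = 2), ]  # even rows held out

fit <- randomForest(Species ~ ., data = train, importance = TRUE)
varImpPlot(fit)                              # permutation importance plot

pred <- predict(fit, newdata = test)         # majority vote over the trees
mean(pred != test$Species)                   # held-out error rate
```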
