SLIDE 1

Outline

Introduction
Construction
R functions
Variable importance
Tests for variable importance
Conditional importance
Summary
References

Why and how to use random forest variable importance measures (and how you shouldn’t)

Carolin Strobl (LMU München) and Achim Zeileis (WU Wien)

carolin.strobl@stat.uni-muenchen.de
useR! 2008, Dortmund


SLIDE 6

Introduction

Random forests

◮ have become increasingly popular in, e.g., genetics and the neurosciences [imagine a long list of references here]

◮ can deal with “small n large p”-problems, high-order interactions, and correlated predictor variables

◮ are used not only for prediction, but also to assess variable importance

SLIDE 7

(Small) random forest

[Figure: the individual classification trees of a small random forest; each tree splits repeatedly on the predictors Start, Number, and Age, with differing structures and terminal-node class proportions]


SLIDE 11

Construction of a random forest

◮ draw ntree bootstrap samples from the original sample

◮ fit a classification tree to each bootstrap sample ⇒ ntree trees

◮ this creates a diverse set of trees, because

◮ trees are unstable w.r.t. changes in the learning data ⇒ ntree different-looking trees (bagging)

◮ mtry splitting variables are randomly preselected in each split ⇒ ntree even more different-looking trees (random forest)
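The recipe above can be sketched in a few lines. The talk's examples are in R; here is a language-agnostic Python illustration in which decision stumps (single splits) stand in for full trees, so the code stays short. All function names are our own, not the randomForest or party API.

```python
import random
from collections import Counter

def stump_fit(X, y, mtry):
    """Best single split, searching only `mtry` randomly preselected features."""
    p = len(X[0])
    candidates = random.sample(range(p), mtry)      # random preselection per split
    best = None
    for j in candidates:
        for cut in sorted(set(row[j] for row in X)):
            left  = [yi for row, yi in zip(X, y) if row[j] <= cut]
            right = [yi for row, yi in zip(X, y) if row[j] >  cut]
            if not left or not right:
                continue
            # misclassifications if each side predicts its majority class
            err = (len(left)  - max(Counter(left).values())
                   + len(right) - max(Counter(right).values()))
            if best is None or err < best[0]:
                best = (err, j, cut,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    _, j, cut, pred_l, pred_r = best
    return lambda row: pred_l if row[j] <= cut else pred_r

def forest_fit(X, y, ntree, mtry):
    n, trees = len(X), []
    for _ in range(ntree):
        idx = [random.randrange(n) for _ in range(n)]    # bootstrap sample
        trees.append(stump_fit([X[i] for i in idx], [y[i] for i in idx], mtry))
    return trees

def forest_predict(trees, row):
    return Counter(t(row) for t in trees).most_common(1)[0][0]  # majority vote

random.seed(1)
X = [[i, random.random()] for i in range(20)]   # feature 1 is pure noise
y = [int(row[0] >= 10) for row in X]            # class depends on feature 0 only
trees = forest_fit(X, y, ntree=25, mtry=2)
```

Bagging alone already diversifies the trees via the bootstrap `idx`; the `random.sample` inside `stump_fit` adds the per-split feature preselection that turns bagging into a random forest.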

SLIDE 12

Random forests in R

◮ randomForest (pkg: randomForest)
◮ reference implementation based on CART trees (Breiman, 2001; Liaw and Wiener, 2008)
− for variables of different types: biased in favor of continuous variables and variables with many categories (Strobl, Boulesteix, Zeileis, and Hothorn, 2007)

◮ cforest (pkg: party)
◮ based on unbiased conditional inference trees (Hothorn, Hornik, and Zeileis, 2006)
+ for variables of different types: unbiased when subsampling, instead of bootstrap sampling, is used (Strobl, Boulesteix, Zeileis, and Hothorn, 2007)


SLIDE 14

Measuring variable importance

◮ Gini importance: mean Gini gain produced by Xj over all trees

◮ obj <- randomForest(..., importance=TRUE)
obj$importance, column MeanDecreaseGini
importance(obj, type=2)

− for variables of different types: biased in favor of continuous variables and variables with many categories

SLIDE 15

Measuring variable importance

◮ permutation importance: mean decrease in classification accuracy after permuting Xj, over all trees

◮ obj <- randomForest(..., importance=TRUE)
obj$importance, column MeanDecreaseAccuracy
importance(obj, type=1)

◮ obj <- cforest(...)
varimp(obj)

+ for variables of different types: unbiased only when subsampling is used, as in cforest(..., controls = cforest_unbiased())

SLIDE 16

The permutation importance

within each tree t:

VI^{(t)}(x_j) = \frac{\sum_{i \in \bar{B}^{(t)}} I\left(y_i = \hat{y}_i^{(t)}\right)}{|\bar{B}^{(t)}|} - \frac{\sum_{i \in \bar{B}^{(t)}} I\left(y_i = \hat{y}_{i,\pi_j}^{(t)}\right)}{|\bar{B}^{(t)}|}

where \bar{B}^{(t)} is the out-of-bag sample of tree t,

\hat{y}_i^{(t)} = f^{(t)}(x_i) = predicted class before permuting,

\hat{y}_{i,\pi_j}^{(t)} = f^{(t)}(x_{i,\pi_j}) = predicted class after permuting X_j, with

x_{i,\pi_j} = (x_{i,1}, \ldots, x_{i,j-1}, x_{\pi_j(i),j}, x_{i,j+1}, \ldots, x_{i,p})

Note: VI^{(t)}(x_j) = 0 by definition, if X_j is not in tree t.
SLIDE 17

The permutation importance

over all trees:

1. raw importance

VI(x_j) = \frac{\sum_{t=1}^{ntree} VI^{(t)}(x_j)}{ntree}

◮ obj <- randomForest(..., importance=TRUE)
importance(obj, type=1, scale=FALSE)

SLIDE 18

The permutation importance

over all trees:

2. scaled importance (z-score)

z_j = \frac{VI(x_j)}{\hat{\sigma}/\sqrt{ntree}}

◮ obj <- randomForest(..., importance=TRUE)
importance(obj, type=1, scale=TRUE) (the default)
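Both aggregation steps fit in a small sketch (plain Python rather than the R calls above; we take σ̂ to be the sample standard deviation of the per-tree importances, and the per-tree values are invented):

```python
from math import sqrt

def raw_importance(vi_per_tree):
    """VI(x_j): mean of the per-tree importances."""
    return sum(vi_per_tree) / len(vi_per_tree)

def z_score(vi_per_tree):
    """z_j = VI(x_j) / (sigma_hat / sqrt(ntree))."""
    ntree = len(vi_per_tree)
    vi = raw_importance(vi_per_tree)
    sigma = sqrt(sum((v - vi) ** 2 for v in vi_per_tree) / (ntree - 1))
    return vi / (sigma / sqrt(ntree))

vi_t = [0.1, 0.3, 0.2, 0.0, 0.4]   # invented per-tree importances VI^(t)(x_j)
vi = raw_importance(vi_t)          # raw importance: 0.2
z = z_score(vi_t)                  # scaled importance
```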


SLIDE 22

Tests for variable importance

for variable selection purposes

◮ Breiman and Cutler (2008): simple significance test based on normality of the z-score; randomForest, scale=TRUE + α-quantile of N(0,1)

◮ Díaz-Uriarte and Alvarez de Andrés (2006): backward elimination (throw out the least important variables until the out-of-bag prediction accuracy drops); varSelRF (pkg: varSelRF), depends on randomForest

◮ Díaz-Uriarte (2007) and Rodenburg et al. (2008): plots and significance test (randomly permute the response values to mimic the overall null hypothesis that none of the predictor variables is relevant = baseline)


SLIDE 25

Tests for variable importance

problems of these approaches:

◮ (at least) Breiman and Cutler (2008): strange statistical properties (Strobl and Zeileis, 2008)

◮ all: preference for correlated predictor variables (see also Nicodemus and Shugart, 2007; Archer and Kimes, 2008)

SLIDE 26

Breiman and Cutler’s test

under the null hypothesis of zero importance:

z_j \overset{as.}{\sim} N(0, 1)

if z_j exceeds the α-quantile of N(0,1) ⇒ reject the null hypothesis of zero importance for variable X_j

SLIDE 27

Raw importance

[Figure: mean raw importance as a function of relevance (0.0 to 0.4), in panels for ntree = 100, 200, 500 and sample sizes 100, 200, 500]

SLIDE 28

z-score and power

[Figure: power and z-score as functions of relevance (0.0 to 0.4), in panels for ntree = 100, 200, 500 and sample sizes 100, 200, 500]

SLIDE 29

Findings

z-score and power

◮ increase with ntree
◮ decrease with sample size

⇒ rather use the raw, unscaled permutation importance!
importance(obj, type=1, scale=FALSE)
varimp(obj)
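The first finding can be read directly off the z-score formula: with the raw importance and the per-tree spread held fixed, z_j = VI / (σ̂/√ntree) grows like √ntree, so apparent significance can be manufactured simply by growing more trees. A small numeric illustration (all numbers invented):

```python
from math import sqrt

vi, sigma = 0.05, 0.2    # raw importance and per-tree sd, held fixed
z = {ntree: vi / (sigma / sqrt(ntree)) for ntree in (100, 200, 500)}
# the test statistic inflates with ntree even though the importance does not
```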

SLIDE 30

What null hypothesis were we testing in the first place?

Obs | Y   | X_j            | Z
1   | y_1 | x_{\pi_j(1),j} | z_1
⋮   | ⋮   | ⋮              | ⋮
i   | y_i | x_{\pi_j(i),j} | z_i
⋮   | ⋮   | ⋮              | ⋮
n   | y_n | x_{\pi_j(n),j} | z_n

H_0: X_j \perp (Y, Z), i.e. X_j \perp Y \wedge X_j \perp Z

P(Y, X_j, Z) \overset{H_0}{=} P(Y, Z) \cdot P(X_j)
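The permutation scheme in the table can be made concrete: the unconditional permutation π_j breaks the association of X_j not only with Y but also with Z, which is why the null hypothesis being tested is the joint one. A Python illustration with invented data:

```python
import random

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den

rng = random.Random(7)
z = [float(i) for i in range(1000)]
xj = [zi + rng.gauss(0, 1) for zi in z]   # X_j strongly associated with Z
xp = rng.sample(xj, len(xj))              # marginal permutation pi_j
before, after = corr(xj, z), corr(xp, z)
# before is essentially 1, after is near 0: the X_j-Z association is destroyed
```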


SLIDE 32

What null hypothesis were we testing in the first place?

The current null hypothesis reflects independence of X_j from both Y and the remaining predictor variables Z ⇒ a high variable importance can result from a violation of either one!

SLIDE 33

Suggestion: Conditional permutation scheme

Obs | Y    | X_j                   | Z
1   | y_1  | x_{\pi_{j|Z=a}(1),j}  | z_1 = a
3   | y_3  | x_{\pi_{j|Z=a}(3),j}  | z_3 = a
27  | y_27 | x_{\pi_{j|Z=a}(27),j} | z_27 = a
6   | y_6  | x_{\pi_{j|Z=b}(6),j}  | z_6 = b
14  | y_14 | x_{\pi_{j|Z=b}(14),j} | z_14 = b
33  | y_33 | x_{\pi_{j|Z=b}(33),j} | z_33 = b
⋮   | ⋮    | ⋮                     | ⋮

H_0: X_j \perp Y \mid Z

P(Y, X_j \mid Z) \overset{H_0}{=} P(Y \mid Z) \cdot P(X_j \mid Z), or equivalently P(Y \mid X_j, Z) \overset{H_0}{=} P(Y \mid Z)
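In code, the conditional scheme permutes X_j only among observations sharing the same value of the conditioning variable(s) Z, so the X_j-Z association is preserved while the X_j-Y link is broken. A plain-Python sketch (the function name is our own, not the party internals):

```python
import random
from collections import defaultdict

def conditional_permute(xj, z, rng):
    """Return x_{pi_{j|Z}}: xj permuted within each level of z."""
    groups = defaultdict(list)
    for i, zi in enumerate(z):
        groups[zi].append(i)
    out = list(xj)
    for idx in groups.values():
        shuffled = rng.sample(idx, len(idx))   # permute indices within the group
        for i, s in zip(idx, shuffled):
            out[i] = xj[s]
    return out

rng = random.Random(42)
z  = ["a", "a", "a", "b", "b", "b"]
xj = [1, 2, 3, 10, 20, 30]
xp = conditional_permute(xj, z, rng)
# each group keeps its own values: {1, 2, 3} stays with z = a, {10, 20, 30} with z = b
```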


SLIDE 37

Technically

◮ use any partition of the feature space for conditioning
◮ here: use the binary partition already learned by the tree (use the cutpoints as bisectors of the feature space)
◮ condition on all correlated variables, or select some of them

Strobl et al. (2008); available in cforest from party version 0.9-994:
varimp(obj, conditional = TRUE)

SLIDE 38

Simulation study

◮ dgp: y_i = \beta_1 x_{i,1} + \cdots + \beta_{12} x_{i,12} + \varepsilon_i, with \varepsilon_i i.i.d. \sim N(0, 0.5)

◮ X_1, \ldots, X_{12} \sim N(0, \Sigma), where X_1, \ldots, X_4 are block-correlated with pairwise correlation 0.9 and the remaining predictors are uncorrelated:

\Sigma = \begin{pmatrix}
1 & 0.9 & 0.9 & 0.9 & & & \\
0.9 & 1 & 0.9 & 0.9 & & 0 & \\
0.9 & 0.9 & 1 & 0.9 & & & \\
0.9 & 0.9 & 0.9 & 1 & & & \\
& & & & 1 & & \\
& 0 & & & & \ddots & \\
& & & & & & 1
\end{pmatrix}

X_j:  X_1  X_2  X_3  X_4  X_5  X_6  X_7  X_8  ⋯  X_12
β_j:   5    5    2    0   −5   −5   −2    0   ⋯   0
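The dgp can be sketched as follows (the study itself was run in R; here Python). Two modelling choices are our reading of the slide: the 0.9-correlated block X_1, ..., X_4 is built from a shared latent factor, 0.5 is treated as the error variance, and the coefficient vector β = (5, 5, 2, 0, −5, −5, −2, 0, ..., 0) follows Strobl et al. (2008).

```python
import random
from math import sqrt

beta = [5, 5, 2, 0, -5, -5, -2, 0, 0, 0, 0, 0]

def draw_predictors(rng):
    """X_1..X_4 with pairwise correlation 0.9; X_5..X_12 independent N(0, 1)."""
    latent = rng.gauss(0, 1)
    x = [sqrt(0.9) * latent + sqrt(0.1) * rng.gauss(0, 1) for _ in range(4)]
    return x + [rng.gauss(0, 1) for _ in range(8)]

def draw_obs(rng):
    x = draw_predictors(rng)
    y = sum(b * xi for b, xi in zip(beta, x)) + rng.gauss(0, sqrt(0.5))
    return x, y

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return num / (sum((u - ma) ** 2 for u in a)
                  * sum((v - mb) ** 2 for v in b)) ** 0.5

rng = random.Random(0)
xs = [draw_predictors(rng) for _ in range(5000)]
c12 = corr([r[0] for r in xs], [r[1] for r in xs])   # inside the correlated block
c15 = corr([r[0] for r in xs], [r[4] for r in xs])   # across blocks
```

The latent-factor construction gives each block variable variance 0.9 + 0.1 = 1 and pairwise covariance 0.9, matching the Σ above.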

SLIDE 39

Results

[Figure: distributions of variable importance for variables 1 to 12, in panels for mtry = 1, 3, 8]

SLIDE 40

Peptide-binding data

[Figure: unconditional vs. conditional variable importance (scale 0 to 0.005) for the peptide-binding data, with the predictors h2y8, flex8, and pol3 highlighted]

SLIDE 41

Summary


SLIDE 44

Summary

If your predictor variables are of different types: use cforest (pkg: party) with the default option controls = cforest_unbiased() and the permutation importance varimp(obj).

Otherwise: feel free to use cforest (pkg: party) with the permutation importance varimp(obj), or randomForest (pkg: randomForest) with the permutation importance importance(obj, type=1) or the Gini importance importance(obj, type=2), but don’t fall for the z-score! (i.e. set scale=FALSE)

If your predictor variables are highly correlated: use the conditional importance in cforest (pkg: party).


SLIDE 46

References

Archer, K. J. and R. V. Kimes (2008). Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis 52(4), 2249–2260.

Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32.

Breiman, L. and A. Cutler (2008). Random Forests – Classification Manual. Website accessed in 1/2008; http://www.math.usu.edu/~adele/forests.

Breiman, L., A. Cutler, A. Liaw, and M. Wiener (2006). Breiman and Cutler’s Random Forests for Classification and Regression. R package version 4.5-16.

Díaz-Uriarte, R. (2007). GeneSrF and varSelRF: A web-based tool and R package for gene selection and classification using random forest. BMC Bioinformatics 8:328.

SLIDE 47

Hothorn, T., K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15(3), 651–674.

Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8:25.

Strobl, C. and A. Zeileis (2008). Danger: High power! – Exploring the statistical properties of a test for random forest variable importance. In Proceedings of the 18th International Conference on Computational Statistics, Porto, Portugal.

Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008). Conditional variable importance for random forests. BMC Bioinformatics 9:307.