Uncovering interactions with Random Forests Jake Michaelson Marit - - PowerPoint PPT Presentation

slide-1
SLIDE 1

Uncovering interactions with Random Forests

Jake Michaelson Marit Ackermann Andreas Beyer

slide-2
SLIDE 2

Random Forests

>> ensembles of decision trees
>> diverse trees trying to solve the same problem
>> used frequently for:
   >> prediction (knowledge of model less important)
   >> feature selection (prediction less important)

slide-3
SLIDE 3

RF interactions: prior art

>> online official RF manual
>> Lunetta, et al. (2004)
>> Bureau, et al. (2005)
   >> pairwise permutation importance
>> Mao and Mao (2008)
>> Jiang, et al. (2009)
   >> selection with RF Gini importance, conventional (LM-based) interaction test (up to 3-way)

slide-4
SLIDE 4

a typical problem

[Figure, two panels: selection frequency (y-axis) across predictors (x-axis)]

slide-7
SLIDE 7

a typical problem

[Figure, two panels: selection frequency (y-axis) across predictors (x-axis)]

?

slide-8
SLIDE 8

split symmetry

[Figure: tree diagrams with nodes labeled A and B]

slide-9
SLIDE 9

split asymmetry

[Figure: tree diagrams with nodes labeled A and B]

slide-10
SLIDE 10

testing split symmetry

>> independence of predictors A and B:
   >> expect B as left daughter 50% of the time
   >> expect B as right daughter 50% of the time
>> the prior (a beta density) is centered around 0.5
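
The left/right daughter counts behind this test could be tallied along the following lines. This is a minimal sketch, not the authors' implementation: the tuple encoding of a tree and the function name `count_daughter_sides` are assumptions made for illustration, and "daughter" is taken to mean a direct child split.

```python
# Hypothetical tree encoding (an assumption, not from the slides):
# an internal node is (split_variable, left_subtree, right_subtree); a leaf is None.

def count_daughter_sides(trees, parent, child):
    """Count how often `child` is the left or right daughter split of `parent`."""
    left = right = 0
    stack = list(trees)
    while stack:
        node = stack.pop()
        if node is None:  # leaf: nothing to count
            continue
        var, left_sub, right_sub = node
        if var == parent:
            if left_sub is not None and left_sub[0] == child:
                left += 1
            if right_sub is not None and right_sub[0] == child:
                right += 1
        stack.extend((left_sub, right_sub))
    return left, right


# Toy forest: B follows A equally often on both sides.
forest = [
    ("A", ("B", None, None), ("B", None, None)),
    ("A", None, ("B", None, None)),
    ("C", ("A", ("B", None, None), None), None),
]
ab_left, ab_right = count_daughter_sides(forest, "A", "B")
```

Under independence of A and B, the two counts would be expected to be roughly equal across a large forest.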

slide-11
SLIDE 11

testing split symmetry

[Figure: density (y-axis) against proportion (x-axis, 0.0 to 1.0)]

slide-12
SLIDE 12

testing split symmetry

>> we update the prior density parameters with the observed left/right daughter counts:
   >> a_posterior = a_prior + AB_left
   >> b_posterior = b_prior + AB_right
>> ... and take the posterior/prior density ratio at 0.5
>> this is the Bayes factor
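
The update and density ratio described on this slide can be sketched as follows. The prior parameters (`a_prior = b_prior = 10`) are placeholders chosen for illustration; the slides do not state the values actually used, and the function names are hypothetical.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm)


def symmetry_bayes_factor(ab_left, ab_right, a_prior=10.0, b_prior=10.0):
    """Posterior/prior density ratio at 0.5 after adding the daughter counts.

    a_prior and b_prior are assumed values, not taken from the slides.
    """
    a_post = a_prior + ab_left
    b_post = b_prior + ab_right
    return beta_pdf(0.5, a_post, b_post) / beta_pdf(0.5, a_prior, b_prior)


bf_balanced = symmetry_bayes_factor(50, 50)  # counts consistent with symmetry
bf_lopsided = symmetry_bayes_factor(90, 10)  # counts strongly favoring one side
```

A ratio above 1 means the data have concentrated mass at 0.5 (supporting symmetry); a ratio far below 1 means the posterior has moved away from 0.5.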

slide-13
SLIDE 13

testing split symmetry

[Figure: density (y-axis) against proportion (x-axis, 0.0 to 1.0)]

slide-14
SLIDE 14

building a graph

>> using the Bayes factor from each pair of predictors, we calculate the posterior probability of symmetry
   >> i.e. that the true proportion is 0.5
>> we use a high prior probability of the hypothesis (e.g. p_h = 0.999999)
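
One standard way to turn a Bayes factor into a posterior probability is to multiply it by the prior odds. The sketch below assumes this is the conversion intended here, which the slides do not spell out; the function name is hypothetical and the default prior is the example value quoted above.

```python
def posterior_prob_symmetry(bayes_factor, prior_prob=0.999999):
    """Posterior probability of symmetry from a Bayes factor in favor of symmetry.

    Uses posterior odds = Bayes factor * prior odds (assumed conversion).
    """
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


p_mild = posterior_prob_symmetry(1e-3)    # modest evidence against symmetry
p_strong = posterior_prob_symmetry(1e-9)  # overwhelming evidence against symmetry
```

With a prior this close to 1, only a very small Bayes factor pulls the posterior probability of symmetry down, which makes the resulting graph conservative about adding edges.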

slide-15
SLIDE 15

building a graph

posterior probabilities

     A      B      C      D
A    1      0.001  0.001  0.3
B    0.8    1      0.99   0.2
C    0.99   0.3    1      0.003
D    1      0.89   0.99   1

adjacency matrix

     A  B  C  D
A    0  1  1  0
B    0  0  0  0
C    0  0  0  1
D    0  0  0  0

[Figure: graph with nodes A, B, C, D]
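
Going from posterior probabilities to an adjacency matrix can be sketched as a simple threshold. The values below are the ones shown on this slide, read row by row (the row-major reading is an assumption); the cutoff of 0.05 and the function names are also assumptions, since the slides do not state the rule used.

```python
labels = ["A", "B", "C", "D"]

# Posterior probabilities of symmetry from the slide, assumed to be row-major.
posterior = [
    [1.0, 0.001, 0.001, 0.3],
    [0.8, 1.0, 0.99, 0.2],
    [0.99, 0.3, 1.0, 0.003],
    [1.0, 0.89, 0.99, 1.0],
]


def adjacency_from_posterior(posterior, threshold=0.05):
    """Mark an edge wherever the posterior probability of symmetry is below the cutoff."""
    n = len(posterior)
    return [
        [1 if i != j and posterior[i][j] < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def edge_list(adjacency, labels):
    """List the (row label, column label) pairs flagged in the adjacency matrix."""
    return [
        (labels[i], labels[j])
        for i, row in enumerate(adjacency)
        for j, flag in enumerate(row)
        if flag
    ]


adjacency = adjacency_from_posterior(posterior)
edges = edge_list(adjacency, labels)
print(edges)  # [('A', 'B'), ('A', 'C'), ('C', 'D')]
```

The three flagged pairs match the three entries set to 1 in the slide's adjacency matrix.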

slide-16
SLIDE 16

simulations

>> 1000 binary predictor variables, 200 observations
>> 3 - 4 predictors participate in true model
>> tested ability of the method to recover the true topology of the simulated model
>> recorded TP, FP while varying mtry and ntree
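
A data set of the shape described here could be simulated roughly as below. Only the dimensions (1000 binary predictors, 200 observations) come from the slides; the response model (a 3-way interaction of the first three predictors with effect size 2.0 and unit Gaussian noise) and the function name are purely illustrative assumptions.

```python
import random


def simulate(n_obs=200, n_pred=1000, seed=0):
    """Simulate binary predictors and a response from a hypothetical true model."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(n_pred)] for _ in range(n_obs)]
    # Assumed true model: predictors 0, 1 and 2 act only jointly (3-way interaction).
    y = [2.0 * (row[0] & row[1] & row[2]) + rng.gauss(0.0, 1.0) for row in X]
    return X, y


X, y = simulate()
```

Fixing the seed keeps each simulated data set reproducible while mtry and ntree are varied.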

slide-17
SLIDE 17

test models

3 independent effects (i.e. no edges)

[Figure: model diagram with nodes A, B, C]
[Figure: TP and FP panels, each plotted over mtry and ntree]

slide-18
SLIDE 18

test models

3-way unordered interaction

[Figure: model diagram with nodes A, B, C]
[Figure: TP and FP panels, each plotted over mtry and ntree]

slide-19
SLIDE 19

test models

one main effect, one ordered 3-way interaction, one ordered 2-way interaction

[Figure: model diagram with nodes A, B, C, D]
[Figure: TP and FP panels, each plotted over mtry and ntree]

slide-20
SLIDE 20

test models

two independent, ordered two-way interactions

[Figure: model diagram with nodes A, B, C, D]
[Figure: TP and FP panels, each plotted over mtry and ntree]

slide-21
SLIDE 21

real world

>> Gabrb3
   >> neurotransmitter receptor subunit
   >> absence (or misexpression) yields autism-like behavior
>> what mechanisms influence Gabrb3 expression?

Livet, et al. (2007)

slide-22
SLIDE 22

regulation of Gabrb3

[Figure: graph with node labels CEL.2_73370728, rs13478123, gnf07.050.858, rs13479274, rs13479276, CEL.7_46763479, rs3693478, rs6166250, rs13481641, rs4220193, rs6375622, rs4221305, rs3164054, rs8274734]

grow an RF that regresses hippocampal Gabrb3 expression on the genotypes (m=3,794) of the same population of mice, then extract the interaction graph

slide-23
SLIDE 23

regulation of Gabrb3

[Figure: graph with node labels CEL.2_73370728, rs13478123, gnf07.050.858, rs13479274, rs13479276, CEL.7_46763479, rs3693478, rs6166250, rs13481641, rs4220193, rs6375622, rs4221305, rs3164054, rs8274734; annotation: L1]

grow an RF that regresses hippocampal Gabrb3 expression on the genotypes (m=3,794) of the same population of mice, then extract the interaction graph

slide-24
SLIDE 24

regulation of Gabrb3

[Figure: graph with node labels CEL.2_73370728, rs13478123, gnf07.050.858, rs13479274, rs13479276, CEL.7_46763479, rs3693478, rs6166250, rs13481641, rs4220193, rs6375622, rs4221305, rs3164054, rs8274734; annotations: L1, L2]

grow an RF that regresses hippocampal Gabrb3 expression on the genotypes (m=3,794) of the same population of mice, then extract the interaction graph

slide-25
SLIDE 25

regulation of Gabrb3

[Figure: graph with node labels CEL.2_73370728, rs13478123, gnf07.050.858, rs13479274, rs13479276, CEL.7_46763479, rs3693478, rs6166250, rs13481641, rs4220193, rs6375622, rs4221305, rs3164054, rs8274734; annotations: L1, L2, L3]

grow an RF that regresses hippocampal Gabrb3 expression on the genotypes (m=3,794) of the same population of mice, then extract the interaction graph

slide-26
SLIDE 26

regulation of Gabrb3

[Figure: graph with node labels CEL.2_73370728, rs13478123, gnf07.050.858, rs13479274, rs13479276, CEL.7_46763479, rs3693478, rs6166250, rs13481641, rs4220193, rs6375622, rs4221305, rs3164054, rs8274734; annotations: L1, L2, L3]

grow an RF that regresses hippocampal Gabrb3 expression on the genotypes (m=3,794) of the same population of mice, then extract the interaction graph

[Figure: schematic with labels genomic variation, L1, L2, L3, Gabrb3 expression]

slide-28
SLIDE 28

regulation of Gabrb3

[Figure: graph with node labels CEL.2_73370728, rs13478123, gnf07.050.858, rs13479274, rs13479276, CEL.7_46763479, rs3693478, rs6166250, rs13481641, rs4220193, rs6375622, rs4221305, rs3164054, rs8274734; annotations: L1, L2, L3]

L1 - Gabrb3 (cis effect)
L2 - Dscam (axon guidance)
L3 - Magi2 (synaptic scaffolding)

[Figure: schematic with labels genomic variation, L1, L2, L3, Gabrb3 expression]

slide-29
SLIDE 29

the context

slide-30
SLIDE 30

the context

Gabrb3 Magi2 Dscam

slide-31
SLIDE 31

conclusion

>> (a)symmetry of transitions between subsequently selected variables can give us clues about the degree of dependence between them
>> constructing a graph of these dependencies can illustrate the emergent dependency structure of the predictors in light of the response

slide-32
SLIDE 32

forthcoming...

>> does this work for continuous and categorical predictors?
>> what about correlated predictors?
>> strategy for choosing optimal mtry and ntree?

slide-33
SLIDE 33

RF is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.

- Breiman & Cutler

slide-34
SLIDE 34

Thanks!

jacob.michaelson@biotec.tu-dresden.de