Bayesian Classification and Regression Trees - James Cussens, York (PowerPoint presentation transcript)



SLIDE 1

Bayesian Classification and Regression Trees

James Cussens, York Centre for Complex Systems Analysis & Dept of Computer Science, University of York, UK


SLIDE 2

Outline

  • Bayesian C&RT
  • Problems for Bayesian C&RT
  • Lessons from Bayesian phylogeny
  • Results

Bayesian C&RT 2


SLIDE 4

Trees are partition models

[Figure: an example classification tree over the attribute space; internal nodes split on attribute thresholds (e.g. =< 28.7, > 98) and leaves carry class counts. Run annotation: best_llhood:vst(4):msclf(89), value 238.469.]

SLIDE 5

Classification trees as probability models

  • Tree structure T partitions the attribute space.
  • Each partition (= leaf) i has its own class distribution θi = (pi1, . . . , piK). Let Θ = (θ1, θ2, . . . , θb) be the complete parameter vector for a tree T with b leaves.
  • Let x be the vector of attributes for an example, and y its class label.
  • (Θ, T) defines a conditional probability model P(y|Θ, T, x).
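As a concrete illustration of the model, here is a minimal Python sketch (the class names and the two-leaf tree are my own, not the paper's code) of routing an example x to its partition cell and reading off P(y|Θ, T, x):

```python
# A sketch of a tree as a partition model: internal nodes split on one
# attribute at a threshold; each leaf i carries its own class
# distribution theta_i = (p_i1, ..., p_iK).

class Node:
    def __init__(self, attr=None, thresh=None, left=None, right=None, theta=None):
        self.attr, self.thresh = attr, thresh     # split rule (internal nodes)
        self.left, self.right = left, right
        self.theta = theta                        # class distribution (leaves)

def leaf_for(node, x):
    """Route attribute vector x to its partition cell (leaf)."""
    while node.theta is None:
        node = node.left if x[node.attr] <= node.thresh else node.right
    return node

def p_class(node, x, y):
    """P(y | Theta, T, x): the class probability at x's leaf."""
    return leaf_for(node, x).theta[y]

# Two-leaf tree splitting attribute 0 at 28.7 (a threshold from the example tree).
tree = Node(attr=0, thresh=28.7,
            left=Node(theta=[0.9, 0.1]),
            right=Node(theta=[0.2, 0.8]))
print(p_class(tree, [25.0], 0))   # -> 0.9
print(p_class(tree, [30.0], 1))   # -> 0.8
```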

SLIDE 6

The Bayesian approach

  • Given

– Prior distribution P(Θ, T) = P(T)P(Θ|T) – Data (X, Y )

  • Compute

– Posterior distribution P(Θ, T|X, Y ) – We just care about structure: P(T|X, Y ) ∝ P(T|X)P(Y |T, X)

SLIDE 7

Defining tree structure priors with a sampler

“Instead of specifying a closed-form expression for the tree prior, P(T|X), we specify P(T|X) implicitly by a tree-generating stochastic process. Each realization of such a process can simply be considered a random draw from this prior.” (Chipman et al, JASA, 1998)

  • Grow by splitting leaves η with probability α(1 + dη)^(−β), where dη is the depth of η.
  • Splitting rules chosen uniformly.
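The growth process can be sketched as below. The values α = 0.95, β = 1 match the settings appearing in the run labels later in the deck; the splitting rules here are placeholders, not a real attribute space:

```python
import random

# Sketch of the Chipman et al. (1998) tree-generating process: start from a
# single leaf; split each leaf eta at depth d_eta with probability
# alpha * (1 + d_eta) ** (-beta), choosing a splitting rule uniformly.

def grow(depth=0, alpha=0.95, beta=1.0, rng=random):
    if rng.random() < alpha * (1 + depth) ** (-beta):
        # split this leaf: pick a rule uniformly (placeholder rule set)
        return {"split": rng.choice(["x1", "x2", "x3"]),
                "left": grow(depth + 1, alpha, beta, rng),
                "right": grow(depth + 1, alpha, beta, rng)}
    return {"leaf": True}

def n_leaves(t):
    return 1 if "leaf" in t else n_leaves(t["left"]) + n_leaves(t["right"])

random.seed(0)
t = grow()          # one random draw from the prior
print(n_leaves(t))  # tree size varies from draw to draw
```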

SLIDE 8

Sampling (approximately) from the posterior

  • Produce an approximate sample from the posterior P(T|X, Y).
  • Generate a Markov chain using the Metropolis-Hastings algorithm.
  • If at tree T, propose T′ with probability q(T′|T) and accept T′ with probability

    α(T, T′) = min( [P(T′|X, Y) / P(T|X, Y)] · [q(T|T′) / q(T′|T)], 1 )
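The accept/reject step can be sketched as follows; log_post and propose stand in for log P(T|X, Y) and the proposal q, and the toy target at the bottom is purely illustrative, not a tree posterior:

```python
import math, random

# Sketch of one Metropolis-Hastings step, working in log space for
# numerical stability.

def mh_step(T, log_post, propose, rng=random):
    T_new, log_q_fwd, log_q_rev = propose(T, rng)       # q(T'|T), q(T|T')
    log_alpha = (log_post(T_new) - log_post(T)) + (log_q_rev - log_q_fwd)
    if math.log(rng.random()) < min(log_alpha, 0.0):
        return T_new                                     # accept
    return T                                             # reject: stay at T

# Toy target on states 0..4 with unnormalised log posterior -|T - 2|,
# and a symmetric random-walk proposal (so the q terms cancel).
def propose(T, rng):
    return (T + rng.choice([-1, 1])) % 5, 0.0, 0.0

random.seed(1)
chain = [0]
for _ in range(20000):
    chain.append(mh_step(chain[-1], lambda t: -abs(t - 2.0), propose))
print(max(set(chain), key=chain.count))  # the mode at 2 should dominate
```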

SLIDE 9

Our proposals

  • We propose a new T′ by pruning T(i) at a random node and re-growing according to the prior, giving:

    α(T(i), T′) = min( [d_T(i) / d_T′] · [P(Y|T′, X) / P(Y|T(i), X)], 1 )

  • where d_T is the depth of T.
  • So big ‘jumps’ are possible.

SLIDE 10

Sometimes it’s easy

Kyphosis dataset (81 datapoints, 3 attributes, 2 classes). 50,000 MCMC iterations, no tempering:

Tree   p̂seed1(Ti)   p̂seed2(Ti)   p̂seed3(Ti)
T1     0.08326      0.07898      0.08338
T2     0.05900      0.06154      0.06170
T3     0.05574      0.05664      0.05610
T4     0.02466      0.02724      0.02790
T5     0.02564      0.02674      0.02504
T6     0.01494      0.01682      0.01530
T7     0.01390      0.01410      0.01524
T8     0.01208      0.01324      0.01288
T9     0.01212      0.01284      0.01168

SLIDE 11

Computing class probabilities for new data

Given training data (X, Y), the posterior probability that x′ has class y′ is:

p(y′|x′, X, Y) = Σ_T P(T|X, Y) ∫ p(y′|x′, Θ, T) P(Θ|T, X, Y) dΘ

We use the MCMC sample to estimate P(T|X, Y); the rest is analytically soluble.
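Given an MCMC sample of tree structures, the sum over T becomes a plain average of each sampled tree's analytic predictive distribution. A minimal sketch (trees stubbed as predictive vectors; not the MCMCMS code):

```python
# Model-averaged prediction: estimate p(y'|x', X, Y) by averaging each
# sampled tree's per-class predictive probabilities over the posterior sample.

def posterior_predictive(sampled_trees, predict, x_new):
    """Average P(y'|x', T, X, Y) over an MCMC sample of tree structures."""
    K = len(predict(sampled_trees[0], x_new))
    probs = [0.0] * K
    for T in sampled_trees:
        probs = [a + b for a, b in zip(probs, predict(T, x_new))]
    return [p / len(sampled_trees) for p in probs]

# Three sampled "trees", each giving a class distribution at x_new.
sample = [{"p": [0.9, 0.1]}, {"p": [0.7, 0.3]}, {"p": [0.8, 0.2]}]
print(posterior_predictive(sample, lambda T, x: T["p"], None))  # approx [0.8, 0.2]
```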

SLIDE 12

Comparing class probabilities in an easy case

[Figure: scatter plot comparing predicted class probabilities from two MCMC runs (seeds 512 vs. 883).]

Dataset=K, iterations=50,000, tempering=FALSE


SLIDE 14

Usually it’s not easy

“. . . the algorithm gravitates quickly towards [regions of large posterior probability] and then stabilizes, moving locally in that region for a long time. Evidently, this is a consequence of a proposal distribution that makes local moves over a sharply peaked multimodal posterior. Once a tree has reasonable fit, the chain is unlikely to move away from a sharp local mode by small steps. . . . Although different move types might be implemented, we believe that any MH algorithm for CART models will have difficulty moving between local modes.” (Chipman et al, 1998)

SLIDE 15

Where there is room for improvement

[Figure: scatter plot comparing predicted class probabilities from two MCMC runs (seeds 447 vs. 938).]

Dataset=BCW, iterations=250,000, tempering=F

SLIDE 16

Where there is room for improvement

[Figure: scatter plot comparing predicted class probabilities from two MCMC runs (seeds 938 vs. 447).]

Dataset=BCW, iterations=250,000, tempering=F


SLIDE 18

The same problem for Bayesian phylogeny

“The posterior probability of trees can contain multiple peaks. . . . MCMC can be prone to entrapment in local optima; a Markov chain currently exploring a peak of high probability may experience difficulty crossing valleys to explore other peaks.” (Altekar et al, 2004)

MrBayes is at http://morphbank.ebc.uu.se/mrbayes/

SLIDE 19

A solution: (power) tempering

  • As well as the ‘cold’ chain with stationary distribution P(T|X, Y),
  • have ‘hot’ chains with stationary distributions P(T|X, Y)^β for 0 < β < 1,
  • and swap states between chains.
  • Only states visited by the cold chain count.

SLIDE 20

Acceptance probabilities for tempering

α^β_uc(T(i), T′) = min( [d_T(i) / d_T′] · [P(Y|T′, X) / P(Y|T(i), X)]^β, 1 )

α_swap = min( [P(Y|T2, X) / P(Y|T1, X)]^(β1 − β2), 1 )

Bayesian C&RT 20
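The swap move between a pair of tempered chains can be sketched as below; log_ml stands in for log P(Y|T, X), and the tree labels are illustrative, not the paper's code:

```python
import math, random

# Sketch of the state-swap move: chains at inverse temperatures beta1, beta2
# holding trees T1, T2 exchange states with probability
# min([P(Y|T2,X)/P(Y|T1,X)]^(beta1 - beta2), 1), computed in log space.

def try_swap(T1, T2, beta1, beta2, log_ml, rng=random):
    log_alpha = (beta1 - beta2) * (log_ml(T2) - log_ml(T1))
    if math.log(rng.random()) < min(log_alpha, 0.0):
        return T2, T1          # swap accepted
    return T1, T2              # swap rejected

# If the hotter chain (beta2 < beta1) holds the better-fitting tree,
# log_alpha >= 0 and the swap is always accepted.
random.seed(0)
log_ml = {"good": -10.0, "bad": -50.0}.__getitem__
print(try_swap("bad", "good", 1.0, 0.8333, log_ml))  # ('good', 'bad')
```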

slide-21
SLIDE 21
  • Bayesian C&RT
  • Problems for Bayesian C&RT
  • Lessons from Bayesian phylogeny
  • Results

Bayesian C&RT 21

SLIDE 22

The small print

  • Copied MrBayes defaults: βi = 1/(1 + ∆T(i − 1)) for i = 1, 2, 3, 4, where ∆T = 0.2.
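As a quick check of the MrBayes-default ladder above, the four inverse temperatures work out to 1, 1/1.2, 1/1.4 and 1/1.6:

```python
# beta_i = 1 / (1 + dT*(i-1)) with dT = 0.2 for chains i = 1..4;
# chain 1 is the cold chain (beta = 1), the rest are progressively hotter.

def beta_ladder(n_chains=4, dT=0.2):
    return [1.0 / (1.0 + dT * (i - 1)) for i in range(1, n_chains + 1)]

print(beta_ladder())  # [1.0, 0.833..., 0.714..., 0.625]
```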

SLIDE 23

Datasets

Name    Size   |x|  |Y|  Pos%(Tr)  Pos%(HO)
K         81    3    2    81.5%     68.8%
BCW      683    9    2    66.2%     60.3%
PIMA     768    8    2    65.4%     64.1%
LR     20000   16   26     3.85%     4.3%
WF      5000   40    3    35.6%     33.4%

Holdout set (HO) is 20% of the data.

SLIDE 24

BCW: 50K, Temp=F vs 50K, Temp=T

[Figures: two scatter plots of predicted class probabilities (seeds 512 vs. 209), without and with tempering.]

SLIDE 25

PIMA: 50K, Temp=F vs 50K, Temp=T

[Figures: two scatter plots of predicted class probabilities (seeds 883 vs. 512), without and with tempering.]

SLIDE 26

PIMA: 250K, Temp=F vs 50K, Temp=T

[Figures: two scatter plots of predicted class probabilities (seeds 938 vs. 447 at 250,000 iterations; seeds 883 vs. 512 at 50,000 iterations).]

SLIDE 27

PIMA: 250K, Temp=F vs 250K, Temp=T

[Figures: two scatter plots of predicted class probabilities (seeds 938 vs. 447 and 447 vs. 938), 250,000 iterations.]

SLIDE 28

LR: 50K, Temp=F vs 50K, Temp=T

[Figures: two scatter plots of predicted class probabilities (seeds 512 vs. 209 and 883 vs. 209), without and with tempering.]

SLIDE 29

WF: 50K, Temp=F vs 50K, Temp=T

[Figures: two scatter plots of predicted class probabilities (seeds 512 vs. 209 and 883 vs. 209), without and with tempering.]

SLIDE 30

Stability of classification accuracy on hold-out set for 3 MCMC runs with and without tempering

         Temp=F          Temp=T
Data     acc    σacc    acc    σacc    rpart   Time per 1000
K        68.8%  0.0%    68.8%  0.0%    75.0%      5s
BCW      96.1%  1.2%    95.8%  0.3%    95.5%     17s
PIMA     76.9%  3.2%    73.6%  1.6%    76.4%    129s
LR       62.4%  3.6%    66.9%  0.1%    46.1%   2368s
WF       71.0%  3.7%    72.5%  2.9%    74.1%   1151s

SLIDE 31

Materials

  • This SLP and other materials used are available from http://www-users.cs.york.ac.uk/aig/slps/mcmcms/
  • Look in the pbl/icml05 directory of the MCMCMS distribution.
  • Includes scripts for reproducing the figures in this paper.

SLIDE 32

Future work

  • Tempering plus informative priors.
  • Currently applying the same approach to MCMC for Bayesian networks.

SLIDE 33

Bayesian Additive Regression Trees

BART uses a sum-of-trees model:

Y ∼ g1(x) + g2(x) + . . . + gm(x) + ε,   ε ∼ N(0, σ²)

where each gi is a regression tree.

  • Do MCMC by ‘Gibbs sampling’: get a new tree for gi conditional on all the others.
  • The distribution for the new tree only depends on the residual produced from the other trees.
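The residual computation at the heart of that update can be sketched as follows; the stub 'trees' are plain functions for illustration, not the BART implementation:

```python
# BART-style 'Gibbs' update, step one: to refit tree g_i, compute the
# partial residual of Y against the sum of all the other trees. The
# conditional distribution for the new g_i depends on the data only
# through this residual.

def partial_residual(Y, trees, X, i):
    """R = Y - sum_{j != i} g_j(x): the target for refitting tree g_i."""
    return [y - sum(g(x) for j, g in enumerate(trees) if j != i)
            for x, y in zip(X, Y)]

# Two stub 'trees' plus a third to be refit against the residual.
g1 = lambda x: 1.0 if x <= 0.5 else 2.0
g2 = lambda x: 0.5
g3 = lambda x: 0.0   # current (poor) fit for tree 3
X, Y = [0.2, 0.8], [2.0, 3.0]
print(partial_residual(Y, [g1, g2, g3], X, 2))  # [0.5, 0.5]
```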

SLIDE 34

Getting that posterior

  • Bayes theorem: P(Θ, T|X, Y) ∝ P(Θ, T|X) P(Y|Θ, T, X)
  • We restrict attention to tree structures, so integrate away the parameters Θ:

    P(T|X, Y) = P(T|X) ∫ P(Y|Θ, T, X) P(Θ|T, X) dΘ = P(T|X) P(Y|T, X)

  • The marginal likelihood P(Y|T, X) is easy to compute . . .
  • . . . since we use Dirichlet distributions for P(Θ|T, X) (and other standard assumptions).
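With independent Dirichlet priors on each leaf's class distribution, P(Y|T, X) factorises over leaves into closed-form Dirichlet-multinomial terms. A sketch of that computation (the uniform alpha = 1 here is an illustrative assumption, not necessarily the prior used in the paper):

```python
import math

# log P(Y|T,X) under a symmetric Dirichlet(alpha) prior at each leaf:
# each leaf with class counts n_1..n_K contributes the log of
# B(alpha + n) / B(alpha), written with log-gamma functions.

def log_marginal_likelihood(leaf_counts, alpha=1.0):
    total = 0.0
    for counts in leaf_counts:          # counts n_1..n_K for one leaf
        K, n = len(counts), sum(counts)
        total += math.lgamma(K * alpha) - math.lgamma(K * alpha + n)
        total += sum(math.lgamma(alpha + nk) - math.lgamma(alpha)
                     for nk in counts)
    return total

# A pure leaf scores higher than a mixed leaf of the same size.
pure, mixed = [[10, 0]], [[5, 5]]
print(log_marginal_likelihood(pure) > log_marginal_likelihood(mixed))  # True
```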

SLIDE 35

Comparing class probabilities in an easy case

[Figure: scatter plot comparing predicted class probabilities from two MCMC runs (seeds 883 vs. 209).]

Dataset=K, iterations=50,000, tempering=FALSE

SLIDE 36

BCW: 50K, Temp=F vs 50K, Temp=T

[Figures: two scatter plots of predicted class probabilities (seeds 883 vs. 512), without and with tempering.]