

SLIDE 1

Chapter 5. Tree-based Methods

Wei Pan

Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu

PubH 7475/8475 © Wei Pan

SLIDE 2

Regression And Classification Tree (CART)

◮ §9.2: Breiman et al. (1984); ≈ C4.5 (Quinlan 1993).

◮ Main idea: approximate any f(x) by a piece-wise constant $\hat{f}(x)$.

◮ Use recursive partitioning (Fig 9.2):

1) Partition the x space into two regions $R_1$ and $R_2$ by a split $x_j < c_j$; 2) partition $R_1$ and $R_2$; 3) then their sub-regions, ... until the model fits the data well.

◮ The fitted function $\hat{f}(x) = \sum_m c_m I(x \in R_m)$ can be represented as a (decision) tree.
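To make the representation concrete, here is a minimal R sketch of a piece-wise constant $\hat{f}$ with two regions; the split point and the constants $c_m$ are made up for illustration.

```r
## Piece-wise constant f_hat with two made-up regions:
## R1 = {x < 3}, R2 = {x >= 3}, with constants c_1 = 2.5, c_2 = 7.0.
f_hat <- function(x) {
  c_m <- c(2.5, 7.0)             # illustrative fitted constants c_1, c_2
  region <- ifelse(x < 3, 1, 2)  # which region each x falls in: I(x in R_m)
  c_m[region]
}

f_hat(c(1, 2.9, 3, 10))          # returns 2.5 2.5 7.0 7.0
```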

SLIDE 3

[Figure 9.2, Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap 9.]

FIGURE 9.2. Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting.

SLIDE 4

Regression Tree

◮ Y: continuous.

◮ Key: 1) determine the splitting variables and split points (e.g. $x_j < t_j$) $\Rightarrow$ $R_1, R_2, \ldots$; 2) determine $c_m$ in each $R_m$.

◮ In 1), use a sequential (greedy) search: for each j and s, consider the split $x_j < s$ with $R_1(j,s) = \{x \mid x_j < s\}$ and $R_2(j,s) = \{x \mid x_j \ge s\}$, and solve
$$\min_{j,s} \Big[ \min_{c_1} \sum_{X_i \in R_1(j,s)} (Y_i - c_1)^2 + \min_{c_2} \sum_{X_i \in R_2(j,s)} (Y_i - c_2)^2 \Big].$$

◮ In 2), given $R_1$ and $R_2$, $\hat{c}_k = \text{Ave}(Y_i \mid X_i \in R_k)$ for k = 1, 2.

◮ Repeat the process on $R_1$ and $R_2$ respectively, ...

◮ When to stop? We have to stop when all the $Y_i$'s in an $R_m$ are equal or there are too few of them; the tree size gives a measure of model complexity!
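A minimal R sketch of the greedy search over split points s for a single predictor (in CART the same scan runs over every predictor j); the simulated data are purely illustrative.

```r
## Greedy search for the best split point s on one predictor x:
## for each candidate s, set c1 = mean(Y in R1), c2 = mean(Y in R2),
## and keep the s that minimizes the total residual sum of squares.
best_split <- function(x, y) {
  s_cand <- sort(unique(x))[-1]        # candidates; drop the min so R1 is nonempty
  rss <- sapply(s_cand, function(s) {
    left <- x < s                      # R1(j, s) = {x : x < s}
    sum((y[left]  - mean(y[left]))^2) +
    sum((y[!left] - mean(y[!left]))^2)
  })
  list(s = s_cand[which.min(rss)], rss = min(rss))
}

set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.4, 1, 3) + rnorm(100, sd = 0.2)
best_split(x, y)   # recovers a split point near 0.4
```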

SLIDE 5

◮ A strategy: first grow a large tree, then prune it.

◮ Cost-complexity criterion for tree T:

$$C_\alpha(T) = \text{RSS}(T) + \alpha|T| = \sum_m \sum_{X_i \in R_m} (Y_i - \hat{c}_m)^2 + \alpha|T|,$$

where |T| is the number of terminal nodes (leaves) and $\alpha > 0$ is a tuning parameter to be determined by CV.
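A sketch of the grow-then-prune strategy using the rpart package, whose cp parameter plays the role of α (rescaled by the root-node error); the Boston housing data from MASS serve only as a stand-in example.

```r
library(rpart)

## Grow a large tree (cp = 0 disables the default early stopping),
## then prune back with the complexity parameter chosen by the
## built-in 10-fold cross-validation.
fit <- rpart(medv ~ ., data = MASS::Boston, method = "anova",
             control = rpart.control(cp = 0, xval = 10))

cv <- fit$cptable                          # columns: CP, nsplit, rel error, xerror, xstd
best_cp <- cv[which.min(cv[, "xerror"]), "CP"]

## Fig 9.4 uses the one-standard-error rule; here we simply take the CV minimum.
pruned <- prune(fit, cp = best_cp)         # the subtree minimizing C_alpha(T)
```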

SLIDE 6

[Figure 9.4, Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap 9.]

FIGURE 9.4. Results for spam example. The blue curve is the 10-fold cross-validation estimate of misclassification rate as a function of tree size, with standard error bars. The minimum occurs at a tree size with about 17 terminal nodes (using the "one-standard-error" rule). The orange curve is the test error, which tracks the CV error quite closely. The cross-validation is indexed by values of α, shown above. The tree sizes shown below refer to |T_α|, the size of the original tree indexed by α.

SLIDE 7

Classification Tree

◮ $Y_i \in \{1, 2, \ldots, K\}$.

◮ Classify obs's in node m to the majority class:
$$\hat{p}_{mk} = \frac{1}{n_m} \sum_{X_i \in R_m} I(Y_i = k), \qquad k(m) = \arg\max_k \hat{p}_{mk}.$$

◮ Impurity measure $Q_m(T)$ (squared error was used in regression trees):

1. Misclassification error: $\frac{1}{n_m} \sum_{X_i \in R_m} I(Y_i \ne k(m)) = 1 - \hat{p}_{m,k(m)}$.

2. Gini index: $\sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$.

3. Cross-entropy or deviance: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$.

◮ For K = 2, measures 1-3 reduce to $1 - \max(\hat{p}, 1 - \hat{p})$, $2\hat{p}(1 - \hat{p})$ and $-\hat{p} \log \hat{p} - (1 - \hat{p}) \log(1 - \hat{p})$ respectively, where $\hat{p}$ is the proportion in one of the two classes. They look similar; see Fig 9.3.

◮ Example: ex5.1.r
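The three two-class impurities, written out as R functions matching the formulas above; plotting them reproduces the comparison in Fig 9.3.

```r
## Node impurities for the two-class case, as functions of p = p_hat(class 1).
misclass <- function(p) 1 - pmax(p, 1 - p)
gini     <- function(p) 2 * p * (1 - p)
entropy  <- function(p) -p * log(p) - (1 - p) * log(1 - p)

p <- seq(0.01, 0.99, by = 0.01)
## All three peak at p = 0.5 (maximal impurity) and vanish as p -> 0 or 1;
## Gini and entropy are smooth, the misclassification error is not.
plot(p, entropy(p), type = "l", ylab = "impurity")
lines(p, gini(p), lty = 2)
lines(p, misclass(p), lty = 3)
```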

SLIDE 8

◮ Advantages:

1. Easy to incorporate unequal losses of misclassifications: $\frac{1}{n_m} \sum_{X_i \in R_m} w_i I(Y_i \ne k(m))$ with $w_i = C_k$ if $Y_i = k$ (see the rpart sketch after this list).

2. Handling missing data: use a surrogate splitting variable/value at each node (to best approximate the selected one).
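For item 1, rpart accepts a loss matrix through its parms argument; here is a sketch on the kyphosis data that ships with rpart, with an illustrative 4:1 cost ratio that is not from the slides.

```r
library(rpart)

## Unequal misclassification losses via rpart's loss matrix:
## rows = true class, columns = predicted class (0 on the diagonal),
## in the order of the factor levels ("absent", "present").
## Illustratively, calling a 'present' case 'absent' costs 4x as much.
L <- matrix(c(0, 1,
              4, 0), nrow = 2, byrow = TRUE)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             parms = list(loss = L))
```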

◮ Extensions:

1. May use non-binary splits;

2. A linear combination of multiple variables as the splitting variable: more flexible, but better?

◮ +: easy interpretation (decision trees!)

◮ −: unstable due to the greedy search and discontinuity; predictive performance is not the best.

◮ R packages tree, rpart; commercial CART.

◮ Other implementations: C4.5/C5.0; FIRM by Prof Hawkins (U of M), to detect interactions; methods by Prof Loh's group (UW-Madison), for count, survival, ... data, with regression in each terminal node; ...

SLIDE 9

[Figure 9.5, Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap 9.]

FIGURE 9.5. The pruned tree for the spam example.

SLIDE 10

[Figure 9.6, Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap 9. Axes: specificity vs. sensitivity; legend: Tree (0.95), GAM (0.98), Weighted Tree (0.90).]

FIGURE 9.6. ROC curves for the classification rules fit to the spam data. Curves that are closer to the northeast corner represent better classifiers. In this case the GAM classifier dominates the trees. The weighted tree achieves better sensitivity for higher specificity than the unweighted tree. The numbers in the legend represent the area under the curve.

SLIDE 11

Application: personalized medicine

◮ Also called subgroup analysis (or precision medicine): identify subgroups of patients that would benefit most from a treatment.

◮ Statistical problem: detect (qualitative) treatment-predictor interactions! Quantitative interactions differ in magnitude but are in the same direction; qualitative interactions differ in direction (see the toy simulation at the end of this slide).

◮ Many approaches ... one of them is to use trees.

◮ Prof Loh's GUIDE: http://www.stat.wisc.edu/~loh/guide.html

◮ An example: http://onlinelibrary.wiley.com/doi/10.1002/sim.6454/abstract

◮ Another example: https://www.ncbi.nlm.nih.gov/pubmed/24983709
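A toy simulation (with made-up effect sizes) contrasting the two kinds of interaction; here the treatment helps the subgroup with x = 1 and harms the subgroup with x = 0, a qualitative interaction that a subgroup method should detect.

```r
## Toy illustration with made-up effect sizes: the treatment effect is +2
## when x = 1 and -2 when x = 0 (direction flips: qualitative interaction).
## For a quantitative interaction, use instead
##   y <- (1 + 2*x)*trt + rnorm(n)   # effects +1 and +3, same direction.
set.seed(1)
n   <- 400
trt <- rbinom(n, 1, 0.5)                     # randomized treatment indicator
x   <- rbinom(n, 1, 0.5)                     # binary predictor defining subgroups
y   <- 2*trt*x - 2*trt*(1 - x) + rnorm(n)    # qualitative trt-x interaction

## The fitted trt main effect is about -2 and the trt:x coefficient about +4,
## so the estimated treatment effect changes sign across the two subgroups.
summary(lm(y ~ trt * x))
```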