Trees – Applied Multivariate Statistics, Spring 2012 (slide presentation)

SLIDE 1

Trees

Applied Multivariate Statistics – Spring 2012

SLIDE 2

Overview

  • Intuition for Trees
  • Regression Trees
  • Classification Trees


SLIDE 3

Idea of Trees: Regression Trees (continuous response)

[Figure: scatter plot of y against x, partitioned by a binary tree. The root splits at X ≤ 1 vs. X > 1; the left branch splits again at X ≤ 0.3 vs. X > 0.3, the right branch at X ≤ 2 vs. X > 2. The nodes are annotated with constant predictions (Y = 1.2, 1.4, 0.7, 0.2, 1.9, 1.3, 0.3).]
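The piecewise-constant prediction such a tree encodes is just a nested conditional. A minimal sketch: the split points (1, 0.3, 2) are taken from the figure, but the assignment of leaf values to branches is an assumption for illustration.

```python
# Regression-tree prediction as nested if/else: each region of the
# x-axis gets a constant prediction. The leaf values and their
# assignment to branches are assumed, not read off the figure.
def predict(x):
    if x <= 1:                        # root split: X <= 1 vs. X > 1
        return 0.7 if x <= 0.3 else 1.2
    else:                             # right branch splits again at X = 2
        return 1.4 if x <= 2 else 0.2
```

For example, `predict(0.1)` falls in the leftmost region and returns 0.7.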

SLIDE 4

Idea of Trees: Classification Tree (discrete response)

Survived on the Titanic? Root node: 800/200 (No/Yes), majority class “No”.
  • Sex = F: 150/50, majority “No”
    • Age < 35: 3/17, predict “Yes”
    • Age ≥ 35: 147/33, predict “No”
  • Sex = M: 650/150, majority “No”
    • Age < 27: 70/130, predict “Yes”
    • Age ≥ 27: 580/20, predict “No”

Misclassification rate:

  • Total: (3+33+70+20) / 1000 = 0.126
  • “Yes”-class: 53/200 = 0.265
  • “No”-class: 73/800 = 0.09
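These rates can be recomputed from the four leaf counts alone. A quick sketch (counts and predicted labels read from the tree; which branch each leaf belongs to follows the slide layout):

```python
# Leaves of the Titanic tree as (n_no, n_yes, predicted_label).
leaves = [
    (3, 17, "Yes"),    # Sex = F, Age < 35
    (147, 33, "No"),   # Sex = F, Age >= 35
    (70, 130, "Yes"),  # Sex = M, Age < 27
    (580, 20, "No"),   # Sex = M, Age >= 27
]

# A leaf misclassifies every observation whose true class differs
# from the leaf's predicted label.
err_no = sum(no for no, yes, label in leaves if label == "Yes")
err_yes = sum(yes for no, yes, label in leaves if label == "No")
n_no = sum(no for no, yes, label in leaves)    # 800
n_yes = sum(yes for no, yes, label in leaves)  # 200

total_rate = (err_no + err_yes) / (n_no + n_yes)  # 126/1000 = 0.126
yes_rate = err_yes / n_yes                        # 53/200  = 0.265
no_rate = err_no / n_no                           # 73/800  = 0.09125
```

The class-wise rates on the slide are these values rounded to two decimals.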
SLIDE 5

Intuition of Trees: Recursive Partitioning


For simplicity: Restrict to recursive binary splits

SLIDE 6

Fighting overfitting: Cost-complexity pruning


[Figure: training error decreases steadily with model complexity, while test error first decreases and then rises again once the model overfits.]

Overfitting: fitting the training data perfectly might not be good for predicting future data. In practice: use cross-validation.

For trees:
  1. Fit a very detailed model
  2. Prune it using a complexity penalty to optimize cross-validation performance
SLIDE 7

Building Regression Trees 1/2

  • Assume a given partition of the space into regions R1, …, RM
  • Tree model: f(x) = Σm cm · 1{x ∈ Rm}, i.e. a constant cm in each region
  • Goal: minimize the sum of squared residuals

    Σj (yj − f(xj))²

  • Solution: cm is the average of the responses yj of the data points in region Rm
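That the region-wise average minimizes the squared-error criterion is easy to check numerically. A small sketch with made-up data:

```python
# Sum of squared residuals of a constant prediction c on responses ys.
def ssr(ys, c):
    return sum((y - c) ** 2 for y in ys)

ys = [0.5, 0.9, 1.0]
c_hat = sum(ys) / len(ys)  # the mean, 0.8

# The mean beats any perturbed constant:
competitors = [c_hat + d for d in (-0.5, -0.1, 0.1, 0.5)]
best_is_mean = all(ssr(ys, c_hat) < ssr(ys, c) for c in competitors)
```

In fact ssr(ys, c_hat + d) = ssr(ys, c_hat) + n·d², so any d ≠ 0 strictly increases the criterion.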


SLIDE 8

Building Regression Trees 2/2

  • Finding the best binary partition is computationally infeasible
  • Use a greedy approach: for variable j and split point s, define the two generated regions

    R1(j, s) = {X | Xj ≤ s} and R2(j, s) = {X | Xj > s}

  • Choose the splitting variable j and split point s that solve

    min(j, s) [ min(c1) Σ over xi ∈ R1(j,s) of (yi − c1)² + min(c2) Σ over xi ∈ R2(j,s) of (yi − c2)² ]

    The inner minimization is solved by the region averages: ĉ1 = ave(yi | xi ∈ R1(j, s)), ĉ2 = ave(yi | xi ∈ R2(j, s))
  • Repeat the splitting process on each of the two resulting regions
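A minimal sketch of this greedy search in plain Python (illustrative only): for each variable and each candidate split point it computes the summed SSR of the two child regions, using the child means as the optimal constants.

```python
def region_ssr(ys):
    """SSR of a region when predicting its mean (0 for an empty region)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(X, y):
    """Greedy search over variables j and split points s.
    X: list of feature vectors, y: list of responses.
    Returns (j, s, score) for the split minimizing the summed child SSR."""
    best = None
    for j in range(len(X[0])):
        for s in sorted({x[j] for x in X}):
            left = [yi for xi, yi in zip(X, y) if xi[j] <= s]
            right = [yi for xi, yi in zip(X, y) if xi[j] > s]
            if not left or not right:   # skip degenerate splits
                continue
            score = region_ssr(left) + region_ssr(right)
            if best is None or score < best[2]:
                best = (j, s, score)
    return best

# Toy data: a perfect split separates the two response levels at x = 0.4.
X = [[0.2], [0.4], [1.5], [1.8]]
y = [0.7, 0.7, 1.4, 1.4]
j, s, score = best_split(X, y)
```

Using the observed feature values as candidate split points suffices, since the SSR only changes when a data point switches sides.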


SLIDE 9

Pruning Regression Trees

  • Stop splitting when some minimal node size (= number of samples per node) is reached (e.g. 5)
  • Then cut the tree back again (“pruning”) to optimize the cost-complexity criterion

    Dβ(U) = R(U) + β · size(U)

    where R(U) is an “impurity measure” capturing goodness of fit (for regression, the summed squared error) and size(U), the number of leaves, penalizes complexity
  • Tuning parameter β is chosen by cross-validation

SLIDE 10

Classification Trees

  • Regression tree: quality of a split measured by the squared error
  • Classification tree: quality of a split measured by a general “impurity measure”


SLIDE 11

Classification Trees: Impurity Measures

  • Proportion of class k observations in node m: p̂mk = (# class-k observations in node m) / (# observations in node m)
  • Majority class in node m: k(m) = argmax over k of p̂mk
  • Common impurity measures R(U): misclassification error 1 − p̂m,k(m); Gini index Σk p̂mk (1 − p̂mk); cross-entropy −Σk p̂mk log p̂mk
  • For just two classes, with p the proportion in one class: misclassification error min(p, 1 − p); Gini index 2p(1 − p); cross-entropy −p log p − (1 − p) log(1 − p)
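The two-class versions can be written directly as functions of the class proportion p. A short sketch:

```python
import math

def misclassification(p):
    """Misclassification error of a two-class node, p = class proportion."""
    return min(p, 1 - p)

def gini(p):
    """Gini index for two classes: p(1-p) + (1-p)p = 2p(1-p)."""
    return 2 * p * (1 - p)

def entropy(p):
    """Cross-entropy for two classes (0 by convention for pure nodes)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)
```

All three measures vanish for a pure node (p = 0 or 1) and peak at p = 0.5.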


SLIDE 12

Example: Gini Index


Side effects after treatment? 100 persons, 50 with and 50 without side effects. Root node: 50/50 (No/Yes).

Split on sex:
  • F: 30/40, Gini = 0.49
  • M: 20/10, Gini = 0.44
  • Total Gini = 0.49 + 0.44 = 0.93

Split on age:
  • young: 10/50, Gini = 0.28
  • old: 40/0, Gini = 0
  • Total Gini = 0.28 + 0 = 0.28

0.28 < 0.93, therefore: choose the split on age.
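The comparison can be checked in a few lines (counts from the example; note the slide sums the child Gini indices unweighted, and the exact Gini of the 10/50 node is 10/36 ≈ 0.28):

```python
def gini(n_no, n_yes):
    """Two-class Gini index of a node with the given counts."""
    p = n_no / (n_no + n_yes)
    return 2 * p * (1 - p)

# Split on sex: children 30/40 and 20/10.
total_sex = gini(30, 40) + gini(20, 10)   # ~0.49 + ~0.44 = ~0.93
# Split on age: children 10/50 and 40/0.
total_age = gini(10, 50) + gini(40, 0)    # ~0.28 + 0 = ~0.28

choose_age = total_age < total_sex  # True: the age split is purer
```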

SLIDE 13

Classification Trees: Impurity Measures

  • Usually:
  • Gini Index used for building
  • Misclassification error used for pruning


SLIDE 14

Example: Pruning using Misclass. Error (MCE)


Full tree: root 50/50, split on age.
  • young: 10/50, split further on height
    • short: 0/50, MCE = 0
    • tall: 10/0, MCE = 0
  • old: 40/0, MCE = 0

Pruned tree: root 50/50, split on age only.
  • young: 10/50, MCE = 10/60 ≈ 0.167
  • old: 40/0, MCE = 0

With e.g. β = 0.5:
  • Full tree: Dβ(U) = 50 · 0 + 10 · 0 + 40 · 0 + 0.5 · 3 = 1.5
  • Pruned tree: Dβ(U) = 60 · (10/60) + 40 · 0 + 0.5 · 2 = 11.0

The full tree has the smaller Dβ(U), therefore: don’t prune.
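Both Dβ(U) values can be reproduced from the leaf counts alone. A sketch of the criterion as used here (per-leaf size times MCE, summed, plus β times the number of leaves):

```python
def d_beta(leaves, beta):
    """Cost-complexity criterion with misclassification error:
    sum of n_leaf * MCE plus beta * (number of leaves).
    Each leaf is an (n_no, n_yes) pair; n_leaf * MCE equals the
    minority count of the leaf, so no division is needed."""
    fit = sum(min(no, yes) for no, yes in leaves)
    return fit + beta * len(leaves)

beta = 0.5
full = [(0, 50), (10, 0), (40, 0)]   # young&short, young&tall, old
pruned = [(10, 50), (40, 0)]         # young, old

d_full = d_beta(full, beta)      # 0 + 0.5 * 3 = 1.5
d_pruned = d_beta(pruned, beta)  # 10 + 0.5 * 2 = 11.0
keep_split = d_full < d_pruned   # True: don't prune
```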

SLIDE 15

Trees in R

  • Function “rpart” (recursive partitioning) in package “rpart”, together with “print”, “plot”, “text”
  • Function “rpart” automatically prunes using the optimal β based on 10-fold CV
  • Functions “plotcp” and “printcp” for cost-complexity information
  • Function “prune” for manual pruning

SLIDE 16

Concepts to know

  • Trees as recursive partitionings
  • Concept of cost-complexity pruning
  • Impurity measures


SLIDE 17

R functions to know

  • From package “rpart”: “rpart”, “print”, “plot”, “text”, “plotcp”, “printcp”, “prune”