RECSM Summer School: Machine Learning for Social Sciences
Session 2.1: Introduction to Classification and Regression Trees
Reto Wüest
Department of Political Science and International Relations, University of Geneva
Outline

1 The Basics of Decision Trees

2 Regression Trees
   Example: Baseball Salary Data
   Terminology for Trees
   Interpretation of Trees
   Building a Regression Tree
   Tree Pruning

3 Classification Trees
   Building a Classification Tree
Regression Tree Fit to Baseball Salary Data
[Figure: a regression tree fit to the baseball salary data, with splits Years < 4.5 and Hits < 117.5 and terminal-node predictions 5.11, 6.00, and 6.74 (log salary), together with the corresponding partition of the Years-Hits predictor space into regions R1, R2, and R3. Source: James et al. 2013, 304f.]
1 Divide the predictor space (i.e., the set of possible values for X1, ..., Xp) into J distinct and non-overlapping regions R1, ..., RJ.

2 Make the same prediction for every test observation that falls into a given region Rj, namely the mean of the response values of the training observations in Rj.
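As a sketch (not part of the slides), these two steps can be illustrated with scikit-learn's DecisionTreeRegressor; the one-predictor data below are made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: one predictor ("Years"), a step-shaped response
rng = np.random.default_rng(0)
X = rng.uniform(1, 20, size=(200, 1))
y = np.where(X[:, 0] < 4.5, 5.1, 6.5)  # made-up, noise-free log salary

# Step 1: the tree partitions the predictor space (here into 2 regions)
tree = DecisionTreeRegressor(max_depth=1).fit(X, y)

# Step 2: the prediction for any test point is the mean response of the
# training observations in the region the point falls into
mask = X[:, 0] < 4.5
print(tree.predict([[2.0]])[0], y[mask].mean())    # same value
print(tree.predict([[10.0]])[0], y[~mask].mean())  # same value
```

With noise-free data the learned threshold separates the two groups exactly, so each region's prediction equals that group's mean response.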
The goal is to find regions R1, ..., RJ that minimize the RSS,
   Σ_{j=1}^{J} Σ_{i ∈ Rj} (yi − ŷRj)²,
where ŷRj is the mean response of the training observations in region Rj.
For computational reasons, the regions are constructed by recursive binary splitting. This approach is top-down, beginning at the top of the tree and then successively splitting the predictor space; it is greedy, because at each step the best split is made at that particular step (rather than looking ahead and picking a split that will lead to a better tree in some future step).
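The greedy step can be sketched with a minimal split-finder (a hypothetical helper, not from the slides): it scans every feature and threshold and keeps the split that minimizes the RSS now, without looking ahead.

```python
import numpy as np

def best_split(X, y):
    """Greedy search: the (feature, threshold) pair that minimizes the
    RSS of the two resulting half-planes at this step only."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            rss = (((left - left.mean()) ** 2).sum()
                   + ((right - right.mean()) ** 2).sum())
            if rss < best[2]:
                best = (j, t, rss)
    return best  # (feature index, threshold, RSS)

# Made-up data with an obvious break between x = 3 and x = 10
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y = np.array([5.0, 5.2, 5.1, 6.4, 6.6])
j, t, rss = best_split(X, y)
print(j, t)  # splits on feature 0 at threshold 3.0
```

A full tree-builder would recurse on each half-plane; this sketch shows only the single greedy step the slide describes.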
Decision Tree, Predictor Space, and Prediction Surface

[Figure: a decision tree with splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, and X2 ≤ t4; the corresponding partition of the (X1, X2) predictor space into regions R1 to R5; and the resulting piecewise-constant prediction surface for Y. Source: James et al. 2013, 308]
For each value of the tuning parameter α, cost complexity pruning finds the subtree T that minimizes
   Σ_{m=1}^{|T|} Σ_{i: xi ∈ Rm} (yi − ŷRm)² + α |T|,
where |T| is the number of terminal nodes of the tree T.
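In scikit-learn, the sequence of best subtrees indexed by α can be obtained with cost_complexity_pruning_path; a sketch on made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up training data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 0.5 * rng.normal(size=200)

# Grow a large tree T0
big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Effective alphas of weakest-link pruning; each alpha indexes one
# best subtree in the cost-complexity sequence
path = big_tree.cost_complexity_pruning_path(X, y)
print(len(path.ccp_alphas), "candidate values of alpha")
```

The alphas come back in increasing order, from α = 0 (the full tree) up to the value that prunes everything back to the root.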
1 Use recursive binary splitting to grow a large tree on the training data.

2 Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.

3 Use K-fold CV to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
   (a) Repeat Steps 1 and 2 on all but the kth fold of the training data.
   (b) Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
   Average the results for each value of α, and choose α to minimize the average error.

4 Return the subtree from Step 2 that corresponds to the chosen value of α.
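As a sketch of Steps 3 and 4 (made-up data; scikit-learn's ccp_alpha parameter plays the role of α here):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Made-up training data
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = X[:, 0] - 2 * (X[:, 1] > 0) + 0.5 * rng.normal(size=300)

# Steps 1-2: grow a large tree and get its cost-complexity sequence
alphas = (DecisionTreeRegressor(random_state=0).fit(X, y)
          .cost_complexity_pruning_path(X, y).ccp_alphas)
alphas = np.unique(np.clip(alphas, 0.0, None))  # guard tiny negatives

# Step 3: K-fold CV of the MSE as a function of alpha (K = 5)
cv = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"ccp_alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
).fit(X, y)

# Step 4: the subtree corresponding to the chosen alpha
pruned = cv.best_estimator_
print("chosen alpha:", cv.best_params_["ccp_alpha"])
```

GridSearchCV refits the tree on the full training data at the chosen α, which matches Step 4 of the algorithm.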
Fitting and Pruning a Regression Tree on the Baseball Salary Data

[Figure: the unpruned tree, with splits on Years, Hits, Walks, Runs, RBI, and Putouts and twelve terminal nodes; and the pruned tree, with splits Years < 4.5 and Hits < 117.5 and terminal-node predictions 5.11, 6.00, and 6.74. Source: James et al. 2013, 304 & 310]
Fitting and Pruning a Regression Tree on the Baseball Salary Data
[Figure: training, cross-validation, and test mean squared error as a function of tree size. The green curve shows the CV error associated with α and, therefore, with the number of terminal nodes; the orange curve shows the test error; the black curve shows the training error. Source: James et al. 2013, 311]
The CV error is a reasonable approximation of the test error. The CV error takes on its minimum for a three-node tree (see previous slide).
A natural criterion for growing a classification tree is the classification error rate, the fraction of training observations in a region that do not belong to the most common class:
   E = 1 − max_k (p̂mk),
where p̂mk is the proportion of training observations in the mth region that are from the kth class.
In practice, however, classification error is not sufficiently sensitive for tree-growing. A preferable alternative is the Gini index,
   G = Σ_{k=1}^{K} p̂mk (1 − p̂mk),
a measure of total variance across the K classes, which takes on a small value when the node is pure.
Another alternative is the cross-entropy,
   D = − Σ_{k=1}^{K} p̂mk log p̂mk,
which, like the Gini index, takes on a small value when the mth node is pure.
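The node-impurity measures used for classification trees (classification error rate, Gini index, and cross-entropy) can be written as short functions; a sketch, where p denotes the vector of class proportions p̂mk in a node:

```python
import numpy as np

def class_error(p):
    """Classification error rate: 1 - max_k p_k."""
    return 1.0 - float(np.max(p))

def gini(p):
    """Gini index: sum_k p_k (1 - p_k); small when the node is pure."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def entropy(p):
    """Cross-entropy: -sum_k p_k log p_k (with 0 log 0 taken as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# A pure node has zero impurity; a 50/50 node does not
print(gini([1.0, 0.0]), gini([0.5, 0.5]))  # 0.0 0.5
print(entropy([0.5, 0.5]))                 # log(2), about 0.693
```

Note how Gini and entropy still respond to changes in purity within the majority class, which is why they are preferred to the error rate for growing the tree.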
Data for 303 patients with chest pain. The output variable takes the value Yes if a patient has heart disease and No if the patient does not.
[Figure: the unpruned classification tree, with splits on Thal, Ca, MaxHR, RestBP, Chol, ChestPain, Sex, Slope, Age, Oldpeak, and RestECG, and terminal nodes labeled Yes/No. Source: James et al. 2013, 313]
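A classification tree for data of this kind is fit the same way as a regression tree, with an impurity measure guiding the splits; a sketch on synthetic stand-in data (all values invented, the real Heart data are not reproduced here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Heart data (hypothetical values)
rng = np.random.default_rng(3)
ca = rng.integers(0, 4, size=303)       # "Ca": major vessels colored
max_hr = rng.normal(150, 20, size=303)  # "MaxHR"
X = np.column_stack([ca, max_hr])
y = np.where((ca >= 1) & (max_hr < 150), "Yes", "No")  # invented rule

# Splits are chosen by Gini impurity by default; each terminal node
# predicts the majority class among its training observations
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[2, 130], [0, 170]]))
```

Because the invented rule is axis-aligned, a shallow tree recovers it exactly; on real data, the same pruning procedure as for regression trees would be applied, with the error measure swapped in for the MSE.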
Fitting and Pruning a Classification Tree on the Heart Disease Data
[Figure: training, cross-validation, and test error as a function of tree size; and the pruned tree, with splits Thal:a, Ca < 0.5, MaxHR < 161.5, and ChestPain:bc and terminal nodes labeled Yes/No. Source: James et al. 2013, 313]