RECSM Summer School: Machine Learning for Social Sciences
Session 2.1: Introduction to Classification and Regression Trees
Reto Wüest
Department of Political Science and International Relations
University of Geneva
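Figure: Regression tree for the Hitters data with splits Years < 4.5 and Hits < 117.5 and terminal-node predictions 5.11, 6.00, and 6.74, together with the corresponding partition of the predictor space (Years from 1 to 24, Hits from 1 to 238) into the regions R1, R2, and R3. (Source: James et al. 2013, 304f.)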
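1. Divide the predictor space (i.e., the set of possible values for X1, X2, ..., Xp) into J distinct and non-overlapping regions R1, R2, ..., RJ.
2. Make the same prediction for every test observation that falls into region Rj: the mean of the response values of the training observations in Rj.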
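As an illustration of these two steps, the following minimal Python sketch hard-codes the three regions of the Hitters tree shown earlier (splits at Years < 4.5 and Hits < 117.5) and returns the same prediction for every observation that falls into a region. The function name `predict_hitters` is hypothetical, and reading the leaf values 5.11, 6.00, and 6.74 as mean log salaries follows James et al. (2013) rather than anything stated on the slide.

```python
def predict_hitters(years, hits):
    """Return the leaf prediction of the three-region Hitters tree.

    Every observation in the same region gets the same value,
    namely the leaf value reported in the figure.
    """
    if years < 4.5:
        return 5.11  # region R1: fewer than 4.5 years
    if hits < 117.5:
        return 6.00  # region R2: 4.5+ years, fewer than 117.5 hits
    return 6.74      # region R3: 4.5+ years, 117.5+ hits


# Two players in the same region receive identical predictions.
print(predict_hitters(years=2, hits=50))
print(predict_hitters(years=3, hits=200))
```

Both calls land in R1, because the first split depends only on Years; Hits matters only for players with at least 4.5 years.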
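The regions R_1, ..., R_J are chosen such that they minimize the RSS

$$\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2,$$

where $\hat{y}_{R_j}$ is the mean response of the training observations in region $R_j$.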
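A single step of this minimization can be sketched in Python: for one predictor, try every cutpoint between adjacent observed values and keep the one with the smallest total RSS across the two resulting regions. This is a simplified sketch, not the procedure of any particular library; the helper names `rss` and `best_split` and the toy data are made up for illustration.

```python
def rss(values):
    """Residual sum of squares around the mean of `values` (0.0 if empty)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (cutpoint, total_rss) of the best split x < s on one predictor."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cutpoint fits between identical values
        s = (lo + hi) / 2  # candidate cutpoint: midpoint of neighbours
        left = [yi for xi, yi in pairs if xi < s]
        right = [yi for xi, yi in pairs if xi >= s]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (s, total)
    return best


# Hypothetical toy data: the response jumps after the fourth year.
years = [1, 2, 3, 4, 5, 6, 7, 8]
response = [4.9, 5.0, 5.2, 5.1, 6.3, 6.5, 6.4, 6.6]
cutpoint, total_rss = best_split(years, response)
print(cutpoint, round(total_rss, 2))
```

Growing a full tree would repeat this search over all predictors and then recursively within each resulting region.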
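Figure: Recursive binary splitting on a two-predictor example: a tree with splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, and X2 ≤ t4 and terminal nodes R1 to R5; the corresponding partition of the (X1, X2) predictor space into the regions R1 to R5 at the cutpoints t1 to t4; and a plot of the predicted Y over X1 and X2. (Source: James et al. 2013, 308)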
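In cost complexity pruning, each value of the tuning parameter α ≥ 0 corresponds to a subtree T of T0 that minimizes

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha |T|,$$

where $|T|$ is the number of terminal nodes of T and $R_m$ is the region belonging to the m-th terminal node.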
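1. Use recursive binary splitting to grow a large tree T0 on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to T0 in order to obtain a sequence of best subtrees, as a function of α.
3. Use K-fold CV to choose the optimal α. That is, divide the training observations into K folds. For each fold k, repeat Steps 1 and 2 on all but the k-th fold and evaluate the mean squared prediction error on the left-out k-th fold, as a function of α. Average the results for each value of α and pick the α that minimizes the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.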
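Steps 2 to 4 can be sketched in Python under strong simplifications. The sketch assumes the standard cost complexity criterion, training RSS plus α times the number of terminal nodes, and it assumes that the candidate subtrees and the per-fold CV errors have already been computed. All numbers, and the names `penalized_cost`, `best_subtree`, `subtrees`, and `cv_errors`, are hypothetical.

```python
def penalized_cost(rss, n_leaves, alpha):
    """Cost complexity criterion: training RSS plus alpha per terminal node."""
    return rss + alpha * n_leaves


def best_subtree(subtrees, alpha):
    """Pick the candidate subtree with the smallest penalized cost."""
    return min(
        subtrees,
        key=lambda t: penalized_cost(t["rss"], t["leaves"], alpha),
    )


# Hypothetical nested subtrees of T0 with their training RSS.
subtrees = [
    {"leaves": 1, "rss": 100.0},
    {"leaves": 2, "rss": 60.0},
    {"leaves": 3, "rss": 45.0},
    {"leaves": 5, "rss": 40.0},
]

# Step 2: larger alpha favours smaller subtrees.
for alpha in (0.0, 5.0, 20.0, 50.0):
    print(alpha, best_subtree(subtrees, alpha)["leaves"])

# Step 3: hypothetical CV errors, one value per fold (K = 3) for each alpha.
cv_errors = {
    0.0: [0.42, 0.40, 0.44],
    5.0: [0.35, 0.33, 0.34],
    20.0: [0.38, 0.39, 0.37],
    50.0: [0.55, 0.50, 0.60],
}
chosen_alpha = min(
    cv_errors, key=lambda a: sum(cv_errors[a]) / len(cv_errors[a])
)

# Step 4: return the subtree that corresponds to the chosen alpha.
final_tree = best_subtree(subtrees, chosen_alpha)
print(chosen_alpha, final_tree["leaves"])
```

With α = 0 the largest tree wins because nothing penalizes size; as α grows, the penalty outweighs the reduction in RSS and the selected subtree shrinks.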
Figure: Unpruned regression tree for the Hitters data with twelve terminal nodes (splits on Years, RBI, Putouts, Hits, Walks, and Runs), and the pruned tree with three terminal nodes (splits Years < 4.5 and Hits < 117.5; predictions 5.11, 6.00, and 6.74). (Source: James et al. 2013, 304 & 310)
Figure: Mean squared error (y-axis, 0.0 to 1.0) against tree size (x-axis, 2 to 10) for the training, cross-validation, and test sets. The green curve shows the CV error as a function of α and, therefore, of the number of terminal nodes; the orange curve shows the test error; the black curve shows the training error. (Source: James et al. 2013, 311)
Figure: Unpruned classification tree with splits on Thal, Ca, MaxHR, RestBP, Chol, ChestPain, Sex, Slope, Age, Oldpeak, and RestECG, and terminal nodes labeled Yes or No. (Source: James et al. 2013, 313)
Figure: Training, cross-validation, and test error (y-axis, 0.0 to 0.6) against tree size (x-axis, 5 to 15), and the pruned classification tree with six terminal nodes (splits on Thal:a, Ca < 0.5, MaxHR < 161.5, and ChestPain:bc). (Source: James et al. 2013, 313)