 
              . . . . . . . . . . . . . . . . . Lecture 24: Other (Non-linear) Classifjers: Decision Tree Learning, Boosting, and Support Vector Classifjcation Instructor: Prof. Ganesh Ramakrishnan October 20, 2016 . . . . . . . . . . . . . . . . . . . . . . . 1 / 25
. Humidity . . . . . . . . Decision Trees: Cascade of step functions on individual features Outlook Wind Yes . No Yes No Yes rain sunny overcast high normal strong weak October 20, 2016 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 / 25
. . . . . . . . . . . . . . . . . . Use cases for Decision Tree Learning October 20, 2016 . . . . . . . . . . . . . . . . . . . . . . 3 / 25
. Overcast Cool Sunny D9 No Weak High Mild Sunny D8 Yes Strong Normal Cool D7 Weak No Strong Normal Cool Rain D6 Yes Weak Normal Cool Rain D5 Yes Weak Normal Yes Mild Strong October 20, 2016 No Strong High Mild Rain D14 Yes Weak Normal Hot Overcast D13 Yes High D10 Mild Overcast D12 Yes Strong Normal Mild Sunny D11 Yes Weak Normal Mild Rain High Rain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D4 High Yes Weak High Hot Overcast D3 No Strong High Hot Sunny D2 No Weak Hot . Sunny D1 PlayTennis Wind Humidity Temperature Outlook Day The Canonical Playtennis Dataset . . . . . 4 / 25
. . . . . . . . . . . . . . . Decision tree representation Each internal node tests an attribute Each branch corresponds to attribute value Each leaf node assigns a classifjcation How would we represent: M of N October 20, 2016 . . . . . . . . . . . . . . . . . . . . . . . . . 5 / 25 ∧ , ∨ , XOR ( A ∧ B ) ∨ ( C ∧ ¬ D ∧ E )
Answer : That which brings about maximum reduction in impurity Imp S v of the data subset S is a sample of training examples, p C i is proportion of examples belonging to class C i in S p C i log p C i i = expected reduction in entropy due to splitting/sorting on S H S v . Which attribute is best? induced by S v Top-Down Induction of Decision Trees Main loop: . . . . . i Entropy measures impurity of S : H S v . . K i Gain S i Gain S i H S v Values i S v October 20, 2016 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 / 25 . . . . 1 φ i ← the “best” decision attribute for next node 2 Assign φ i as decision attribute for node 3 For each value of φ i , create new descendant of node 4 Sort training examples to leaf nodes 5 If training examples perfectly classifjed, Then STOP, Else iterate over new leaf nodes
S is a sample of training examples, p C i is proportion of examples belonging to class C i in S p C i log p C i i = expected reduction in entropy due to splitting/sorting on S H S v . . . . . . . . Top-Down Induction of Decision Trees Main loop: Which attribute is best? i Entropy measures impurity of S : H S K . Gain S i Gain S i H S v Values i S v October 20, 2016 . . . . . . . . . . . . . . . . . . . 6 / 25 . . . . . . . . . . . . 1 φ i ← the “best” decision attribute for next node 2 Assign φ i as decision attribute for node 3 For each value of φ i , create new descendant of node 4 Sort training examples to leaf nodes 5 If training examples perfectly classifjed, Then STOP, Else iterate over new leaf nodes Answer : That which brings about maximum reduction in impurity Imp ( S v ) of the data subset S v ⊆ D induced by φ i = v .
. . . . . . . . . . . . . . . . Top-Down Induction of Decision Trees Main loop: Which attribute is best? K October 20, 2016 . . . . . . . . . . . . . 6 / 25 . . . . . . . . . . . 1 φ i ← the “best” decision attribute for next node 2 Assign φ i as decision attribute for node 3 For each value of φ i , create new descendant of node 4 Sort training examples to leaf nodes 5 If training examples perfectly classifjed, Then STOP, Else iterate over new leaf nodes Answer : That which brings about maximum reduction in impurity Imp ( S v ) of the data subset S v ⊆ D induced by φ i = v . S is a sample of training examples, p C i is proportion of examples belonging to class C i in S ∑ Entropy measures impurity of S : H ( S ) ≡ − p C i log 2 p C i i =1 Gain ( S , φ i ) = expected reduction in entropy due to splitting/sorting on φ i | S v | Gain ( S , φ i ) ≡ H ( S ) − ∑ | S | H ( S v ) v ∈ Values ( φ i )
. . . . . . . . . . . . Common Impurity Measures (Tutorial 9) . Name Entropy K Gini Index K Class (Min Prob) Error argmin Table: Decision Tree: Impurity measurues These measure the extent of spread /confusion of the probabilities over the classes October 20, 2016 . . . . . . . . . . . . . . . 7 / 25 . . . . . . . . . . . . | S v ij | ( ) ∑ φ s = arg max Imp ( S ) − | S | Imp ( S v ij ) V ( φ i ) ,φ i v ij ∈ V ( φ i ) where S ij ⊆ D is a subset of dataset such that each instance x has attribute value φ i ( x ) = v ij . Imp ( S ) ∑ − Pr ( C i ) • log ( Pr ( C i )) i =1 ∑ Pr ( C i )(1 − Pr ( C i )) i =1 i (1 − Pr ( C i ))
. . . . . . . . . . . . . . . . Alternative impurity measures (Tutorial 9) Figure: Plot of Entropy, Gini Index and Misclassifjcation Accuracy. Source: https://inspirehep.net/record/1225852/files/TPZ_Figures_impurity.png These measure the extent of spread/confusion of the probabilities over the classes October 20, 2016 . . . . . . . . . . . . . . . . . . . . . . . . 8 / 25
Structural Regularization 2 based on Occam’s razor 3 size misclassifications val tree . Use parametric/non-parametric hypothesis tests . . . . . . . Regularization in Decision Tree Learning Premise: Split data into train and validation set 1 1 stop growing when data split not statistically signifjcant 2 . grow full tree, then post-prune tree Minimum Description Length (MDL): minimize size tree Achieved as follows: Do until further pruning is harmful (1) Evaluate impact on validation set of pruning each possible node (plus those below it) (2) Greedily remove the one that most improves validation set accuracy 3 convert tree into a set of rules and post-prune each rule independently (C4.5 Decision Tree Learner) 1 Note: The test set still remains separate 2 Like we discussed in the case of Convolutional Neural Networks 3 Prefer the shortest hypothesis that fjts the data October 20, 2016 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 / 25
size misclassifications val tree . stop growing when data split not statistically signifjcant . . . . . . . . Regularization in Decision Tree Learning Premise: Split data into train and validation set 1 1 grow full tree, then post-prune tree 2 . Minimum Description Length (MDL): minimize size tree Achieved as follows: Do until further pruning is harmful (1) Evaluate impact on validation set of pruning each possible node (plus those below it) (2) Greedily remove the one that most improves validation set accuracy 3 convert tree into a set of rules and post-prune each rule independently (C4.5 Decision Tree Learner) 1 Note: The test set still remains separate 2 Like we discussed in the case of Convolutional Neural Networks 3 Prefer the shortest hypothesis that fjts the data October 20, 2016 . . . . . . . . . . . . . . . . . 9 / 25 . . . . . . . . . . . . . Structural Regularization 2 based on Occam’s razor 3 ⋆ Use parametric/non-parametric hypothesis tests
. 1 . . . . . . . . . Regularization in Decision Tree Learning Premise: Split data into train and validation set 1 stop growing when data split not statistically signifjcant . 2 grow full tree, then post-prune tree (1) Evaluate impact on validation set of pruning each possible node (plus those below it) (2) Greedily remove the one that most improves validation set accuracy 3 convert tree into a set of rules and post-prune each rule independently (C4.5 Decision Tree Learner) 1 Note: The test set still remains separate 2 Like we discussed in the case of Convolutional Neural Networks 3 Prefer the shortest hypothesis that fjts the data October 20, 2016 . . . . . . . . . . . . . . . . . 9 / 25 . . . . . . . . . . . . Structural Regularization 2 based on Occam’s razor 3 ⋆ Use parametric/non-parametric hypothesis tests ⋆ Minimum Description Length (MDL): minimize size ( tree ) + size ( misclassifications val ( tree )) ⋆ Achieved as follows: Do until further pruning is harmful
Recommend
More recommend