Machine Learning (CSE 446): Decision Trees
Sham M. Kakade, University of Washington, 2018


  1. Machine Learning (CSE 446): Decision Trees. Sham M. Kakade, © 2018 University of Washington. cse446-staff@cs.washington.edu

  2. Announcements
  ◮ First assignment posted. Due Thurs, Jan 18th. Remember the late policy (see the website).
  ◮ TA office hours posted. (Please check the website before you go, just in case of changes.)
  ◮ Midterm: Weds, Feb 7.
  ◮ Today: decision trees and the supervised learning setup.

  3. Features (a conceptual point)
  Let φ be (one such) function that maps from inputs x to values. There could be many such functions; sometimes we write Φ(x) for the feature “vector” (it’s really a “tuple”).
  ◮ If φ maps to {0, 1}, we call it a “binary feature (function).”
  ◮ If φ maps to R, we call it a “real-valued feature (function).”
  ◮ φ could map to categorical values,
  ◮ or to ordinal values, integers, ...
  Often, there isn’t much of a difference between x and the tuple of features.
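  As a concrete illustration (not part of the slide), here is what binary and real-valued feature functions over an Auto MPG row might look like in Python; the dictionary keys are assumed field names.

    # A hypothetical Auto MPG example, represented as a dict (assumed field names).
    x = {"cylinders": 8, "displacement": 307.0, "horsepower": 130.0,
         "weight": 3504.0, "acceleration": 12.0, "year": 70, "origin": "america"}

    def phi_binary(x):
        # Binary feature: does the car have 6 or more cylinders?
        return 1 if x["cylinders"] >= 6 else 0

    def phi_real(x):
        # Real-valued feature: the car's weight.
        return x["weight"]

    def Phi(x):
        # The feature "vector" (really a tuple) collects several feature values.
        return (phi_binary(x), phi_real(x), x["origin"])

    print(Phi(x))  # (1, 3504.0, 'america')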

  4. Features
  Data derived from https://archive.ics.uci.edu/ml/datasets/Auto+MPG
  mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin
  Input: a row in this table; a feature mapping corresponds to a column.
  Goal: predict whether mpg is below 23 (“bad” = 0) or above (“good” = 1) given the other attributes (other columns).
  201 “good” and 197 “bad”; guessing the most frequent class (good) gets 50.5% accuracy.
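  A minimal sketch of setting up this prediction task, assuming the data has already been loaded into a list of dicts with a numeric "mpg" field (the loading step and helper names are assumptions, not part of the course materials):

    def make_label(row):
        # "good" (1) if mpg is at or above 23, "bad" (0) otherwise.
        return 1 if row["mpg"] >= 23 else 0

    def majority_baseline_accuracy(rows):
        labels = [make_label(r) for r in rows]
        n_good = sum(labels)
        n_bad = len(labels) - n_good
        # Always guess the more frequent class.
        return max(n_good, n_bad) / len(labels)

    # With 201 "good" and 197 "bad" examples this comes out to 201/398, about 50.5%.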

  5. Let’s build a classifier!
  ◮ Let’s just try to build a classifier. (This is our first learning algorithm.)
  ◮ For now, let’s ignore the “test” set and the question of how to “generalize.”
  ◮ Let’s start by looking at a simple classifier. What is a simple classification rule?

  6. Contingency Table

                     values of feature φ
                      v1    v2   · · ·   vK
      values of y  0
                   1

  (Each cell counts how many examples have that feature value and that label.)

  7. Decision Stump Example
  Contingency table for the maker feature:

      y    america  europe  asia
      0      174      14      9
      1       75      56     70
              ↓        ↓      ↓
              0        1      1

  Each column’s majority label (shown below the arrows) becomes that partition’s prediction.

  8. Decision Stump Example
  As a decision stump: the root holds all the data (197 bad : 201 good) and asks “maker?”; the branches hold america 174:75, europe 14:56, asia 9:70, predicting 0, 1, and 1 respectively.

  9. Decision Stump Example
  Errors made by this stump: 75 + 14 + 9 = 98 (about 25%).
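  A minimal sketch of scoring a decision stump this way, assuming examples are (feature_value, label) pairs for a single categorical feature (the helper names are mine, not the course’s):

    from collections import Counter, defaultdict

    def stump_errors(pairs):
        # pairs: iterable of (feature_value, label). A stump predicts each partition's
        # majority label, so every non-majority example in a partition is an error.
        by_value = defaultdict(Counter)
        for v, y in pairs:
            by_value[v][y] += 1
        return sum(sum(c.values()) - max(c.values()) for c in by_value.values())

    # Reproducing the maker stump: (bad, good) counts per maker value.
    maker_counts = {"america": (174, 75), "europe": (14, 56), "asia": (9, 70)}
    pairs = [(v, 0) for v, (bad, good) in maker_counts.items() for _ in range(bad)] + \
            [(v, 1) for v, (bad, good) in maker_counts.items() for _ in range(good)]
    print(stump_errors(pairs))   # 98, about 25% of 398 examples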

  10. Decision Stump Example
  Splitting the root (197:201) on “cylinders?” gives partitions (bad:good): 3 cylinders 3:1, 4 cylinders 20:184, 5 cylinders 1:2, 6 cylinders 73:11, 8 cylinders 100:3.

  11. Decision Stump Example
  Errors made by this stump: 1 + 20 + 1 + 11 + 3 = 36 (about 9%).
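  Comparing stumps is just comparing error counts; a sketch of picking the best single feature to split on, reusing the hypothetical stump_errors helper above:

    def best_stump(examples, features):
        # examples: list of (feature_dict, label); features: list of feature names.
        # Chooses the single feature whose stump makes the fewest training errors.
        def errors_for(phi):
            return stump_errors((x[phi], y) for x, y in examples)
        return min(features, key=errors_for)

    # On the Auto MPG task this would prefer "cylinders" (36 errors) over "maker" (98 errors).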

  12. Key Idea: Recursion
  A single feature partitions the data. For each partition, we could choose another feature and partition further. Applying this recursively, we can construct a decision tree.

  13. Decision Tree Example
  Start from the cylinders stump (root 197:201; partitions 3:1, 20:184, 1:2, 73:11, 100:3). Split the 4-cylinder partition (20:184) further on “maker?”: america 7:65, europe 10:53, asia 3:66. Error reduction compared to the cylinders stump?

  14. Decision Tree Example
  Instead, split the 6-cylinder partition (73:11) on “maker?”: america 67:7, europe 3:1, asia 3:3. Error reduction compared to the cylinders stump?

  15. Decision Tree Example
  Or split the 6-cylinder partition (73:11) on a binary feature φ: φ = 0 gives 73:1, φ = 1 gives 0:10. Error reduction compared to the cylinders stump?

  16. Decision Tree Example
  Combining two further splits: the 6-cylinder partition split on φ (73:1 and 0:10) and the 4-cylinder partition split on φ′ (2:169 and 18:15). Error reduction compared to the cylinders stump?

  17. Decision Tree: Making a Prediction
  An example tree (n:p gives the counts of label 0 and label 1 at each node):

      root (n:p): split on φ1?
        φ1 = 0: leaf (n0:p0)
        φ1 = 1: (n1:p1), split on φ2?
          φ2 = 0: (n10:p10), split on φ3?
            φ3 = 0: leaf (n100:p100)
            φ3 = 1: leaf (n101:p101)
          φ2 = 1: (n11:p11), split on φ4?
            φ4 = 0: leaf (n110:p110)
            φ4 = 1: leaf (n111:p111)

  18. Decision Tree: Making a Prediction
  (Same tree as on the previous slide.)

      Data: decision tree t, input example x
      Result: predicted class
      if t has the form Leaf(y) then
          return y;
      else
          # t.φ is the feature associated with t;
          # t.child(v) is the subtree for value v;
          return DTreeTest(t.child(t.φ(x)), x);
      end
      Algorithm 1: DTreeTest
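  A minimal runnable sketch of this prediction procedure in Python; the Leaf and Node classes are my own representation, not something defined in the course:

    from dataclasses import dataclass
    from typing import Any, Callable, Dict

    @dataclass
    class Leaf:
        y: Any                       # the class predicted at this leaf

    @dataclass
    class Node:
        phi: Callable[[Any], Any]    # the feature function associated with this node
        children: Dict[Any, Any]     # maps each feature value to a subtree

    def dtree_test(t, x):
        # Reached a leaf: return its class.
        if isinstance(t, Leaf):
            return t.y
        # Otherwise, recurse into the child for this example's value of the node's feature.
        return dtree_test(t.children[t.phi(x)], x)

    # Example: the maker stump from earlier as a one-level tree.
    stump = Node(phi=lambda x: x["maker"],
                 children={"america": Leaf(0), "europe": Leaf(1), "asia": Leaf(1)})
    print(dtree_test(stump, {"maker": "europe"}))   # prints 1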

  19. Decision Tree: Making a Prediction
  Equivalent boolean formulas ([·] is 1 if the condition inside holds, 0 otherwise):

      (φ1 = 0) ⇒ [n0 < p0]
      (φ1 = 1) ∧ (φ2 = 0) ∧ (φ3 = 0) ⇒ [n100 < p100]
      (φ1 = 1) ∧ (φ2 = 0) ∧ (φ3 = 1) ⇒ [n101 < p101]
      (φ1 = 1) ∧ (φ2 = 1) ∧ (φ4 = 0) ⇒ [n110 < p110]
      (φ1 = 1) ∧ (φ2 = 1) ∧ (φ4 = 1) ⇒ [n111 < p111]

  20. Tangent: How Many Formulas?
  ◮ Assume we have D binary features.
  ◮ Each feature could be set to 0, set to 1, or excluded (wildcard/don’t care).
  ◮ 3^D formulas.
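  A quick, purely illustrative sanity check of the 3^D count, enumerating the per-feature choices directly:

    from itertools import product

    D = 4
    # Each of the D binary features is either required to be 0, required to be 1,
    # or excluded from the formula ("don't care").
    formulas = list(product(["=0", "=1", "excluded"], repeat=D))
    print(len(formulas), 3 ** D)  # 81 81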

  21. Building a Decision Tree
  Start at the root, which holds all of the data (n:p).

  22. Building a Decision Tree
  Split the root on φ1, giving partitions n0:p0 (φ1 = 0) and n1:p1 (φ1 = 1). We chose feature φ1. Note that n = n0 + n1 and p = p0 + p1.

  23. Building a Decision Tree
  We chose not to split the left partition (n0:p0) further. Why not?

  24. Building a Decision Tree
  Split the right partition (n1:p1) on φ2, giving n10:p10 and n11:p11.

  25. Building a Decision Tree
  Split the φ2 = 0 partition (n10:p10) on φ3, giving n100:p100 and n101:p101.

  26. Building a Decision Tree
  Split the φ2 = 1 partition (n11:p11) on φ4, giving n110:p110 and n111:p111; this is the tree from the prediction slides.

  27. Greedily Building a Decision Tree (Binary Features)

      Data: data D, feature set Φ
      Result: decision tree
      if all examples in D have the same label y, or Φ is empty and y is the best guess then
          return Leaf(y);
      else
          for each feature φ in Φ do
              partition D into D0 and D1 based on φ-values;
              let mistakes(φ) = (non-majority answers in D0) + (non-majority answers in D1);
          end
          let φ* be the feature with the smallest number of mistakes;
          return Node(φ*, {0 → DTreeTrain(D0, Φ \ {φ*}), 1 → DTreeTrain(D1, Φ \ {φ*})});
      end
      Algorithm 2: DTreeTrain
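  A minimal runnable sketch of this greedy procedure for binary features, reusing the hypothetical Leaf and Node classes from the prediction sketch (here the best feature φ* is used to re-partition the data before recursing, and empty partitions fall back to the parent’s majority label):

    from collections import Counter

    def majority_label(examples):
        # examples: list of (feature_dict, label) pairs; ties broken arbitrarily.
        return Counter(y for _, y in examples).most_common(1)[0][0]

    def mistakes(examples, phi):
        # Training errors a stump on feature phi would make: non-majority answers
        # in the phi = 0 partition plus non-majority answers in the phi = 1 partition.
        err = 0
        for v in (0, 1):
            part = [y for x, y in examples if x[phi] == v]
            if part:
                err += len(part) - max(Counter(part).values())
        return err

    def dtree_train(examples, features, default=0):
        if not examples:                 # empty partition: fall back to a default label
            return Leaf(default)
        maj = majority_label(examples)
        if len({y for _, y in examples}) == 1 or not features:
            return Leaf(maj)
        best = min(features, key=lambda phi: mistakes(examples, phi))
        d0 = [(x, y) for x, y in examples if x[best] == 0]
        d1 = [(x, y) for x, y in examples if x[best] == 1]
        remaining = [phi for phi in features if phi != best]
        return Node(phi=lambda x: x[best],
                    children={0: dtree_train(d0, remaining, maj),
                              1: dtree_train(d1, remaining, maj)})

  Calling dtree_train on a list of (feature_dict, label) pairs with 0/1-valued features produces a tree that the dtree_test sketch above can evaluate.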

  28. What could go wrong?
  ◮ Suppose we split on a variable with many values (e.g., a continuous one like “displacement”)?
  ◮ Suppose we built our tree out to be very deep and wide?

  29. Danger: Overfitting
  [Figure: error rate (lower is better) versus depth of the decision tree; error on the training data keeps decreasing with depth, while error on unseen data eventually rises: overfitting.]

  30. Detecting Overfitting
  If you use all of your data to train, you won’t be able to draw the red curve on the preceding slide!

  31. Detecting Overfitting
  Solution: hold some data out. This held-out data is called development data. More terms:
  ◮ Decision tree max depth is an example of a hyperparameter.
  ◮ “I used my development data to tune the max-depth hyperparameter.”

  32. Detecting Overfitting
  Better yet, hold out two subsets, one for tuning and one for a true, honest-to-science test. Splitting your data into training/development/test requires careful thinking. Starting point: randomly shuffle the examples and use an 80%/10%/10% split.
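  A minimal sketch of that starting point, using a random shuffle and an 80%/10%/10% split (the function name and fixed seed are illustrative choices, not prescribed by the slides):

    import random

    def train_dev_test_split(examples, seed=0):
        # Shuffle a copy so the original order is untouched, then cut at 80% and 90%.
        rng = random.Random(seed)
        shuffled = list(examples)
        rng.shuffle(shuffled)
        n = len(shuffled)
        i, j = int(0.8 * n), int(0.9 * n)
        return shuffled[:i], shuffled[i:j], shuffled[j:]

    train, dev, test = train_dev_test_split(range(398))
    print(len(train), len(dev), len(test))  # 318 40 40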

  33. The “i.i.d.” Supervised Learning Setup
  ◮ Let ℓ be a loss function; ℓ(y, ŷ) is what we lose by outputting ŷ when y is the correct output. For classification: ℓ(y, ŷ) = [y ≠ ŷ], i.e., 1 if the prediction is wrong and 0 if it is right.
  ◮ Let D(x, y) define the true probability of input/output pair (x, y) in “nature.” We never “know” this distribution.
  ◮ The training data D = ⟨(x1, y1), (x2, y2), ..., (xN, yN)⟩ are assumed to be independent and identically distributed (i.i.d.) samples from D.
  ◮ The test data are also assumed to be i.i.d. samples from D.
  ◮ The space of classifiers we’re considering is F; f is a classifier from F, chosen by our learning algorithm.
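  A tiny illustrative sketch of the zero-one loss and the average loss of a classifier over a sample assumed to be drawn i.i.d. from the true distribution (the names are mine):

    def zero_one_loss(y, y_hat):
        # [y != y_hat]: 1 if the prediction is wrong, 0 if it is right.
        return 1 if y != y_hat else 0

    def average_loss(classifier, sample):
        # sample: list of (x, y) pairs assumed i.i.d. from the true distribution D.
        return sum(zero_one_loss(y, classifier(x)) for x, y in sample) / len(sample)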
