Machine Learning (CSE 446): Decision Trees (lecture slides; Sham M Kakade, University of Washington, 2018)


SLIDE 1

Machine Learning (CSE 446): Decision Trees

Sham M Kakade

© 2018 University of Washington, cse446-staff@cs.washington.edu

1 / 18

SLIDE 2

Announcements

◮ First assignment posted. Due Thurs, Jan 18th.

Remember the late policy (see the website).

◮ TA office hours posted.

(Please check website before you go, just in case of changes.)

◮ Midterm: Weds, Feb 7.

◮ Today: Decision Trees, the supervised learning setup.

2 / 18

SLIDE 3

Features (a conceptual point)

Let φ be (one such) function that maps from inputs x to values. There could be many such functions; sometimes we write Φ(x) for the feature “vector” (it’s really a “tuple”).

◮ If φ maps to {0, 1}, we call it a “binary feature (function).”

◮ If φ maps to R, we call it a “real-valued feature (function).”

◮ φ could map to categorical values.

◮ ... or to ordinal values, integers, etc.

Often, there isn’t much of a difference between x and the tuple of features.
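To make this concrete, here is a minimal Python sketch of feature functions for car-like inputs; how x is represented (a dict with fields like "weight" and "cylinders") is an illustrative assumption, not part of the slides.

# A feature is just a function of the input x.
def phi_heavy(x):
    """Binary feature: maps x to {0, 1}."""
    return 1 if x["weight"] > 3000 else 0

def phi_cylinders(x):
    """Ordinal/categorical feature: maps x to a small set of integers."""
    return x["cylinders"]

def Phi(x):
    """The feature 'vector' (really a tuple) collects several feature values."""
    return (phi_heavy(x), phi_cylinders(x), x["origin"])

print(Phi({"weight": 3504, "cylinders": 8, "origin": "america"}))  # (1, 8, 'america')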

3 / 18

SLIDE 4

Features

Data derived from https://archive.ics.uci.edu/ml/datasets/Auto+MPG

mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin

Input: a row in this table; a feature mapping corresponds to a column. Goal: predict whether mpg is below 23 (“bad” = 0) or above (“good” = 1) given the other attributes (other columns). There are 201 “good” and 197 “bad” examples; always guessing the most frequent class (good) gets about 50.5% accuracy.
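As a rough illustration (not the course’s code), one could load such a table and build the binary label like this; the file name "auto-mpg.csv" and a semicolon-separated layout matching the header above are assumptions.

import pandas as pd

cols = ["mpg", "cylinders", "displacement", "horsepower",
        "weight", "acceleration", "year", "origin"]
# Assumed: a semicolon-separated file whose columns match the header above.
df = pd.read_csv("auto-mpg.csv", sep=";", header=0, names=cols)

# Binary target: "good" (1) if mpg is 23 or above, else "bad" (0).
df["good"] = (df["mpg"] >= 23).astype(int)

# Baseline: always guess the most frequent class.
print(df["good"].value_counts(normalize=True).max())  # roughly 0.505 on these data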

4 / 18

SLIDE 5

Let’s build a classifier!

◮ Let’s just try to build a classifier.

(This is our first learning algorithm.)

◮ For now, let’s ignore the “test” set and the question of how to “generalize.”

◮ Let’s start by just looking at a simple classifier.

What is a simple classification rule?

5 / 18

SLIDE 6

Contingency Table

Rows: the values of y (0 and 1). Columns: the values of the feature φ (v1, v2, · · · , vK). Each cell counts the training examples with that combination of feature value and label.
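A small Python sketch of how such a table could be tallied; representing the data as a list of (x, y) pairs and passing a feature function phi are illustrative assumptions.

from collections import Counter

def contingency_table(data, phi):
    """Count examples for each (feature value, label) pair."""
    counts = Counter()
    for x, y in data:
        counts[(phi(x), y)] += 1
    return counts

# Toy usage (made-up examples, not the real dataset):
data = [({"maker": "america"}, 0), ({"maker": "america"}, 1),
        ({"maker": "europe"}, 1), ({"maker": "asia"}, 1)]
table = contingency_table(data, lambda x: x["maker"])
print(table[("america", 0)], table[("europe", 1)])  # 1 1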

6 / 18

SLIDE 7

Decision Stump Example

        maker:   america   europe   asia
  y = 0             174       14       9
  y = 1              75       56      70
                      ↓        ↓       ↓
  predict             0        1       1

7 / 18

SLIDE 8

Decision Stump Example

        maker:   america   europe   asia
  y = 0             174       14       9
  y = 1              75       56      70
                      ↓        ↓       ↓
  predict             0        1       1

root (197:201), split on maker?
    america → 174:75
    europe  → 14:56
    asia    → 9:70

7 / 18

SLIDE 9

Decision Stump Example

        maker:   america   europe   asia
  y = 0             174       14       9
  y = 1              75       56      70
                      ↓        ↓       ↓
  predict             0        1       1

root (197:201), split on maker?
    america → 174:75
    europe  → 14:56
    asia    → 9:70

Errors: 75 + 14 + 9 = 98 (about 25%)

7 / 18

SLIDE 10

Decision Stump Example

root (197:201), split on cylinders?
    3 → 3:1
    4 → 20:184
    5 → 1:2
    6 → 73:11
    8 → 100:3

8 / 18

SLIDE 11

Decision Stump Example

root (197:201), split on cylinders?
    3 → 3:1
    4 → 20:184
    5 → 1:2
    6 → 73:11
    8 → 100:3

Errors: 1 + 20 + 1 + 11 + 3 = 36 (about 9%)
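A minimal sketch of how a stump’s error count could be computed, assuming data is a list of (x, y) pairs and phi a feature function (both names are illustrative):

from collections import Counter, defaultdict

def stump_errors(data, phi):
    """Errors made by splitting once on phi and predicting each partition's majority label."""
    partitions = defaultdict(Counter)
    for x, y in data:
        partitions[phi(x)][y] += 1
    # Non-majority answers in each partition are the mistakes.
    return sum(sum(c.values()) - max(c.values()) for c in partitions.values())

# With the cylinders counts above this would give 1 + 20 + 1 + 11 + 3 = 36.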

8 / 18

SLIDE 12

Key Idea: Recursion

A single feature partitions the data. For each partition, we could choose another feature and partition further. Applying this recursively, we can construct a decision tree.

9 / 18

SLIDE 13

Decision Tree Example

root (197:201), split on cylinders?
    3 → 3:1
    4 → 20:184, split on maker?
          america → 7:65
          europe  → 10:53
          asia    → 3:66
    5 → 1:2
    6 → 73:11
    8 → 100:3

Error reduction compared to the cylinders stump?

10 / 18

SLIDE 14

Decision Tree Example

root (197:201), split on cylinders?
    3 → 3:1
    4 → 20:184
    5 → 1:2
    6 → 73:11, split on maker?
          america → 67:7
          europe  → 3:1
          asia    → 3:3
    8 → 100:3

Error reduction compared to the cylinders stump?

10 / 18

SLIDE 15

Decision Tree Example

root (197:201), split on cylinders?
    3 → 3:1
    4 → 20:184
    5 → 1:2
    6 → 73:11, split on φ?
          1 → 0:10
          0 → 73:1
    8 → 100:3

Error reduction compared to the cylinders stump?

10 / 18

SLIDE 16

Decision Tree Example

root (197:201), split on cylinders?
    3 → 3:1
    4 → 20:184, split on φ'?
          1 → 18:15
          0 → 2:169
    5 → 1:2
    6 → 73:11, split on φ?
          1 → 0:10
          0 → 73:1
    8 → 100:3

Error reduction compared to the cylinders stump?

10 / 18

SLIDE 17

Decision Tree: Making a Prediction

root (n:p), split on φ1?
    0 → leaf (n0:p0)
    1 → (n1:p1), split on φ2?
          0 → (n10:p10), split on φ3?
                0 → leaf (n100:p100)
                1 → leaf (n101:p101)
          1 → (n11:p11), split on φ4?
                0 → leaf (n110:p110)
                1 → leaf (n111:p111)

11 / 18

SLIDE 18

Decision Tree: Making a Prediction

root (n:p), split on φ1?
    0 → leaf (n0:p0)
    1 → (n1:p1), split on φ2?
          0 → (n10:p10), split on φ3?
                0 → leaf (n100:p100)
                1 → leaf (n101:p101)
          1 → (n11:p11), split on φ4?
                0 → leaf (n110:p110)
                1 → leaf (n111:p111)

Data: decision tree t, input example x
Result: predicted class
if t has the form Leaf(y) then
    return y;
else
    # t.φ is the feature associated with t
    # t.child(v) is the subtree for value v
    return DTreeTest(t.child(t.φ(x)), x);
end
Algorithm 1: DTreeTest
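A runnable Python version of the same procedure; the Leaf/Node tuple encoding below is one possible representation, not the course’s.

# Assumed encoding: a leaf is ("leaf", y); an internal node is ("node", phi, children),
# where phi is a feature function and children maps feature values to subtrees.
def dtree_test(t, x):
    """Follow x's feature values down the tree and return the leaf's label."""
    if t[0] == "leaf":
        return t[1]
    _, phi, children = t
    return dtree_test(children[phi(x)], x)

# Tiny usage example: a stump on one binary feature.
stump = ("node", lambda x: x["cylinders"] >= 6, {True: ("leaf", 0), False: ("leaf", 1)})
print(dtree_test(stump, {"cylinders": 4}))  # 1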

11 / 18

SLIDE 19

Decision Tree: Making a Prediction

root (n:p), split on φ1?
    0 → leaf (n0:p0)
    1 → (n1:p1), split on φ2?
          0 → (n10:p10), split on φ3?
                0 → leaf (n100:p100)
                1 → leaf (n101:p101)
          1 → (n11:p11), split on φ4?
                0 → leaf (n110:p110)
                1 → leaf (n111:p111)

Equivalent boolean formulas (each path through the tree is a conjunction of feature tests, and the leaf predicts 1 exactly when its count of negatives is smaller than its count of positives):

(φ1 = 0) ⇒ n0 < p0
(φ1 = 1) ∧ (φ2 = 0) ∧ (φ3 = 0) ⇒ n100 < p100
(φ1 = 1) ∧ (φ2 = 0) ∧ (φ3 = 1) ⇒ n101 < p101
(φ1 = 1) ∧ (φ2 = 1) ∧ (φ4 = 0) ⇒ n110 < p110
(φ1 = 1) ∧ (φ2 = 1) ∧ (φ4 = 1) ⇒ n111 < p111

11 / 18

SLIDE 20

Tangent: How Many Formulas?

◮ Assume we have D binary features.

◮ Each feature could be set to 0, or set to 1, or excluded (wildcard / don’t care).

◮ 3^D formulas.

12 / 18

SLIDE 21

Building a Decision Tree

root (n:p)

13 / 18

SLIDE 22

Building a Decision Tree

root (n:p), split on φ1?
    0 → n0:p0
    1 → n1:p1

We chose feature φ1. Note that n = n0 + n1 and p = p0 + p1.

13 / 18

SLIDE 23

Building a Decision Tree

root (n:p), split on φ1?
    0 → n0:p0
    1 → n1:p1

We chose not to split the left partition. Why not?

13 / 18

SLIDE 24

Building a Decision Tree

root (n:p), split on φ1?
    0 → leaf (n0:p0)
    1 → (n1:p1), split on φ2?
          0 → n10:p10
          1 → n11:p11

13 / 18

SLIDE 25

Building a Decision Tree

root (n:p), split on φ1?
    0 → leaf (n0:p0)
    1 → (n1:p1), split on φ2?
          0 → (n10:p10), split on φ3?
                0 → n100:p100
                1 → n101:p101
          1 → n11:p11

13 / 18

SLIDE 26

Building a Decision Tree

root (n:p), split on φ1?
    0 → leaf (n0:p0)
    1 → (n1:p1), split on φ2?
          0 → (n10:p10), split on φ3?
                0 → n100:p100
                1 → n101:p101
          1 → (n11:p11), split on φ4?
                0 → n110:p110
                1 → n111:p111

13 / 18

SLIDE 27

Greedily Building a Decision Tree (Binary Features)

Data: data D, feature set Φ
Result: decision tree
if all examples in D have the same label y, or Φ is empty and y is the best guess then
    return Leaf(y);
else
    for each feature φ in Φ do
        partition D into D0 and D1 based on φ-values;
        let mistakes(φ) = (non-majority answers in D0) + (non-majority answers in D1);
    end
    let φ∗ be the feature with the smallest number of mistakes;
    return Node(φ∗, {0 → DTreeTrain(D0, Φ \ {φ∗}), 1 → DTreeTrain(D1, Φ \ {φ∗})});
end
Algorithm 2: DTreeTrain
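A runnable Python sketch of the same greedy loop, using the Leaf/Node tuple encoding from the prediction sketch above; details such as tie-breaking and the handling of empty partitions are assumptions.

from collections import Counter

def dtree_train(data, features):
    """Greedy tree construction for binary features (a sketch, not the course's code)."""
    labels = [y for _, y in data]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) <= 1 or not features:
        return ("leaf", majority)

    def mistakes(phi):
        # Non-majority answers summed over the two partitions induced by phi.
        errs = 0
        for v in (0, 1):
            part = [y for x, y in data if phi(x) == v]
            if part:
                errs += len(part) - Counter(part).most_common(1)[0][1]
        return errs

    best = min(features, key=mistakes)
    rest = [phi for phi in features if phi is not best]
    children = {}
    for v in (0, 1):
        subset = [(x, y) for x, y in data if best(x) == v]
        # Empty partitions fall back to the parent's majority label (an assumed choice).
        children[v] = dtree_train(subset, rest) if subset else ("leaf", majority)
    return ("node", best, children)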

14 / 18

SLIDE 28

What could go wrong?

◮ Suppose we split on a variable with many values? (e.g., a continuous one like “displacement”)

◮ Suppose we built out our tree to be very deep and wide?

15 / 18

SLIDE 29

Danger: Overfitting

[Figure: error rate (lower is better) vs. depth of the decision tree. Training-data error keeps decreasing with depth, while error on unseen data eventually rises again: overfitting.]

16 / 18

SLIDE 30

Detecting Overfitting

If you use all of your data to train, you won’t be able to draw the red curve on the preceding slide!

17 / 18

SLIDE 31

Detecting Overfitting

If you use all of your data to train, you won’t be able to draw the red curve on the preceding slide! Solution: hold some out. This data is called development data. More terms:

◮ Decision tree max depth is an example of a hyperparameter.

◮ “I used my development data to tune the max-depth hyperparameter.”

17 / 18

SLIDE 32

Detecting Overfitting

If you use all of your data to train, you won’t be able to draw the red curve on the preceding slide! Solution: hold some out. This data is called development data. More terms:

◮ Decision tree max depth is an example of a hyperparameter.

◮ “I used my development data to tune the max-depth hyperparameter.”

Better yet, hold out two subsets: one for tuning and one for a true, honest-to-science test. Splitting your data into training/development/test requires careful thinking. A starting point (see the sketch below): randomly shuffle the examples, then use an 80%/10%/10% split.
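A minimal sketch of that starting point; the 80/10/10 proportions follow the slide, while everything else (the fixed seed, the function name) is an illustrative choice.

import random

def train_dev_test_split(examples, seed=0):
    """Randomly shuffle, then split into 80% train, 10% development, 10% test."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (examples[:n_train],
            examples[n_train:n_train + n_dev],
            examples[n_train + n_dev:])

# Tune hyperparameters (e.g. max depth) on dev; report accuracy once on test.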

17 / 18

SLIDE 33

The “i.i.d.” Supervised Learning Setup

◮ Let ℓ be a loss function; ℓ(y, ŷ) is what we lose by outputting ŷ when y is the correct output. For classification, the zero-one loss: ℓ(y, ŷ) = 1{y ≠ ŷ}.

◮ Let D(x, y) define the true probability of the input/output pair (x, y), in “nature.” We never “know” this distribution.

◮ The training data D = (x1, y1), (x2, y2), . . . , (xN, yN) are assumed to be independent, identically distributed (i.i.d.) samples from D.

◮ The test data are also assumed to be i.i.d. samples from D.

◮ The space of classifiers we’re considering is F; f is a classifier from F, chosen by our learning algorithm.
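For concreteness, a small sketch of the zero-one loss and the resulting empirical (average) error of a classifier f on a sample; the function names are illustrative.

def zero_one_loss(y, y_hat):
    """ℓ(y, ŷ): 1 if the prediction is wrong, 0 if it is correct."""
    return 1 if y != y_hat else 0

def empirical_error(f, data):
    """Average zero-one loss of classifier f over (x, y) pairs, i.e. 1 - accuracy."""
    return sum(zero_one_loss(y, f(x)) for x, y in data) / len(data)

# e.g. empirical_error(lambda x: dtree_test(tree, x), test_data)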

18 / 18