COMS 4721: Machine Learning for Data Science
Lecture 12, 2/28/2017

Prof. John Paisley
Department of Electrical Engineering & Data Science Institute, Columbia University

DECISION TREES
A decision tree maps input x ∈ Rd to output y using binary decision rules:
◮ Each node in the tree has a splitting rule.
◮ Each leaf node is associated with an output value (outputs can repeat).
Each splitting rule is of the form h(x) = 1{xj > t} for some dimension j of x and threshold t ∈ R. Following these splitting rules along a path to a leaf node gives the prediction. (A one-level tree is called a decision stump.)
[Example tree: split on x1 > 1.7, then on x2 > 2.8, with leaves ŷ = 1, ŷ = 2, ŷ = 3.]
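Written as code, the example tree above is just nested threshold tests (a minimal sketch; which of classes 2 and 3 goes with which x2 branch is assumed here for illustration):

```python
def predict(x):
    """Decision tree from the figure: two splitting rules, three leaf outputs."""
    if x[0] > 1.7:           # root rule: 1{x1 > 1.7}
        if x[1] > 2.8:       # second rule: 1{x2 > 2.8}
            return 2         # leaf output (assignment of 2 vs. 3 is illustrative)
        return 3
    return 1

print(predict([1.2, 3.0]))   # x1 <= 1.7, so the prediction is class 1
```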
Motivation: Partition the space so that data in a region have the same prediction.
Left: Difficult to define a "rule". Right: Easy to define a recursive splitting rule.
→ If we think in terms of trees, we can define a simple rule for partitioning the space.
→ Adding an output dimension to the figure (right), we can see how regression trees can learn a step-function approximation to the data.
Classifying irises using sepal and petal measurements:
◮ x ∈ R2, y ∈ {1, 2, 3}
◮ x1 = ratio of sepal length to width
◮ x2 = ratio of petal length to width

[Figure: scatter plot of petal length/width vs. sepal length/width, colored by class, shown as the decision tree is grown one split at a time.]
The tree grows as: single leaf ŷ = 2  →  split x1 > 1.7 with leaves ŷ = 1, ŷ = 3  →  additional split x2 > 2.8, giving leaves ŷ = 1, ŷ = 2, ŷ = 3.
The basic method for learning trees is a top-down greedy algorithm.
◮ Start with a single leaf node containing all data.
◮ Loop through the following steps:
  ◮ Pick the leaf to split that reduces uncertainty the most.
  ◮ Figure out the ≶ decision rule on one of the dimensions.
  ◮ Stopping rule discussed later.
◮ The label/response of a leaf is the majority vote / average of the data assigned to it.
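A minimal sketch of this greedy procedure (using the Gini uncertainty measure introduced a few slides below; function names are my own, and unlike the loop above this simple version recursively splits every leaf that still improves rather than always choosing the single best leaf):

```python
import numpy as np

def gini(y):
    """Uncertainty u(R) = 1 - sum_k p_k^2 for the labels y falling in a region."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Search every dimension j and threshold t for the split that most reduces uncertainty."""
    n = len(y)
    best = None                                   # (reduction, j, t)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            reduction = gini(y) - (len(left) / n * gini(left) + len(right) / n * gini(right))
            if best is None or reduction > best[0]:
                best = (reduction, j, t)
    return best

def grow_tree(X, y, min_reduction=1e-6, depth=5):
    """Top-down greedy construction: keep splitting while uncertainty can be reduced."""
    split = best_split(X, y)
    if depth == 0 or split is None or split[0] < min_reduction:
        labels, counts = np.unique(y, return_counts=True)
        return labels[np.argmax(counts)]          # leaf output: majority vote
    _, j, t = split
    right = X[:, j] > t
    return (j, t, grow_tree(X[~right], y[~right], min_reduction, depth - 1),
                  grow_tree(X[right],  y[right],  min_reduction, depth - 1))

def tree_predict(node, x):
    """Follow the splitting rules 1{x_j > t} down to a leaf."""
    while isinstance(node, tuple):
        j, t, left_child, right_child = node
        node = right_child if x[j] > t else left_child
    return node
```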
How do we grow a regression tree?
◮ For M regions of the space, R1, . . . , RM, the prediction function is

  f(x) = Σm cm 1{x ∈ Rm}.

  So for a fixed M, we need Rm and cm.
◮ Goal: Try to minimize Σi (yi − f(xi))². (For a fixed partition, the best cm is the average of the yi with xi ∈ Rm.)
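For example, with a hypothetical one-dimensional partition (regions and constants chosen only for illustration), the prediction function is just a sum of indicators:

```python
import numpy as np

# Hypothetical partition of the line into M = 3 regions with constants c_m.
regions = [(-np.inf, 0.3), (0.3, 0.7), (0.7, np.inf)]   # R_1, R_2, R_3 as intervals
c = [1.0, 2.5, 0.5]                                      # c_m = average of y_i in R_m

def f(x):
    """f(x) = sum_m c_m 1{x in R_m}: exactly one indicator is active."""
    return sum(cm * (lo < x <= hi) for (lo, hi), cm in zip(regions, c))

print(f(0.1), f(0.5), f(0.9))   # 1.0 2.5 0.5
```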
◮ Define R−(j, s) = {xi ∈ R | xi(j) ≤ s} and R+(j, s) = {xi ∈ R | xi(j) > s}.
◮ For each dimension j, calculate the best splitting point s for that dimension.
◮ Do this for each region (leaf node). Pick the one that reduces the objective most.
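A sketch of the split search along one dimension under the squared-error objective (the function name and toy data are my own); repeating this over all dimensions j and all current regions gives the greedy split:

```python
import numpy as np

def best_regression_split(x, y):
    """For points (x_i, y_i) in one region and one dimension, find the threshold s
    minimizing squared error after splitting into {x_i <= s} and {x_i > s},
    with c = mean(y) in each part."""
    best_s, best_err = None, np.inf
    for s in np.unique(x)[:-1]:          # splitting at the largest value leaves R+ empty
        left, right = y[x <= s], y[x > s]
        err = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

# Toy example: a step function in one dimension.
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([1.0, 1.1, 0.9, 3.0, 3.1, 2.9])
print(best_regression_split(x, y))       # splits at x = 0.3
```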
For regression: Squared error is a natural way to define the splitting rule.
For classification: We need some measure of how badly a region classifies data and how much it can improve if it is split.
K-class problem: For all x ∈ Rm, let pk be the empirical fraction labeled k. Measures of the quality (uncertainty) of Rm include
◮ Gini index: u(Rm) = 1 − Σk pk²
◮ Entropy: −Σk pk ln pk
◮ These are all maximized when pk is uniform on the K classes in Rm.
◮ They are minimized when pk = 1 for some k (Rm contains only one class).
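A quick numerical check of these two properties (a small sketch with made-up probability vectors):

```python
import numpy as np

def gini(p):
    """1 - sum_k p_k^2"""
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

def entropy(p):
    """-sum_k p_k ln p_k (with 0 ln 0 = 0)"""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

K = 3
print(gini(np.ones(K) / K), entropy(np.ones(K) / K))    # maximal when p_k is uniform
print(gini([1.0, 0.0, 0.0]), entropy([1.0, 0.0, 0.0]))  # 0.0 0.0: a pure region
```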
[Figure: the current tree (x1 > 1.7 with leaves ŷ = 1, ŷ = 3) and the corresponding regions R1, R2 on the iris scatter plot.]

Search R1 and R2 for splitting options.

u(R2) = 1 − (1/101)² − (50/101)² − (50/101)² = 0.5098

Gini improvement from splitting Rm into R−m and R+m:

u(Rm) − [ p_{R−m} · u(R−m) + p_{R+m} · u(R+m) ]

◮ p_{R−m}, p_{R+m}: fraction of the data in Rm that falls in R−m, R+m.
◮ u(R−m), u(R+m): new quality measure in regions R−m and R+m.
Check splits of R2 of the form 1{x1 > t}:
[Plot: reduction in uncertainty as a function of t for t between roughly 1.6 and 3; the vertical axis ranges up to about 0.02.]

Check splits of R2 of the form 1{x2 > t}:
[Plot: reduction in uncertainty as a function of t for t between roughly 2 and 4.5; the vertical axis ranges up to about 0.25.]

The split on x2 gives the larger reduction in uncertainty, so R2 is split with x2 > 2.8, giving the tree x1 > 1.7, then x2 > 2.8, with leaves ŷ = 1, ŷ = 2, ŷ = 3.
Q: When should we stop growing a tree?
A: Uncertainty reduction is not the best way to decide.
Example: Any single split on x1 or x2 for the data at right shows zero reduction in uncertainty. However, we can learn a perfect tree on this data by partitioning it into quadrants.

[Figure: data in the (x1, x2) plane illustrating this.]
Pruning is the method most often used. Grow the tree to a very large size. Then use an algorithm to trim it back. (We won’t cover the algorithm, but mention that it’s non-trivial.)
◮ Training error goes to zero as the size of the tree increases.
◮ Testing error decreases, but then increases because of overfitting.
We briefly present a technique called the bootstrap. This statistical technique is used as the basis for learning ensemble classifiers.
Bootstrap (i.e., resampling) is a technique for improving estimators. Resampling = Sampling from the empirical distribution of the data
◮ We will use resampling to generate many "mediocre" classifiers.
◮ We then discuss how "bagging" these classifiers improves performance.
◮ First, we cover the bootstrap in a simpler context.
◮ A sample of data x1, . . . , xn.
◮ An estimation rule Ŝ of a statistic S. For example, Ŝ = med(x1:n) estimates the true median S of the unknown distribution on x.

Bootstrap procedure: For b = 1, . . . , B, generate a bootstrap sample Bb by drawing n points with replacement from x1, . . . , xn (i.e., from the empirical distribution), and compute Ŝb := Ŝ(Bb).

Then estimate the mean and variance of Ŝ:

μB = (1/B) Σb Ŝb,    σ²B = (1/B) Σb (Ŝb − μB)²
◮ The median of x1, . . . , xn (for x ∈ R) is found by simply sorting the values and taking the middle one, or the average of the two middle ones.
◮ How confident can we be in the estimate median(x1, . . . , xn)?
◮ Find its variance. But how? Answer: By bootstrapping the data.
(Ŝmean is the mean of the median, Ŝvar its variance.)

Ŝmean = (1/B) Σb median(Bb),    Ŝvar = (1/B) Σb (median(Bb) − Ŝmean)²
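A minimal sketch of bootstrapping the median (function names, the choice of B, and the synthetic data are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap(x, estimator, B=2000):
    """Compute S_hat_b = estimator(B_b) on B bootstrap samples drawn with
    replacement from x, then return the bootstrap mean and variance."""
    n = len(x)
    S = np.array([estimator(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    S_mean = S.mean()                      # (1/B) sum_b S_hat_b
    S_var = np.mean((S - S_mean) ** 2)     # (1/B) sum_b (S_hat_b - S_mean)^2
    return S_mean, S_var

x = rng.standard_normal(200)               # a sample of n = 200 points
print(bootstrap(x, np.median))             # bootstrap mean and variance of the median
```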
◮ The procedure is remarkably simple, but has a lot of theory behind it.
Bagging uses the bootstrap for regression or classification: Bagging = Bootstrap aggregation
For b = 1, . . . , B:
◮ Draw a bootstrap sample Bb of the training data and learn a classifier or regression function fb on Bb.

For a new point x0, compute

favg(x0) = (1/B) Σb fb(x0)

◮ For regression, favg(x0) is the prediction.
◮ For classification, view favg(x0) as an average over B votes. Pick the majority.
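A minimal bagging sketch for classification, assuming scikit-learn decision trees as the base learner (any classifier with fit/predict would work; names are my own):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bag_trees(X, y, B=50):
    """Learn one tree per bootstrap sample B_b of the training data."""
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                  # n draws with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bag_predict(trees, x0):
    """f_avg(x0): average the B votes f_b(x0) and take the majority class."""
    x0 = np.asarray(x0).reshape(1, -1)
    votes = np.array([t.predict(x0)[0] for t in trees])
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]
```

For regression one would instead average the B predicted values.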
◮ Binary classification, x ∈ R5.
◮ Note the variation among the bootstrapped trees.
◮ Take-home message: with bagging, each tree doesn't have to be great, just "ok".
◮ Bagging often improves results when the function is non-linear.
[Figure: the original tree (root split x.1 < 0.395) and eleven bootstrapped trees (b = 1, . . . , 11), with root splits such as x.1 < 0.555, x.2 < 0.205, x.3 < 0.985, and x.4 < −1.36, illustrating the variation across bootstrap samples.]
◮ Bagging works on trees because of the bias-variance tradeoff (↑ bias, ↓ variance).
◮ However, the bagged trees are correlated.
◮ In general, when bootstrap samples are correlated, the benefit of bagging decreases.
Modification of bagging where trees are designed to reduce correlation.
◮ A very simple modification.
◮ Still learn a tree on each bootstrap set Bb.
◮ To split a region, only consider a random subset of the dimensions of x ∈ Rd.
Input parameter: a positive integer m with m < d, often m ≈ √d.

For b = 1, . . . , B:
◮ Learn a tree on Bb as before, except that for each split, randomly select m of the d dimensions of x ∈ Rd (a new random subset for each split).
◮ Make the best split restricted to that subset of dimensions.
◮ Bagging for trees: Bag trees learned using the original algorithm.
◮ Random forests: Bag trees learned using the algorithm on this slide.
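In practice this is rarely implemented by hand; a short sketch using scikit-learn's RandomForestClassifier, whose max_features parameter plays the role of m (here m ≈ √d), on the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A few hundred trees; at each split only about sqrt(d) randomly chosen
# dimensions are considered (max_features="sqrt").
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```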
◮ Random forest classification.
◮ Forest size: a few hundred trees.
◮ Notice the tendency to align the decision boundary with the axes.

Test Error: 0.238, Bayes Error: 0.210