Applied Machine Learning
Decision Trees
Siamak Ravanbakhsh
COMP 551 (Fall 2020)
Admin
https://scholarstrikecanada.ca
We have created groups for students who are in time zones very different from EST
you can use these groups to more easily search for teammates
Your input on class format: we may use either format depending on the topic. Questions on Numpy?
Pros: decision trees are interpretable! they are not very sensitive to outliers; they do not need data normalization.
image credit: https://mymodernmet.com/the-30second-rule-a-decision/
Cons: they can easily overfit and they are unstable.
Notation: $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ is our dataset; $x$ and $y$ denote the input and labels.
We use $N$ to denote the size of the dataset and $n$ for indexing.
We use $D$ to denote the number of features (dimensionality of the input space), $x = [x_1, x_2, \ldots, x_D]$.
For classification problems, we use $C$ for the number of classes, $y \in \{1, \ldots, C\}$.
Decision tree model: divide the input space into regions using a tree structure and assign a prediction to each region.
Each region is a set of conditions, e.g., $R_2 = \{x_1 \leq t_1,\ x_2 \leq t_4\}$.
The prediction is $f(x) = \sum_k w_k \mathbb{I}(x \in R_k)$, where $w_k$ is the prediction for region $R_k$: for classification this is a class label, for regression a real scalar or vector.
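To make the prediction rule concrete, here is a minimal numpy sketch of $f(x) = \sum_k w_k \mathbb{I}(x \in R_k)$; the regions, thresholds, and predictions below are made up for illustration, not taken from the slides:

```python
import numpy as np

# Hypothetical regions for 2D inputs: each region is a list of (feature, threshold, direction) tests.
# Direction "<=" means x[feature] <= threshold must hold for x to fall in the region.
regions = [
    {"tests": [(0, 0.5, "<=")],                 "w": 1},   # R1: x1 <= 0.5
    {"tests": [(0, 0.5, ">"), (1, 0.3, "<=")],  "w": 0},   # R2: x1 > 0.5 and x2 <= 0.3
    {"tests": [(0, 0.5, ">"), (1, 0.3, ">")],   "w": 1},   # R3: x1 > 0.5 and x2 > 0.3
]

def in_region(x, region):
    """Check whether point x satisfies every test defining the region."""
    return all((x[d] <= t) if op == "<=" else (x[d] > t) for d, t, op in region["tests"])

def f(x):
    """f(x) = sum_k w_k * I(x in R_k); the regions partition the space, so one term is active."""
    return sum(r["w"] * in_region(x, r) for r in regions)

print(f(np.array([0.2, 0.9])))  # falls in R1 -> predicts 1
print(f(np.array([0.8, 0.1])))  # falls in R2 -> predicts 0
```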
Split the regions $R_1, \ldots, R_K$ successively based on the value of a single variable (a test).
How do we build the regions and the tree?
Next questions: what are all the possible tests? which test do we choose next?
Continuous features: all the values that appear in the dataset can be used as split thresholds.
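A small sketch of enumerating candidate thresholds for one continuous feature; the data values are made up, and using midpoints between consecutive sorted values is one common convention (not prescribed by the slides):

```python
import numpy as np

x_d = np.array([2.3, 5.1, 2.3, 7.8, 5.1, 4.0])   # values of one feature in the dataset
values = np.unique(x_d)                           # sorted unique values: [2.3, 4.0, 5.1, 7.8]
# midpoints between consecutive values give equivalent splits and avoid ties at data points
thresholds = (values[:-1] + values[1:]) / 2       # [3.15, 4.55, 6.45]
print(thresholds)
```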
Categorical features: if a feature can take $C$ values, $x_i \in \{1, \ldots, C\}$, convert that feature into $C$ binary features (one-hot coding), $x_{i,1}, \ldots, x_{i,C} \in \{0, 1\}$, and split based on the value of a binary feature.
Alternatives: a multi-way split (can lead to regions with few datapoints), or binary splits that produce balanced subsets.
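A quick numpy sketch of the one-hot conversion; the feature values are illustrative, and categories are indexed from 0 for convenience:

```python
import numpy as np

x_i = np.array([2, 0, 1, 2, 1])          # a categorical feature with C = 3 values {0, 1, 2}
C = 3
one_hot = np.eye(C, dtype=int)[x_i]      # shape (N, C): columns x_{i,1}, ..., x_{i,C} in {0, 1}
print(one_hot)
# each binary column can now be used for a simple "is category c?" split
```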
Find a decision tree minimizing the following cost function. ML algorithms usually minimize a cost function or maximize an objective function; the cost function specifies "what is a good decision or regression tree?"
Regression cost: first calculate the cost per region, the mean squared error (MSE):
$\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} (y^{(n)} - w_k)^2$
where $N_k$ is the number of instances in region $k$, $w_k$ is the prediction, and $y^{(n)}$ is the truth. For each region we predict the mean of its labels:
$w_k = \text{mean}(y^{(n)} \mid x^{(n)} \in R_k)$
The total cost is the normalized sum over all regions:
$\text{cost}(\mathcal{D}) = \sum_k \frac{N_k}{N}\, \text{cost}(R_k, \mathcal{D})$
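A minimal sketch of the per-region regression prediction and MSE cost; the function name and toy labels are mine, not from the slides:

```python
import numpy as np

def region_prediction_and_cost(y_region):
    """Regression: predict the mean of the labels in the region and return the MSE cost."""
    w_k = y_region.mean()                      # w_k = mean(y^(n) | x^(n) in R_k)
    cost = np.mean((y_region - w_k) ** 2)      # (1/N_k) * sum_n (y^(n) - w_k)^2
    return w_k, cost

w, c = region_prediction_and_cost(np.array([1.0, 2.0, 2.0, 3.0]))
print(w, c)   # 2.0  0.5
```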
Classification cost: for each region we predict the most frequent label:
$w_k = \text{mode}(y^{(n)} \mid x^{(n)} \in R_k)$
Again, calculate the cost per region, the misclassification rate:
$\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k)$
where $N_k$ is the number of instances in region $k$ and $y^{(n)}$ is the truth.
The total cost is the normalized sum:
$\text{cost}(\mathcal{D}) = \sum_k \frac{N_k}{N}\, \text{cost}(R_k, \mathcal{D})$
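A minimal sketch of the per-region classification cost and the normalized total cost; function names and toy labels are illustrative assumptions:

```python
import numpy as np

def region_mode_and_cost(y_region):
    """Classification: predict the most frequent label and return the misclassification rate."""
    labels, counts = np.unique(y_region, return_counts=True)
    w_k = labels[np.argmax(counts)]                 # w_k = mode(y^(n) | x^(n) in R_k)
    cost = np.mean(y_region != w_k)                 # (1/N_k) * sum_n I(y^(n) != w_k)
    return w_k, cost

def total_cost(regions_y):
    """cost(D) = sum_k (N_k / N) * cost(R_k, D), the normalized sum over regions."""
    N = sum(len(y) for y in regions_y)
    return sum(len(y) / N * region_mode_and_cost(y)[1] for y in regions_y)

r1, r2 = np.array([0, 0, 0, 1]), np.array([1, 1, 0, 1])
print(total_cost([r1, r2]))    # (4/8)*(1/4) + (4/8)*(1/4) = 0.25
```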
Problem: it is sometimes possible to build a tree with zero cost: build a large tree with each instance having its own region (overfitting!). For example, use features such as height, eye color, etc. to make a perfect prediction on the training data.
Solution: find a decision tree with at most K tests minimizing the cost function. K tests = K internal nodes in our binary tree = K+1 leaves (regions).
Bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem.
The number of full binary trees with K+1 leaves (regions $R_k$) is the Catalan number $\frac{1}{K+1}\binom{2K}{K}$, which is exponential in K:
1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452, ...
Moreover, we also have a choice of feature $x_d$ for each of the K internal nodes ($D^K$ combinations), and for each feature, different choices of splitting threshold.
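A quick sketch reproducing the Catalan numbers above with Python's math.comb:

```python
from math import comb

def catalan(K):
    """Number of full binary trees with K internal nodes (K+1 leaves): (1/(K+1)) * C(2K, K)."""
    return comb(2 * K, K) // (K + 1)

print([catalan(K) for K in range(10)])   # [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]
```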
Finding the optimal tree is too difficult; instead, use a greedy heuristic to find a good tree: recursively split the regions based on a greedy choice of the next test, and end the recursion if a split is not worth it.

function fit-tree(R_node, D, depth)
    if not worth-splitting(depth, R_node, D)
        return R_node
    else
        R_left, R_right = greedy-test(R_node, D)
        left-set  = fit-tree(R_left, D, depth+1)
        right-set = fit-tree(R_right, D, depth+1)
        return {left-set, right-set}

The final decision tree is a nested list of regions, e.g., {{R_1, R_2}, {R_3, {R_4, R_5}}}.
The split is greedy because it only looks one step ahead; this may not lead to the lowest overall cost.
split-cost = (N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D)

function greedy-test(R_node, D)
    best-cost = +inf
    for each feature d ∈ {1, …, D} and each possible test
        split R_node into R_left, R_right based on the test
        split-cost = (N_left / N_node) cost(R_left, D) + (N_right / N_node) cost(R_right, D)
        if split-cost < best-cost
            best-cost = split-cost
            R*_left, R*_right = R_left, R_right
    return R*_left, R*_right
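A compact Python sketch combining fit-tree and greedy-test for classification, using the misclassification rate as the per-region cost; the names, the stopping thresholds, and the toy data are illustrative choices, not code from the slides:

```python
import numpy as np

def region_cost(y):
    """Misclassification rate of a region that predicts its most frequent label."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / len(y)

def greedy_test(X, y):
    """Try every feature d and every threshold; return the (cost, feature, threshold) of the best split."""
    N, D = X.shape
    best = (np.inf, None, None)
    for d in range(D):
        for t in np.unique(X[:, d])[:-1]:          # splitting at the largest value is vacuous
            left = X[:, d] <= t
            split_cost = (left.sum() / N) * region_cost(y[left]) \
                       + ((~left).sum() / N) * region_cost(y[~left])
            if split_cost < best[0]:
                best = (split_cost, d, t)
    return best

def fit_tree(X, y, depth=0, max_depth=3, min_size=2):
    """Recursively split; stop when the node is pure, too small, or at the depth limit."""
    labels, counts = np.unique(y, return_counts=True)
    majority = labels[np.argmax(counts)]
    if depth == max_depth or len(y) < min_size or region_cost(y) == 0.0:
        return majority                             # leaf: predict the most frequent label
    _, d, t = greedy_test(X, y)
    if d is None:                                   # no useful split was found
        return majority
    left = X[:, d] <= t
    return {"feature": d, "threshold": t,
            "left":  fit_tree(X[left],  y[left],  depth + 1, max_depth, min_size),
            "right": fit_tree(X[~left], y[~left], depth + 1, max_depth, min_size)}

# tiny illustrative dataset: the class is 1 exactly when the first feature exceeds 0.5
X = np.array([[0.1, 1.0], [0.4, 0.2], [0.6, 0.9], [0.9, 0.3]])
y = np.array([0, 0, 1, 1])
print(fit_tree(X, y))   # a single split on feature 0 at threshold 0.4 separates the classes
```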
worth-splitting subroutine: if we stop when $R_\text{node}$ has zero cost, we may overfit.
Heuristics for stopping the splitting:
reached a desired depth
the number of examples in $R_\text{left}$ or $R_\text{right}$ is too small
$w_k$ is a good approximation, the cost is small enough
the reduction in cost by splitting is small:
$\text{cost}(R_\text{node}, \mathcal{D}) - \left( \frac{N_\text{left}}{N_\text{node}} \text{cost}(R_\text{left}, \mathcal{D}) + \frac{N_\text{right}}{N_\text{node}} \text{cost}(R_\text{right}, \mathcal{D}) \right)$
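One way these heuristics could be bundled into the worth-splitting check; the specific thresholds (max_depth, min_size, min_cost, min_gain) are illustrative assumptions, not values from the slides:

```python
def worth_splitting(depth, node_cost, split_cost, n_left, n_right,
                    max_depth=5, min_size=5, min_cost=1e-3, min_gain=1e-3):
    """Return True only if none of the stopping heuristics fires."""
    if depth >= max_depth:                      # reached a desired depth
        return False
    if min(n_left, n_right) < min_size:         # number of examples in R_left or R_right is too small
        return False
    if node_cost <= min_cost:                   # w_k is already a good approximation (cost small enough)
        return False
    if node_cost - split_cost < min_gain:       # reduction in cost by splitting is small
        return False
    return True

print(worth_splitting(depth=2, node_cost=0.30, split_cost=0.10, n_left=12, n_right=9))  # True
```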
image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
Revisiting the classification cost: ideally we want to optimize the misclassification rate
$\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k)$
but this may not be the optimal cost for each step of the greedy heuristic.
Example: a node $R_\text{node}$ with label distribution (.5, 100% of the data) can be split in two ways, where each pair gives (fraction of one class, fraction of the data) in the region:
split A: $R_\text{left}$ (.25, 50%), $R_\text{right}$ (.75, 50%)
split B: $R_\text{left}$ (.33, 75%), $R_\text{right}$ (1, 25%)
Both splits have the same misclassification rate (2/8); however, the second split may be preferable because one region does not need further splitting.
idea: use a measure of the homogeneity of labels in regions
Entropy is the expected amount of information in observing a random variable:
$H(y) = -\sum_{c=1}^{C} p(y = c) \log p(y = c)$
A uniform distribution has the highest entropy: $H(y) = -\sum_{c=1}^{C} \frac{1}{C} \log \frac{1}{C} = \log C$
A deterministic random variable has the lowest entropy: $H(y) = -1 \log(1) = 0$
$-\log p(y = c)$ is the amount of information in observing $c$:
zero information if $p(c) = 1$
less probable events are more informative: $p(c) < p(c') \Rightarrow -\log p(c) > -\log p(c')$
information from two independent events is additive: $-\log(p(c)q(d)) = -\log p(c) - \log q(d)$
Note: it is common to use capital letters for random variables (here, for consistency, we use lower-case).
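A minimal numpy sketch of the entropy of a discrete distribution, reproducing the uniform and deterministic extremes:

```python
import numpy as np

def entropy(p):
    """H(y) = -sum_c p(c) log2 p(c); terms with p(c) = 0 contribute 0 by convention."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p)) + 0.0   # +0.0 avoids printing -0.0

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform: log2(4) = 2.0, the highest
print(entropy([1.0, 0.0, 0.0, 0.0]))       # deterministic: 0.0, the lowest
```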
Mutual information: for two random variables $t, y$,
$I(t, y) = H(y) - H(y \mid t)$
is the amount of information $t$ conveys about $y$: the change in the entropy of $y$ after observing the value of $t$.
Conditional entropy: $H(y \mid t) = \sum_{l=1}^{L} p(t = l)\, H(y \mid t = l)$
Mutual information is symmetric w.r.t. $y$ and $t$:
$I(t, y) = H(y) - H(y \mid t) = H(t) - H(t \mid y) = I(y, t) = \sum_l \sum_c p(y = c, t = l) \log \frac{p(y = c, t = l)}{p(y = c)\, p(t = l)}$
Mutual information is always non-negative, and zero only if $y$ and $t$ are independent.
We care about the distribution of labels in each region:
$p_k(y = c) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} = c)$
Misclassification cost:
$\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k) = 1 - p_k(w_k)$
where $w_k = \arg\max_c p_k(c)$ is the most probable class.
Entropy cost:
$\text{cost}(R_k, \mathcal{D}) = H(y)$
where the entropy is that of the label distribution $p_k$ in region $k$; choose the split with the lowest entropy.
The change in the cost then becomes the mutual information between the test and the labels:
$\text{cost}(R_\text{node}, \mathcal{D}) - \left( \frac{N_\text{left}}{N_\text{node}} \text{cost}(R_\text{left}, \mathcal{D}) + \frac{N_\text{right}}{N_\text{node}} \text{cost}(R_\text{right}, \mathcal{D}) \right)$
$= H(y) - \left( p(x_d \geq t)\, H(y \mid x_d \geq t) + p(x_d < t)\, H(y \mid x_d < t) \right) = I(y,\, x_d \geq t)$
This means that by using entropy as our cost, we are choosing the test which is maximally informative about the labels.
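A sketch of computing this change in the entropy cost (the information gain) for a candidate test $x_d \geq t$; the helper names and toy data are my own:

```python
import numpy as np

def entropy_from_labels(y):
    """Entropy (base 2) of the empirical label distribution in a region."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x_d, y, t):
    """H(y) - [p(x_d >= t) H(y | x_d >= t) + p(x_d < t) H(y | x_d < t)] = I(y, x_d >= t)."""
    mask = x_d >= t
    p_ge = mask.mean()
    gain = entropy_from_labels(y)
    for m, p in [(mask, p_ge), (~mask, 1 - p_ge)]:
        if m.any():
            gain -= p * entropy_from_labels(y[m])
    return gain

x_d = np.array([0.1, 0.4, 0.6, 0.9])
y   = np.array([0, 0, 1, 1])
print(information_gain(x_d, y, 0.5))   # 1.0 bit: this test fully determines the label
```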
Example (continued), comparing the two costs. $R_\text{node}$ is (.5, 100%); split A gives (.25, 50%) and (.75, 50%); split B gives (.33, 75%) and (1, 25%).
Misclassification cost:
split A: (4/8)(1/4) + (4/8)(1/4) = 1/4
split B: (6/8)(1/3) + (2/8)(0) = 1/4
the same costs.
Entropy cost (using base 2 logarithm):
split A: (4/8)(−(1/4) log(1/4) − (3/4) log(3/4)) + (4/8)(−(1/4) log(1/4) − (3/4) log(3/4)) ≈ .81
split B: (6/8)(−(1/3) log(1/3) − (2/3) log(2/3)) + (2/8) · 0 ≈ .69
the lower-cost split: entropy prefers split B.
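These numbers can be checked quickly; a sketch using the binary entropy in base 2:

```python
import numpy as np

def H(p):
    """Binary entropy (base 2) of a region whose positive-class probability is p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# split A: two regions of 4/8 of the data with p = 1/4 and p = 3/4
# split B: 6/8 of the data with p = 1/3 and 2/8 with p = 1
print(4/8 * H(1/4) + 4/8 * H(3/4))                  # ~0.81
print(6/8 * H(1/3) + 2/8 * H(1.0))                  # ~0.69
print(4/8 * 1/4 + 4/8 * 1/4, 6/8 * 1/3 + 2/8 * 0)   # misclassification: 0.25 and 0.25
```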
Summary of the per-region classification costs:
misclassification (error) rate: $\text{cost}(R_k, \mathcal{D}) = \frac{1}{N_k} \sum_{x^{(n)} \in R_k} \mathbb{I}(y^{(n)} \neq w_k) = 1 - p(w_k)$
entropy: $\text{cost}(R_k, \mathcal{D}) = H(y)$
Gini index, another cost for selecting the test in classification: $\text{cost}(R_k, \mathcal{D}) = \sum_{c=1}^{C} p(c)(1 - p(c)) = \sum_{c=1}^{C} p(c) - \sum_{c=1}^{C} p(c)^2 = 1 - \sum_{c=1}^{C} p(c)^2$
The Gini index is the expected error rate: $p(c)$ is the probability of class $c$ and $1 - p(c)$ is the probability of error.
Comparison of the costs of a node when we have 2 classes (see figure).
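A small sketch tabulating the three per-node costs as a function of the probability p of one class in a 2-class node, mirroring the comparison figure:

```python
import numpy as np

p = np.linspace(0.01, 0.99, 5)                     # probability of class 1 in the node
misclass = np.minimum(p, 1 - p)                    # misclassification (error) rate
entropy  = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
gini     = 2 * p * (1 - p)                         # 1 - p^2 - (1-p)^2 = 2 p (1-p) for two classes
for row in zip(p, misclass, entropy, gini):
    print("p=%.2f  err=%.3f  H=%.3f  gini=%.3f" % row)
# all three costs peak at p = 0.5 and vanish as the node becomes pure
```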
Summary:
model: divide the input into axis-aligned regions
cost: mean squared error for regression, misclassification rate for classification
finding the optimal tree is NP-hard: use a greedy heuristic
adjust the cost for the greedy heuristic: using entropy (with its relation to mutual information maximization) or using the Gini index
there are variations on decision tree heuristics; what we discussed is called Classification and Regression Trees (CART)