
CS440/ECE448: Intro to Artificial Intelligence
Lecture 23: Decision Trees
Prof. Julia Hockenmaier, juliahmr@illinois.edu
http://cs.illinois.edu/fa11/cs440


  1. Decision tree learning

     Training data D = {(x_1, y_1), …, (x_N, y_N)}
     – each x_i = (x_1^i, …, x_d^i) is a d-dimensional feature vector
     – each y_i is the target label (class) of the i-th data point

     Training algorithm:
     – Initial tree = the root, corresponding to all items in D.
     – A node is a leaf if all its data items have the same label y.
     – At each non-leaf node: find the attribute with the highest information gain, create a new child for each value of that attribute, and distribute the items accordingly. (A sketch of this procedure follows below.)

     [Figure: an example decision tree. The root splits on drink? (coffee / tea), each branch then splits on milk? (yes / no), and the leaves predict sugar / no sugar.]
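     The following is a minimal sketch of the training algorithm above (ID3-style), assuming categorical attributes and the entropy-based information gain defined on the next slide; the class and function names (DecisionNode, build_tree) and the toy data are illustrative, not from the slides.

```python
import math
from collections import Counter

class DecisionNode:
    def __init__(self, label=None, attribute=None):
        self.label = label          # for leaves: the predicted class
        self.attribute = attribute  # for internal nodes: index of the attribute to split on
        self.children = {}          # attribute value -> child node

def entropy(labels):
    """H(S): entropy of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(data, labels, attr):
    """H(S_parent) minus the weighted entropy of the children after splitting on attr."""
    n = len(labels)
    remainder = 0.0
    for value in set(x[attr] for x in data):
        child = [y for x, y in zip(data, labels) if x[attr] == value]
        remainder += (len(child) / n) * entropy(child)
    return entropy(labels) - remainder

def build_tree(data, labels, attributes):
    # A node is a leaf if all its items have the same label
    # (or if there are no attributes left to split on).
    if len(set(labels)) == 1 or not attributes:
        return DecisionNode(label=Counter(labels).most_common(1)[0][0])
    # Otherwise split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(data, labels, a))
    node = DecisionNode(attribute=best)
    remaining = [a for a in attributes if a != best]
    for value in set(x[best] for x in data):
        subset = [(x, y) for x, y in zip(data, labels) if x[best] == value]
        xs, ys = zip(*subset)
        node.children[value] = build_tree(list(xs), list(ys), remaining)
    return node

# Hypothetical toy data: attribute 0 = drink, attribute 1 = milk.
X = [("coffee", "no"), ("coffee", "yes"), ("tea", "yes"), ("tea", "no")]
y = ["no sugar", "sugar", "sugar", "no sugar"]
tree = build_tree(X, y, attributes=[0, 1])
```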

  2. Information Gain

     How much information are we gaining by splitting node S on attribute A with values V(A)?

     Information required before the split: H(S_parent)
     Information required after the split: Σ_{i ∈ V(A)} P(S_child_i) · H(S_child_i)

     Gain(S_parent, A) = H(S_parent) − Σ_{i ∈ V(A)} (|S_child_i| / |S_parent|) · H(S_child_i)

     Dealing with numerical attributes

     Many attributes are not boolean (0/1) or nominal (classes):
     – the number of times a word appears in a text
     – the RGB values of a pixel
     – height, weight, …

     Splitting on integer or real-valued attributes:
     – Find a split point θ: A_i < θ or A_i ≥ θ? (A sketch of this threshold search follows below.)

     [Figure: "Complete Training Data" shows the full example space of + and − labels; "Our training data" is only a small sample of it.]
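     Below is a minimal sketch of choosing a split point θ for a real-valued attribute by maximizing the information gain defined above, using midpoints between consecutive sorted values as candidates; the function name and example data are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try the midpoints between consecutive sorted values as candidate split
    points theta (A < theta vs. A >= theta) and keep the one with the highest
    information gain."""
    parent_h = entropy(labels)
    n = len(labels)
    pairs = sorted(zip(values, labels))
    best_gain, best_theta = -1.0, None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        theta = (v1 + v2) / 2
        left = [y for v, y in pairs if v < theta]
        right = [y for v, y in pairs if v >= theta]
        remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = parent_h - remainder
        if gain > best_gain:
            best_gain, best_theta = gain, theta
    return best_theta, best_gain

# Hypothetical example: how often a word appears in a document vs. the document's class.
word_counts = [0, 1, 3, 8, 9, 12]
classes = ["-", "-", "-", "+", "+", "+"]
theta, gain = best_threshold(word_counts, classes)   # theta = 5.5, gain = 1.0
```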

  3. The example space and generalization

     We need to label unseen examples accurately. But:
     – The training data is only a very small sample of the example space: we won't have seen all possible combinations of attribute values.
     – The training data may be noisy: some items may have incorrect attributes or labels.

     When does learning stop?
     The tree will grow until all leaf nodes have only one label.

     The effect of noise
     If the training data are noisy, the noise may introduce incorrect splits.

     [Figure: a noisy data set split on A2: true / A2: false. If one false value should have been true, we wouldn't split on A2 at all; if one + label should have been −, we wouldn't have to split any further.]

  4. The effect of incomplete data

     If the training data are incomplete, we may miss important generalizations.

     [Figure: the full example space next to the training data, with the tree learned from each. The tree learned from the sample splits on A2; the tree learned from the full space splits on A4. We should have split on A4, not A2.]

     Overfitting
     The decision tree might overfit the particularities of the training data.

     [Figure: accuracy as a function of tree size. Accuracy on the training data keeps rising as the tree grows, while accuracy on test data eventually drops.]

     Reducing Overfitting in Decision Trees
     – Limit the depth of the tree: no deeper than N (say 3, or 12, or 86; how to choose?).
     – Require a minimum number of examples used to select a split: need at least M (is 10 enough? 20?). Want significance: statistical hypothesis testing can help.
     – BEST: learn an overfit tree and prune it, using validation (held-out) data.

     Pruning a decision tree
     1. Train a decision tree on the training data (keep a part of the training data as unseen validation data).
     2. Prune from the leaves. Simplest method: replace (prune) each non-leaf node whose children are all leaves with its majority label; keep this change if accuracy on the validation set does not degrade. (A sketch of this pruning loop follows below.)
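     A minimal sketch of this pruning step, operating on trees built by the earlier training sketch (it reuses that sketch's DecisionNode fields: label, attribute, children); the single bottom-up pass and the helper names are illustrative assumptions.

```python
from collections import Counter

def predict(node, x):
    # Assumes every attribute value seen at prediction time also appeared in training.
    while node.label is None:
        node = node.children[x[node.attribute]]
    return node.label

def accuracy(tree, data, labels):
    return sum(predict(tree, x) == y for x, y in zip(data, labels)) / len(labels)

def prune(root, node, data, labels, val_data, val_labels):
    """Bottom-up reduced-error pruning: `data`/`labels` are the training items
    that reach `node`; a collapse is kept only if validation accuracy does not degrade."""
    if node.label is not None:                       # already a leaf
        return
    # First prune the subtrees, routing the training items down to each child.
    for value, child in node.children.items():
        subset = [(x, y) for x, y in zip(data, labels) if x[node.attribute] == value]
        if subset:
            xs, ys = zip(*subset)
            prune(root, child, list(xs), list(ys), val_data, val_labels)
    # If all children are now leaves, try replacing this node with its majority label.
    if all(child.label is not None for child in node.children.values()):
        before = accuracy(root, val_data, val_labels)
        saved = (node.attribute, node.children)
        node.label = Counter(labels).most_common(1)[0][0]
        node.attribute, node.children = None, {}
        if accuracy(root, val_data, val_labels) < before:   # undo if accuracy degrades
            node.label = None
            node.attribute, node.children = saved

# Usage (with the earlier sketch): prune(tree, tree, X_train, y_train, X_val, y_val)
```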

  5. Dealing with overfitting

     Overfitting is a very common problem in machine learning.
     Many machine learning algorithms have parameters that can be tuned to improve performance (because they reduce overfitting). We use a held-out data set to set these parameters.

     Bias-variance tradeoff
     – Bias: What kind of hypotheses do we allow? We want hypotheses rich enough to capture the target function f(x).
     – Variance: How much does our learned hypothesis change if we resample the training data? Rich hypotheses (e.g. large decision trees) need more data, which we may not have.

     Reducing variance: bagging
     – Create a new training set by sampling (with replacement) N items from the original data set.
     – Repeat this K times to get K training sets (K is an odd number, e.g. 3, 5, …).
     – Train one classifier on each of the K training sets.
     – Testing: take the majority vote of these K classifiers. (A sketch of bagging follows below.)

     Regression
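     A minimal sketch of bagging as described in the bullets above; the base learner `train_fn` and predictor `predict_fn` are passed in as parameters (e.g. the decision-tree sketch from earlier) and are assumptions, not specified by the slides.

```python
import random
from collections import Counter

def bagging_train(data, labels, train_fn, K=5, seed=0):
    """Train K classifiers, each on a bootstrap sample: N items drawn
    with replacement from the original training set of size N."""
    rng = random.Random(seed)
    n = len(data)
    classifiers = []
    for _ in range(K):
        idx = [rng.randrange(n) for _ in range(n)]          # sample with replacement
        classifiers.append(train_fn([data[i] for i in idx],
                                    [labels[i] for i in idx]))
    return classifiers

def bagging_predict(classifiers, predict_fn, x):
    """Testing: take the majority vote of the K classifiers."""
    votes = [predict_fn(clf, x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]
```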

  6. Polynomial curve fitting

     Given some data {(x, y), …}, with x, y ∈ R, find a function f such that f(x) = y.

     f(x) = w_0 + w_1 x + w_2 x^2 + … + w_m x^m = Σ_{i=0}^{m} w_i x^i

     Task: find the weights w_0, …, w_m that best fit the data. This requires a loss (error) function.

     [Figure: a scatter of data points (x, y) to be fit by a curve.]

     Squared Loss
     We want to find a weight vector w which minimizes the loss (error) on the training data {(x_1, y_1), …, (x_N, y_N)}:

     L(w) = Σ_{i=1}^{N} L_2(f_w(x_i), y_i) = Σ_{i=1}^{N} (y_i − f_w(x_i))^2

     Accounting for model complexity
     We would like to find the simplest polynomial that fits our data, so we need to penalize the degree of the polynomial. We can add a regularization term to the loss which penalizes overly complex functions. (A sketch of the regularized fit follows below.)
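     A minimal sketch of fitting an m-degree polynomial by minimizing the squared loss, with an optional L2 penalty on the weights standing in for the regularization term mentioned above (the slides do not fix a particular regularizer, so the penalty form and the parameter name `lam` are assumptions).

```python
import numpy as np

def fit_polynomial(x, y, m, lam=0.0):
    """Minimize  sum_i (y_i - f_w(x_i))^2  +  lam * ||w||^2  in closed form."""
    X = np.vander(np.asarray(x, dtype=float), N=m + 1, increasing=True)  # columns: 1, x, x^2, ..., x^m
    w = np.linalg.solve(X.T @ X + lam * np.eye(m + 1), X.T @ np.asarray(y, dtype=float))
    return w                                     # w[0] = w_0, ..., w[m] = w_m

def f_w(w, x):
    """Evaluate the fitted polynomial f(x) = w_0 + w_1 x + ... + w_m x^m."""
    return sum(w_i * x ** i for i, w_i in enumerate(w))

# Hypothetical example: noisy samples of a quadratic, fit with a small penalty.
xs = np.linspace(-1.0, 1.0, 20)
ys = 1.0 - 2.0 * xs + 0.5 * xs ** 2 + 0.05 * np.random.randn(20)
w = fit_polynomial(xs, ys, m=2, lam=1e-3)
```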

  7. Linear regression

     Given some data {(x, y), …}, with x, y ∈ R, find a function f(x) = w_1 x + w_0 such that f(x) = y.

     [Figure: a scatter of data points (x, y) to be fit by a line.]

     We need to minimize the loss on the training data: w* = argmin_w Loss(f_w).
     To do so, we set the partial derivatives of Loss(f_w) with respect to w_1 and w_0 to zero.
     This has a closed-form solution (see book). (A sketch of this solution follows below.)
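     A minimal sketch of that closed-form solution for the one-dimensional case, obtained by setting the partial derivatives of the squared loss with respect to w_1 and w_0 to zero; the function name and example data are illustrative.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of f(x) = w_1 * x + w_0."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Setting dLoss/dw1 = 0 and dLoss/dw0 = 0 gives:
    #   w_1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    #   w_0 = y_mean - w_1 * x_mean
    w1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Example: points on the line y = 2x + 1 are recovered exactly.
w0, w1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])   # w0 = 1.0, w1 = 2.0
```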
